This patchset adds an ability to customize the out of memory
handling using bpf.
It focuses on two parts:
1) OOM handling policy,
2) PSI-based OOM invocation.
The idea to use bpf for customizing the OOM handling is not new, but
unlike the previous proposal [1], which augmented the existing task
ranking policy, this one tries to be as generic as possible and
leverage the full power of the modern bpf.
It provides a generic interface which is called before the existing OOM
killer code and allows implementing any policy, e.g. picking a victim
task or memory cgroup or potentially even releasing memory in other
ways, e.g. deleting tmpfs files (the last one might require some
additional but relatively simple changes).
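To illustrate the shape of such a policy, here is a rough sketch of a
bpf_oom handler. The struct_ops and callback names, the kfunc signatures
and the return-value convention below are assumptions made for
illustration only; the selftest added later in the series
("bpf: selftests: BPF OOM struct ops test") is the authoritative example.

// SPDX-License-Identifier: GPL-2.0
/* Sketch only: names and signatures are assumed, not final. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* pid of a pre-selected sacrificial task, set by the userspace loader */
const volatile int victim_pid;

/* task kfuncs plus kfuncs from this series; signatures assumed here */
extern struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
extern void bpf_task_release(struct task_struct *p) __ksym;
extern bool bpf_task_is_oom_victim(struct task_struct *task) __ksym;
extern int bpf_oom_kill_process(struct oom_control *oc,
				struct task_struct *task,
				const char *message) __ksym;

SEC("struct_ops/handle_out_of_memory")
int BPF_PROG(handle_out_of_memory, struct oom_control *oc)
{
	struct task_struct *task;
	int handled = 0;

	task = bpf_task_from_pid(victim_pid);
	if (!task)
		return 0;	/* fall back to the kernel OOM killer */

	/* Don't pile up kills if the chosen task is already dying. */
	if (!bpf_task_is_oom_victim(task) &&
	    !bpf_oom_kill_process(oc, task, "bpf OOM policy"))
		handled = 1;	/* assumed: non-zero means the OOM was handled */

	bpf_task_release(task);
	return handled;
}

SEC(".struct_ops.link")
struct bpf_oom_ops sacrificial_task_policy = {
	.handle_out_of_memory = (void *)handle_out_of_memory,
};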
The past attempt to implement a memory-cgroup-aware policy [2] showed
that there are multiple opinions on what the best policy is. As it's
highly workload-dependent and specific to a concrete way of organizing
workloads, the structure of the cgroup tree, etc., a customizable
bpf-based implementation is preferable to an in-kernel implementation
with a dozen sysctls.
The second part is related to the fundamental question of when to
declare the OOM event. It's a trade-off between the risk of
unnecessary OOM kills and the associated work losses, and the risk of
infinite thrashing and effective soft lockups. In the last few years
several PSI-based userspace solutions were developed (e.g. OOMd [3] or
systemd-oomd [4]). The common idea was to use userspace daemons to
implement custom OOM logic and to rely on PSI monitoring to avoid
stalls. In this scenario the userspace daemon was supposed to handle
the majority of OOMs, while the in-kernel OOM killer worked as a
last-resort measure to guarantee that the system would never deadlock
on memory. But this approach creates additional infrastructure churn:
a userspace OOM daemon is a separate entity which needs to be
deployed, updated, and monitored. A completely different pipeline needs
to be built to monitor both types of OOM events and collect associated
logs. A userspace daemon is more restricted in terms of what data is
available to it. Implementing a daemon which can work reliably under
heavy memory pressure is also tricky.
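To illustrate the in-kernel alternative this series enables, here is a
rough sketch of a PSI-driven OOM trigger. The tracepoint name, the
psi_group fields accessed here and the bpf_out_of_memory() signature
are assumptions for illustration; see the PSI selftest in this series
for the actual interface.

// SPDX-License-Identifier: GPL-2.0
/* Sketch only: tracepoint name, field layout and kfunc signature assumed. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* avg10 threshold for the memory "full" state, in PSI's internal
 * fixed-point units; set by the userspace loader */
const volatile unsigned long mem_full_avg10_threshold;

/* kfunc added by this series; the signature is assumed */
extern int bpf_out_of_memory(struct mem_cgroup *memcg, int order,
			     u64 flags) __ksym;

SEC("tp_btf/psi_avgs_work")
int BPF_PROG(handle_psi_update, struct psi_group *group)
{
	/* avg[PSI_MEM_FULL][0] is the 10s average of the memory "full"
	 * pressure state (index assumed). */
	if (group->avg[PSI_MEM_FULL][0] < mem_full_avg10_threshold)
		return 0;

	/* Sustained memory pressure: declare a system-wide OOM instead of
	 * waiting for an allocation to fail or the system to thrash. */
	bpf_out_of_memory(NULL, 0, 0);
	return 0;
}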
This patchset includes code, tests, and many ideas from JP Kobryn's
patchset, which implemented bpf kfuncs to provide a faster way to
access memcg data [5].
[1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
[2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
[3]: https://github.com/facebookincubator/oomd
[4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[5]: https://lkml.org/lkml/2025/10/15/1554
---
v3:
1) Replaced bpf_psi struct ops with a tracepoint in psi_avgs_work() (Tejun H.)
2) Updated bpf_oom struct ops:
- removed bpf_oom_ctx, passing bpf_struct_ops_link instead (by Alexei S.)
- removed handle_cgroup_offline callback.
3) Updated kfuncs:
- bpf_out_of_memory() dropped constraint_text argument (by Michal H.)
- bpf_oom_kill_process() added check for OOM_SCORE_ADJ_MIN.
4) Libbpf: updated bpf_map__attach_struct_ops_opts to use target_fd. (by Alexei S.)
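   For illustration, a rough sketch of how the updated attach API might be
   used from userspace to attach an OOM policy to a memory cgroup. The
   opts struct name, its target_fd field and the skeleton/map names are
   assumptions based on the note above, not the final interface.

/* Userspace sketch, assuming a skeleton named oom_policy.skel.h with a
 * struct_ops map called sacrificial_task_policy. */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <bpf/libbpf.h>
#include "oom_policy.skel.h"

int attach_oom_policy_to_cgroup(const char *cgroup_path)
{
	struct oom_policy *skel;
	struct bpf_link *link;
	int cgroup_fd;

	cgroup_fd = open(cgroup_path, O_RDONLY);
	if (cgroup_fd < 0)
		return -errno;

	skel = oom_policy__open_and_load();
	if (!skel) {
		close(cgroup_fd);
		return -1;
	}

	/* Attach the struct_ops map to this cgroup instead of system-wide;
	 * the opts type and field names are assumed. */
	LIBBPF_OPTS(bpf_struct_ops_opts, opts, .target_fd = cgroup_fd);
	link = bpf_map__attach_struct_ops_opts(skel->maps.sacrificial_task_policy,
					       &opts);
	if (!link) {
		int err = -errno;

		close(cgroup_fd);
		oom_policy__destroy(skel);
		return err;
	}
	close(cgroup_fd);

	/* Keep skel and link alive for as long as the policy should stay
	 * attached; both are intentionally leaked in this sketch. */
	return 0;
}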
v2:
1) A single bpf_oom can be attached system-wide and a single bpf_oom per memcg.
(by Alexei Starovoitov)
2) Initial support for attaching struct ops to cgroups (Martin KaFai Lau,
Andrii Nakryiko and others)
3) bpf memcontrol kfuncs enhancements and tests (co-developed by JP Kobryn)
4) Many small-ish fixes and cleanups (suggested by Andrew Morton, Suren Baghdasaryan,
Andrii Nakryiko and Kumar Kartikeya Dwivedi)
5) bpf_out_of_memory() now takes u64 flags instead of bool wait_on_oom_lock
(suggested by Kumar Kartikeya Dwivedi)
6) bpf_get_mem_cgroup() got KF_RCU flag (suggested by Kumar Kartikeya Dwivedi)
7) cgroup online and offline callbacks for bpf_psi, cgroup offline for bpf_oom
v1:
1) Both OOM and PSI parts are now implemented using bpf struct ops,
providing a path for future extensions (suggested by Kumar Kartikeya Dwivedi,
Song Liu and Matt Bobrowski)
2) It's possible to create PSI triggers from BPF, no need for an additional
userspace agent. (suggested by Suren Baghdasaryan)
Also there is now a callback for the cgroup release event.
3) Added an ability to block on oom_lock instead of bailing out (suggested by Michal Hocko)
4) Added bpf_task_is_oom_victim (suggested by Michal Hocko)
5) PSI callbacks are scheduled using a separate workqueue (suggested by Suren Baghdasaryan)
RFC:
https://lwn.net/ml/all/20250428033617.3797686-1-roman.gushchin@linux.dev/
JP Kobryn (1):
bpf: selftests: add config for psi
Roman Gushchin (16):
bpf: move bpf_struct_ops_link into bpf.h
bpf: allow attaching struct_ops to cgroups
libbpf: fix return value on memory allocation failure
libbpf: introduce bpf_map__attach_struct_ops_opts()
bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG
mm: introduce BPF OOM struct ops
mm: introduce bpf_oom_kill_process() bpf kfunc
mm: introduce bpf_out_of_memory() BPF kfunc
mm: introduce bpf_task_is_oom_victim() kfunc
bpf: selftests: introduce read_cgroup_file() helper
bpf: selftests: BPF OOM struct ops test
sched: psi: add a trace point to psi_avgs_work()
sched: psi: add cgroup_id field to psi_group structure
bpf: allow calling bpf_out_of_memory() from a PSI tracepoint
bpf: selftests: PSI struct ops test
MAINTAINERS | 2 +
include/linux/bpf-cgroup-defs.h | 6 +
include/linux/bpf-cgroup.h | 16 ++
include/linux/bpf.h | 10 +
include/linux/bpf_oom.h | 46 ++++
include/linux/memcontrol.h | 4 +-
include/linux/oom.h | 13 +
include/linux/psi_types.h | 4 +
include/trace/events/psi.h | 27 ++
include/uapi/linux/bpf.h | 3 +
kernel/bpf/bpf_struct_ops.c | 77 +++++-
kernel/bpf/cgroup.c | 46 ++++
kernel/bpf/verifier.c | 5 +
kernel/sched/psi.c | 7 +
mm/Makefile | 2 +-
mm/bpf_oom.c | 192 +++++++++++++
mm/memcontrol.c | 2 -
mm/oom_kill.c | 202 ++++++++++++++
tools/include/uapi/linux/bpf.h | 1 +
tools/lib/bpf/libbpf.c | 22 +-
tools/lib/bpf/libbpf.h | 14 +
tools/lib/bpf/libbpf.map | 1 +
tools/testing/selftests/bpf/cgroup_helpers.c | 45 +++
tools/testing/selftests/bpf/cgroup_helpers.h | 3 +
tools/testing/selftests/bpf/config | 1 +
.../selftests/bpf/prog_tests/test_oom.c | 256 ++++++++++++++++++
.../selftests/bpf/prog_tests/test_psi.c | 225 +++++++++++++++
tools/testing/selftests/bpf/progs/test_oom.c | 111 ++++++++
tools/testing/selftests/bpf/progs/test_psi.c | 90 ++++++
29 files changed, 1412 insertions(+), 21 deletions(-)
create mode 100644 include/linux/bpf_oom.h
create mode 100644 include/trace/events/psi.h
create mode 100644 mm/bpf_oom.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_psi.c
create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c
--
2.52.0
On Mon 26-01-26 18:44:03, Roman Gushchin wrote:
> This patchset adds an ability to customize the out of memory
> handling using bpf.
>
> It focuses on two parts:
> 1) OOM handling policy,
> 2) PSI-based OOM invocation.
>
> The idea to use bpf for customizing the OOM handling is not new, but
> unlike the previous proposal [1], which augmented the existing task
> ranking policy, this one tries to be as generic as possible and
> leverage the full power of the modern bpf.
>
> It provides a generic interface which is called before the existing OOM
> killer code and allows implementing any policy, e.g. picking a victim
> task or memory cgroup or potentially even releasing memory in other
> ways, e.g. deleting tmpfs files (the last one might require some
> additional but relatively simple changes).

Are you planning to write any high-level documentation on how to use the
existing infrastructure to implement proper/correct OOM handlers with
these generic interfaces?

--
Michal Hocko
SUSE Labs
Michal Hocko <mhocko@suse.com> writes:

> On Mon 26-01-26 18:44:03, Roman Gushchin wrote:
>> This patchset adds an ability to customize the out of memory
>> handling using bpf.
>>
>> It focuses on two parts:
>> 1) OOM handling policy,
>> 2) PSI-based OOM invocation.
[...]
> Are you planning to write any high-level documentation on how to use the
> existing infrastructure to implement proper/correct OOM handlers with
> these generic interfaces?

What do you expect from such a document, can you, please, elaborate?

I'm asking because the main promise of bpf is to provide some sort
of a safe playground, so anyone can experiment with writing their
bpf implementations (like sched_ext schedulers or bpf oom policies)
with minimum risk. Yes, it might work sub-optimally and kill too many
tasks, but it won't crash or deadlock the system.

So in a way I don't want to prescribe the "right way" of writing an oom
handler, but it totally makes sense to provide an example.

As of now the best way to get an example of a bpf handler is to look
into the commit "[PATCH bpf-next v3 12/17] bpf: selftests: BPF OOM
struct ops test".

Another viable idea (also suggested by Andrew Morton) is to develop
a production-ready memcg-aware OOM killer in BPF, put the source code
into the kernel tree and make it loadable by default (obviously under a
config option). Myself or one of my colleagues will try to explore it a
bit later: the tricky part is the by-default loading because there are
no existing precedents.

Thanks!
On Tue 27-01-26 21:01:48, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
> > Are you planning to write any high-level documentation on how to use the
> > existing infrastructure to implement proper/correct OOM handlers with
> > these generic interfaces?
>
> What do you expect from such a document, can you, please, elaborate?

Sure. Essentially an expected structure of the handler. What is the API
it can use, what it has to do and what it must not do. Essentially a
single place you can read and get enough information to start developing
your oom handler.

> I'm asking because the main promise of bpf is to provide some sort
> of a safe playground, so anyone can experiment with writing their
> bpf implementations (like sched_ext schedulers or bpf oom policies)
> with minimum risk. Yes, it might work sub-optimally and kill too many
> tasks, but it won't crash or deadlock the system.
>
> So in a way I don't want to prescribe the "right way" of writing an oom
> handler, but it totally makes sense to provide an example.
>
> As of now the best way to get an example of a bpf handler is to look
> into the commit "[PATCH bpf-next v3 12/17] bpf: selftests: BPF OOM
> struct ops test".

Examples are really great but having a central place to document the
available API is much more helpful IMHO. The generally scattered nature
of BPF hooks makes it really hard to even know what is available to oom
handlers to use.

> Another viable idea (also suggested by Andrew Morton) is to develop
> a production-ready memcg-aware OOM killer in BPF, put the source code
> into the kernel tree and make it loadable by default (obviously under a
> config option). Myself or one of my colleagues will try to explore it a
> bit later: the tricky part is the by-default loading because there are
> no existing precedents.

It certainly makes sense to have a trusted implementation of a commonly
requested oom policy that we couldn't implement due to its specific
nature that doesn't really apply to many users. And have that in the
tree. I am not thrilled about auto-loading because this could be easily
done by simple tooling.

--
Michal Hocko
SUSE Labs
On Wed, Jan 28, 2026 at 12:06 AM Michal Hocko <mhocko@suse.com> wrote:
>
> > Another viable idea (also suggested by Andrew Morton) is to develop
> > a production-ready memcg-aware OOM killer in BPF, put the source code
> > into the kernel tree and make it loadable by default (obviously under a
> > config option). Myself or one of my colleagues will try to explore it a
> > bit later: the tricky part is the by-default loading because there are
> > no existing precedents.
>
> It certainly makes sense to have a trusted implementation of a commonly
> requested oom policy that we couldn't implement due to its specific
> nature that doesn't really apply to many users. And have that in the
> tree. I am not thrilled about auto-loading because this could be easily
> done by simple tooling.

Production-ready bpf-oom program(s) must be part of this set.
We've seen enough attempts to add bpf st_ops in various parts of
the kernel without providing realistic bpf progs that will drive
those hooks. It's great to have flexibility and people need
to have the freedom to develop their own bpf-oom policy, but
the author of the patch set who's advocating for the new
bpf hooks must provide their real production progs and
share their real use case with the community.
It's not cool to hide it.

In that sense enabling auto-loading without requiring an end user
to install the toolchain and build bpf programs/rust/whatnot
is necessary too.

bpf-oom can be a self-contained part of the vmlinux binary.
We already have a mechanism to do that.
This way the end user doesn't need to be a bpf expert, doesn't need
to install clang, build the tools, etc.
They can just enable a fancy new bpf-oom policy and see whether
it's helping their apps or not while knowing nothing about bpf.
On Wed, Jan 28, 2026 at 08:59:34AM -0800, Alexei Starovoitov wrote:
> On Wed, Jan 28, 2026 at 12:06 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> In that sense enabling auto-loading without requiring an end user
> to install the toolchain and build bpf programs/rust/whatnot
> is necessary too.
>
> bpf-oom can be a self-contained part of the vmlinux binary.
> We already have a mechanism to do that.
> This way the end user doesn't need to be a bpf expert, doesn't need
> to install clang, build the tools, etc.
> They can just enable a fancy new bpf-oom policy and see whether
> it's helping their apps or not while knowing nothing about bpf.

For the auto-loading capability you speak of here, I'm currently
interpreting it as being some form of conceptually similar extension
to the BPF preload functionality. Have I understood this correctly? If
so, I feel as though something like this would be a completely
independent stream of work, orthogonal to this BPF OOM feature, right?
Or, is it that you'd like this new auto-loading capability completed as
a hard prerequisite before pulling in the BPF OOM feature?
On Sun, Feb 1, 2026 at 7:26 PM Matt Bobrowski <mattbobrowski@google.com> wrote:
>
> On Wed, Jan 28, 2026 at 08:59:34AM -0800, Alexei Starovoitov wrote:
[...]
> For the auto-loading capability you speak of here, I'm currently
> interpreting it as being some form of conceptually similar extension
> to the BPF preload functionality. Have I understood this correctly? If
> so, I feel as though something like this would be a completely
> independent stream of work, orthogonal to this BPF OOM feature, right?
> Or, is it that you'd like this new auto-loading capability completed as
> a hard prerequisite before pulling in the BPF OOM feature?

It's not a hard prerequisite, but it has to be thought through.
The bpf side is ready today. bpf preload is an example of it.
The oom side needs to design an interface to do it.
A sysctl to enable a builtin bpf-oom policy is probably too rigid.
Maybe a file in cgroupfs? Writing the name of a bpf-oom policy would
trigger load and attach to that cgroup.
Or you can plug it exactly like bpf preload:
when bpffs is mounted all builtin bpf progs get loaded and create
".debug" files in bpffs.

I recall we discussed an ability to create files in bpffs from
tracepoints. This way bpffs can replicate the cgroupfs directory
structure without user space involvement. New cgroup -> new directory
in cgroupfs -> tracepoint -> bpf prog -> new directory in bpffs
-> create an "enable_bpf_oom.debug" file in there.
Writing to that file would trigger a bpf prog that will attach the
bpf-oom prog to that cgroup.
Could be any combination of the above or something else,
but it needs to be designed and agreed upon.

Otherwise, I'm afraid, we will have bpf-oom progs in selftests
and users who want to experiment with them would need the kernel
source code, clang, etc. to try it. We need to lower the barrier to
use it.
On Mon, Feb 02, 2026 at 09:50:05AM -0800, Alexei Starovoitov wrote:
> On Sun, Feb 1, 2026 at 7:26 PM Matt Bobrowski <mattbobrowski@google.com> wrote:
[...]
> It's not a hard prerequisite, but it has to be thought through.
> The bpf side is ready today. bpf preload is an example of it.
> The oom side needs to design an interface to do it.
> A sysctl to enable a builtin bpf-oom policy is probably too rigid.
> Maybe a file in cgroupfs? Writing the name of a bpf-oom policy would
> trigger load and attach to that cgroup.
> Or you can plug it exactly like bpf preload:
> when bpffs is mounted all builtin bpf progs get loaded and create
> ".debug" files in bpffs.
>
> I recall we discussed an ability to create files in bpffs from
> tracepoints. This way bpffs can replicate the cgroupfs directory
> structure without user space involvement. New cgroup -> new directory
> in cgroupfs -> tracepoint -> bpf prog -> new directory in bpffs
> -> create an "enable_bpf_oom.debug" file in there.
> Writing to that file would trigger a bpf prog that will attach the
> bpf-oom prog to that cgroup.
> Could be any combination of the above or something else,
> but it needs to be designed and agreed upon.
>
> Otherwise, I'm afraid, we will have bpf-oom progs in selftests
> and users who want to experiment with them would need the kernel
> source code, clang, etc. to try it. We need to lower the barrier to
> use it.

OK, I see what you're saying here. I'll have a chat to Roman about this
and see what his thoughts are on it.
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Wed, Jan 28, 2026 at 12:06 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> Production-ready bpf-oom program(s) must be part of this set.
> We've seen enough attempts to add bpf st_ops in various parts of
> the kernel without providing realistic bpf progs that will drive
> those hooks. It's great to have flexibility and people need
> to have the freedom to develop their own bpf-oom policy, but
> the author of the patch set who's advocating for the new
> bpf hooks must provide their real production progs and
> share their real use case with the community.
> It's not cool to hide it.

In my case it's not about hiding, it's a chicken-and-egg problem:
the upstream-first model contradicts the idea of including the
production results in the patchset. In other words, I want to settle
the interface before shipping something to prod.

I guess the compromise here is to initially include a bpf oom policy
inspired by what systemd-oomd does and what is proven to work for a
broad range of users. Policies suited for large datacenters can be
added later, but their generic usefulness might also be limited by the
need for proprietary userspace orchestration engines.

> In that sense enabling auto-loading without requiring an end user
> to install the toolchain and build bpf programs/rust/whatnot
> is necessary too.
>
> bpf-oom can be a self-contained part of the vmlinux binary.
> We already have a mechanism to do that.
> This way the end user doesn't need to be a bpf expert, doesn't need
> to install clang, build the tools, etc.
> They can just enable a fancy new bpf-oom policy and see whether
> it's helping their apps or not while knowing nothing about bpf.

Fully agree here. Will implement in v4.

Thanks!
On Wed, Jan 28, 2026 at 10:23 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
[...]
> In my case it's not about hiding, it's a chicken-and-egg problem:
> the upstream-first model contradicts the idea of including the
> production results in the patchset. In other words, I want to settle
> the interface before shipping something to prod.
>
> I guess the compromise here is to initially include a bpf oom policy
> inspired by what systemd-oomd does and what is proven to work for a
> broad range of users.

Works for me.

> Policies suited for large datacenters can be
> added later, but their generic usefulness might also be limited by the
> need for proprietary userspace orchestration engines.

Agree. That's the flexibility part that makes the whole thing worthwhile
and the reason to do such oom policies as bpf progs.
But something tangible and useful needs to be there from day one.
systemd-oomd-like sounds very reasonable.