This patchset adds an ability to customize the out of memory handling using bpf.

It focuses on two parts:
1) OOM handling policy,
2) PSI-based OOM invocation.

The idea to use bpf for customizing the OOM handling is not new, but unlike the previous proposal [1], which augmented the existing task ranking policy, this one tries to be as generic as possible and leverage the full power of modern bpf.

It provides a generic interface which is called before the existing OOM killer code and allows implementing any policy, e.g. picking a victim task or memory cgroup, or potentially even releasing memory in other ways, e.g. deleting tmpfs files (the last one might require some additional but relatively simple changes).

The past attempt to implement a memory-cgroup aware policy [2] showed that there are multiple opinions on what the best policy is. As it's highly workload-dependent and specific to a concrete way of organizing workloads, the structure of the cgroup tree, etc., a customizable bpf-based implementation is preferable over an in-kernel implementation with a dozen on sysctls.

The second part is related to the fundamental question of when to declare the OOM event. It's a trade-off between the risk of unnecessary OOM kills and associated work losses and the risk of infinite thrashing and effective soft lockups. In the last few years several PSI-based userspace solutions were developed (e.g. OOMd [3] or systemd-OOMd [4]). The common idea was to use userspace daemons to implement custom OOM logic as well as rely on PSI monitoring to avoid stalls. In this scenario the userspace daemon was supposed to handle the majority of OOMs, while the in-kernel OOM killer worked as the last-resort measure to guarantee that the system would never deadlock on memory. But this approach creates additional infrastructure churn: a userspace OOM daemon is a separate entity which needs to be deployed, updated and monitored. A completely different pipeline needs to be built to monitor both types of OOM events and collect associated logs. A userspace daemon is more restricted in terms of what data is available to it. Implementing a daemon which can work reliably under heavy memory pressure is also tricky.

[1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
[2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
[3]: https://github.com/facebookincubator/oomd
[4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html

----

v1:
1) Both OOM and PSI parts are now implemented using bpf struct ops, providing a path for future extensions (suggested by Kumar Kartikeya Dwivedi, Song Liu and Matt Bobrowski)
2) It's possible to create PSI triggers from BPF, no need for an additional userspace agent. (suggested by Suren Baghdasaryan)
   Also there is now a callback for the cgroup release event.
3) Added an ability to block on oom_lock instead of bailing out (suggested by Michal Hocko)
4) Added bpf_task_is_oom_victim() (suggested by Michal Hocko)
5) PSI callbacks are scheduled using a separate workqueue (suggested by Suren Baghdasaryan)

RFC:
https://lwn.net/ml/all/20250428033617.3797686-1-roman.gushchin@linux.dev/

Roman Gushchin (14):
  mm: introduce bpf struct ops for OOM handling
  bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
  mm: introduce bpf_oom_kill_process() bpf kfunc
  mm: introduce bpf kfuncs to deal with memcg pointers
  mm: introduce bpf_get_root_mem_cgroup() bpf kfunc
  mm: introduce bpf_out_of_memory() bpf kfunc
  mm: allow specifying custom oom constraint for bpf triggers
  mm: introduce bpf_task_is_oom_victim() kfunc
  bpf: selftests: introduce read_cgroup_file() helper
  bpf: selftests: bpf OOM handler test
  sched: psi: refactor psi_trigger_create()
  sched: psi: implement psi trigger handling using bpf
  sched: psi: implement bpf_psi_create_trigger() kfunc
  bpf: selftests: psi struct ops test

 include/linux/bpf_oom.h                      |  49 +++
 include/linux/bpf_psi.h                      |  71 ++++
 include/linux/memcontrol.h                   |   2 +
 include/linux/oom.h                          |  12 +
 include/linux/psi.h                          |  15 +-
 include/linux/psi_types.h                    |  72 +++-
 kernel/bpf/verifier.c                        |   5 +
 kernel/cgroup/cgroup.c                       |  14 +-
 kernel/sched/bpf_psi.c                       | 337 ++++++++++++++++++
 kernel/sched/build_utility.c                 |   4 +
 kernel/sched/psi.c                           | 130 +++++--
 mm/Makefile                                  |   4 +
 mm/bpf_memcontrol.c                          | 166 +++++++++
 mm/bpf_oom.c                                 | 157 ++++++++
 mm/oom_kill.c                                | 182 +++++++++-
 tools/testing/selftests/bpf/cgroup_helpers.c |  39 ++
 tools/testing/selftests/bpf/cgroup_helpers.h |   2 +
 .../selftests/bpf/prog_tests/test_oom.c      | 229 ++++++++++++
 .../selftests/bpf/prog_tests/test_psi.c      | 224 ++++++++++++
 tools/testing/selftests/bpf/progs/test_oom.c | 108 ++++++
 tools/testing/selftests/bpf/progs/test_psi.c |  76 ++++
 21 files changed, 1845 insertions(+), 53 deletions(-)
 create mode 100644 include/linux/bpf_oom.h
 create mode 100644 include/linux/bpf_psi.h
 create mode 100644 kernel/sched/bpf_psi.c
 create mode 100644 mm/bpf_memcontrol.c
 create mode 100644 mm/bpf_oom.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_psi.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c

--
2.50.1
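To make the interface described above more concrete, here is a minimal sketch of what a bpf OOM policy built on this series might look like. The struct ops type name (bpf_oom_ops), the handle_out_of_memory() callback signature and return convention, and the kfunc prototypes are assumptions for illustration only; just the kfunc names are taken from the patch list above.

```c
// SPDX-License-Identifier: GPL-2.0
/* Sketch of a bpf OOM policy. The struct ops type (bpf_oom_ops), the
 * handle_out_of_memory() callback and the kfunc signatures below are
 * assumptions for illustration, not the actual uAPI of this series.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* Assumed prototypes for the kfuncs named in the patch titles. */
extern int bpf_oom_kill_process(struct oom_control *oc,
				struct task_struct *task,
				const char *reason) __ksym;
extern bool bpf_task_is_oom_victim(struct task_struct *task) __ksym;

SEC("struct_ops/handle_out_of_memory")
int BPF_PROG(handle_out_of_memory, struct oom_control *oc)
{
	struct task_struct *task = bpf_get_current_task_btf();

	/* Trivial example policy: kill the allocating task instead of
	 * scanning for the biggest consumer, unless it is already dying.
	 */
	if (bpf_task_is_oom_victim(task))
		return 0;	/* let the in-kernel OOM killer handle it */

	if (!bpf_oom_kill_process(oc, task, "bpf oom policy"))
		return 1;	/* handled: skip the in-kernel OOM killer */

	return 0;		/* fall back to the default policy */
}

SEC(".struct_ops.link")
struct bpf_oom_ops oom_policy_ops = {
	.handle_out_of_memory = (void *)handle_out_of_memory,
};
```

From userspace such a handler would presumably be attached like any other struct ops program, e.g. via libbpf's bpf_map__attach_struct_ops().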
On Mon, Aug 18, 2025 at 10:01:22AM -0700, Roman Gushchin wrote:
> This patchset adds an ability to customize the out of memory
> handling using bpf.
>
> It focuses on two parts:
> 1) OOM handling policy,
> 2) PSI-based OOM invocation.
>
> The idea to use bpf for customizing the OOM handling is not new, but
> unlike the previous proposal [1], which augmented the existing task
> ranking policy, this one tries to be as generic as possible and
> leverage the full power of the modern bpf.
>
> It provides a generic interface which is called before the existing OOM
> killer code and allows implementing any policy, e.g. picking a victim
> task or memory cgroup or potentially even releasing memory in other
> ways, e.g. deleting tmpfs files (the last one might require some
> additional but relatively simple changes).

The releasing memory part is really interesting and useful. I can see
much more reliable and targetted oom reaping with this approach.

>
> The past attempt to implement memory-cgroup aware policy [2] showed
> that there are multiple opinions on what the best policy is. As it's
> highly workload-dependent and specific to a concrete way of organizing
> workloads, the structure of the cgroup tree etc,

and user space policies like Google has very clear priorities among
concurrently running workloads while many other users do not.

> a customizable
> bpf-based implementation is preferable over a in-kernel implementation
> with a dozen on sysctls.

+1

>
> The second part is related to the fundamental question on when to
> declare the OOM event. It's a trade-off between the risk of
> unnecessary OOM kills and associated work losses and the risk of
> infinite trashing and effective soft lockups. In the last few years
> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> systemd-OOMd [4]

and Android's LMKD (https://source.android.com/docs/core/perf/lmkd) uses
PSI too.

> ). The common idea was to use userspace daemons to
> implement custom OOM logic as well as rely on PSI monitoring to avoid
> stalls. In this scenario the userspace daemon was supposed to handle
> the majority of OOMs, while the in-kernel OOM killer worked as the
> last resort measure to guarantee that the system would never deadlock
> on the memory. But this approach creates additional infrastructure
> churn: userspace OOM daemon is a separate entity which needs to be
> deployed, updated, monitored. A completely different pipeline needs to
> be built to monitor both types of OOM events and collect associated
> logs. A userspace daemon is more restricted in terms on what data is
> available to it. Implementing a daemon which can work reliably under a
> heavy memory pressure in the system is also tricky.

Thanks for raising this and it is really challenging on very aggressive
overcommitted system. The userspace oom-killer needs cpu (or scheduling)
and memory guarantees as it needs to run and collect stats to decide who
to kill. Even with that, it can still get stuck in some global kernel
locks (I remember at Google I have seen their userspace oom-killer which
was a thread in borglet stuck on cgroup mutex or kernfs lock or
something). Anyways I see a lot of potential of this BPF based
oom-killer.

Orthogonally I am wondering if we can enable actions other than killing.
For example some workloads might prefer to get frozen or migrated away
instead of being killed.
Shakeel Butt <shakeel.butt@linux.dev> writes:

> On Mon, Aug 18, 2025 at 10:01:22AM -0700, Roman Gushchin wrote:
>> This patchset adds an ability to customize the out of memory
>> handling using bpf.
>>
>> It focuses on two parts:
>> 1) OOM handling policy,
>> 2) PSI-based OOM invocation.
>>
>> The idea to use bpf for customizing the OOM handling is not new, but
>> unlike the previous proposal [1], which augmented the existing task
>> ranking policy, this one tries to be as generic as possible and
>> leverage the full power of the modern bpf.
>>
>> It provides a generic interface which is called before the existing OOM
>> killer code and allows implementing any policy, e.g. picking a victim
>> task or memory cgroup or potentially even releasing memory in other
>> ways, e.g. deleting tmpfs files (the last one might require some
>> additional but relatively simple changes).
>
> The releasing memory part is really interesting and useful. I can see
> much more reliable and targetted oom reaping with this approach.
>
>>
>> The past attempt to implement memory-cgroup aware policy [2] showed
>> that there are multiple opinions on what the best policy is. As it's
>> highly workload-dependent and specific to a concrete way of organizing
>> workloads, the structure of the cgroup tree etc,
>
> and user space policies like Google has very clear priorities among
> concurrently running workloads while many other users do not.
>
>> a customizable
>> bpf-based implementation is preferable over a in-kernel implementation
>> with a dozen on sysctls.
>
> +1
>
>>
>> The second part is related to the fundamental question on when to
>> declare the OOM event. It's a trade-off between the risk of
>> unnecessary OOM kills and associated work losses and the risk of
>> infinite trashing and effective soft lockups. In the last few years
>> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
>> systemd-OOMd [4]
>
> and Android's LMKD (https://source.android.com/docs/core/perf/lmkd) uses
> PSI too.
>
>> ). The common idea was to use userspace daemons to
>> implement custom OOM logic as well as rely on PSI monitoring to avoid
>> stalls. In this scenario the userspace daemon was supposed to handle
>> the majority of OOMs, while the in-kernel OOM killer worked as the
>> last resort measure to guarantee that the system would never deadlock
>> on the memory. But this approach creates additional infrastructure
>> churn: userspace OOM daemon is a separate entity which needs to be
>> deployed, updated, monitored. A completely different pipeline needs to
>> be built to monitor both types of OOM events and collect associated
>> logs. A userspace daemon is more restricted in terms on what data is
>> available to it. Implementing a daemon which can work reliably under a
>> heavy memory pressure in the system is also tricky.
>
> Thanks for raising this and it is really challenging on very aggressive
> overcommitted system. The userspace oom-killer needs cpu (or scheduling)
> and memory guarantees as it needs to run and collect stats to decide who
> to kill. Even with that, it can still get stuck in some global kernel
> locks (I remember at Google I have seen their userspace oom-killer which
> was a thread in borglet stuck on cgroup mutex or kernfs lock or
> something). Anyways I see a lot of potential of this BPF based
> oom-killer.
>
> Orthogonally I am wondering if we can enable actions other than killing.
> For example some workloads might prefer to get frozen or migrated away
> instead of being killed.
Absolutely, PSI event handling in the kernel (via BPF) opens a broad
range of possibilities, e.g. we can tune cgroup knobs, freeze/unfreeze
tasks, remove tmpfs files, promote/demote memory to other tiers, etc.
I was also thinking about tuning the readahead based on the memory
pressure.

Thanks!
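A minimal sketch of what such an in-kernel PSI handler could look like follows, assuming a bpf_psi_ops struct ops with init/handle_psi_event callbacks and an assumed bpf_psi_create_trigger() signature; only the kfunc name comes from the patch list, everything else here is illustrative.

```c
// SPDX-License-Identifier: GPL-2.0
/* Hypothetical PSI handler sketch. The bpf_psi_ops struct ops, its
 * callbacks, the struct bpf_psi context and the bpf_psi_create_trigger()
 * signature are assumptions; only the kfunc name appears in the series.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* Assumed prototype: create a trigger for resource <res> on cgroup
 * <cg_id>, firing when stall time exceeds <threshold_us> within
 * <window_us>.
 */
extern int bpf_psi_create_trigger(struct bpf_psi *psi, u64 cg_id,
				  enum psi_res res, u32 threshold_us,
				  u32 window_us) __ksym;

SEC("struct_ops/init")
int BPF_PROG(psi_init, struct bpf_psi *psi)
{
	/* 150ms of memory stall within a 1s window on the root cgroup
	 * (cgroup id 1).
	 */
	return bpf_psi_create_trigger(psi, 1, PSI_MEM, 150000, 1000000);
}

SEC("struct_ops/handle_psi_event")
void BPF_PROG(handle_psi_event, struct psi_trigger *t)
{
	/* React to sustained pressure here: e.g. declare an OOM via
	 * bpf_out_of_memory(), freeze a cgroup, or adjust its limits.
	 */
}

SEC(".struct_ops.link")
struct bpf_psi_ops psi_handler_ops = {
	.init = (void *)psi_init,
	.handle_psi_event = (void *)handle_psi_event,
};
```

With triggers created from BPF like this, there is no userspace poll loop at all: the policy reacts to pressure entirely in the kernel, which is the point made above about removing the separate daemon.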
On Mon, Aug 18, 2025 at 10:01 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> This patchset adds an ability to customize the out of memory
> handling using bpf.
>
> It focuses on two parts:
> 1) OOM handling policy,
> 2) PSI-based OOM invocation.
>
> The idea to use bpf for customizing the OOM handling is not new, but
> unlike the previous proposal [1], which augmented the existing task
> ranking policy, this one tries to be as generic as possible and
> leverage the full power of the modern bpf.
>
> It provides a generic interface which is called before the existing OOM
> killer code and allows implementing any policy, e.g. picking a victim
> task or memory cgroup or potentially even releasing memory in other
> ways, e.g. deleting tmpfs files (the last one might require some
> additional but relatively simple changes).
>
> The past attempt to implement memory-cgroup aware policy [2] showed
> that there are multiple opinions on what the best policy is. As it's
> highly workload-dependent and specific to a concrete way of organizing
> workloads, the structure of the cgroup tree etc, a customizable
> bpf-based implementation is preferable over a in-kernel implementation
> with a dozen on sysctls.

s/on/of ?

>
> The second part is related to the fundamental question on when to
> declare the OOM event. It's a trade-off between the risk of
> unnecessary OOM kills and associated work losses and the risk of
> infinite trashing and effective soft lockups. In the last few years
> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> systemd-OOMd [4]). The common idea was to use userspace daemons to
> implement custom OOM logic as well as rely on PSI monitoring to avoid
> stalls. In this scenario the userspace daemon was supposed to handle
> the majority of OOMs, while the in-kernel OOM killer worked as the
> last resort measure to guarantee that the system would never deadlock
> on the memory. But this approach creates additional infrastructure
> churn: userspace OOM daemon is a separate entity which needs to be
> deployed, updated, monitored. A completely different pipeline needs to
> be built to monitor both types of OOM events and collect associated
> logs. A userspace daemon is more restricted in terms on what data is
> available to it. Implementing a daemon which can work reliably under a
> heavy memory pressure in the system is also tricky.
>
> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
> [3]: https://github.com/facebookincubator/oomd
> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
>
> ----
>
> v1:
> 1) Both OOM and PSI parts are now implemented using bpf struct ops,
> providing a path the future extensions (suggested by Kumar Kartikeya Dwivedi,
> Song Liu and Matt Bobrowski)
> 2) It's possible to create PSI triggers from BPF, no need for an additional
> userspace agent. (suggested by Suren Baghdasaryan)
> Also there is now a callback for the cgroup release event.
> 3) Added an ability to block on oom_lock instead of bailing out (suggested by Michal Hocko)
> 4) Added bpf_task_is_oom_victim (suggested by Michal Hocko)
> 5) PSI callbacks are scheduled using a separate workqueue (suggested by Suren Baghdasaryan)
>
> RFC:
> https://lwn.net/ml/all/20250428033617.3797686-1-roman.gushchin@linux.dev/
>
>
> Roman Gushchin (14):
>   mm: introduce bpf struct ops for OOM handling
>   bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
>   mm: introduce bpf_oom_kill_process() bpf kfunc
>   mm: introduce bpf kfuncs to deal with memcg pointers
>   mm: introduce bpf_get_root_mem_cgroup() bpf kfunc
>   mm: introduce bpf_out_of_memory() bpf kfunc
>   mm: allow specifying custom oom constraint for bpf triggers
>   mm: introduce bpf_task_is_oom_victim() kfunc
>   bpf: selftests: introduce read_cgroup_file() helper
>   bpf: selftests: bpf OOM handler test
>   sched: psi: refactor psi_trigger_create()
>   sched: psi: implement psi trigger handling using bpf
>   sched: psi: implement bpf_psi_create_trigger() kfunc
>   bpf: selftests: psi struct ops test
>
>  include/linux/bpf_oom.h                      |  49 +++
>  include/linux/bpf_psi.h                      |  71 ++++
>  include/linux/memcontrol.h                   |   2 +
>  include/linux/oom.h                          |  12 +
>  include/linux/psi.h                          |  15 +-
>  include/linux/psi_types.h                    |  72 +++-
>  kernel/bpf/verifier.c                        |   5 +
>  kernel/cgroup/cgroup.c                       |  14 +-
>  kernel/sched/bpf_psi.c                       | 337 ++++++++++++++++++
>  kernel/sched/build_utility.c                 |   4 +
>  kernel/sched/psi.c                           | 130 +++++--
>  mm/Makefile                                  |   4 +
>  mm/bpf_memcontrol.c                          | 166 +++++++++
>  mm/bpf_oom.c                                 | 157 ++++++++
>  mm/oom_kill.c                                | 182 +++++++++-
>  tools/testing/selftests/bpf/cgroup_helpers.c |  39 ++
>  tools/testing/selftests/bpf/cgroup_helpers.h |   2 +
>  .../selftests/bpf/prog_tests/test_oom.c      | 229 ++++++++++++
>  .../selftests/bpf/prog_tests/test_psi.c      | 224 ++++++++++++
>  tools/testing/selftests/bpf/progs/test_oom.c | 108 ++++++
>  tools/testing/selftests/bpf/progs/test_psi.c |  76 ++++
>  21 files changed, 1845 insertions(+), 53 deletions(-)
>  create mode 100644 include/linux/bpf_oom.h
>  create mode 100644 include/linux/bpf_psi.h
>  create mode 100644 kernel/sched/bpf_psi.c
>  create mode 100644 mm/bpf_memcontrol.c
>  create mode 100644 mm/bpf_oom.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/test_psi.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c
>
> --
> 2.50.1
>
Suren Baghdasaryan <surenb@google.com> writes:

> On Mon, Aug 18, 2025 at 10:01 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
>>
>> This patchset adds an ability to customize the out of memory
>> handling using bpf.
>>
>> It focuses on two parts:
>> 1) OOM handling policy,
>> 2) PSI-based OOM invocation.
>>
>> The idea to use bpf for customizing the OOM handling is not new, but
>> unlike the previous proposal [1], which augmented the existing task
>> ranking policy, this one tries to be as generic as possible and
>> leverage the full power of the modern bpf.
>>
>> It provides a generic interface which is called before the existing OOM
>> killer code and allows implementing any policy, e.g. picking a victim
>> task or memory cgroup or potentially even releasing memory in other
>> ways, e.g. deleting tmpfs files (the last one might require some
>> additional but relatively simple changes).
>>
>> The past attempt to implement memory-cgroup aware policy [2] showed
>> that there are multiple opinions on what the best policy is. As it's
>> highly workload-dependent and specific to a concrete way of organizing
>> workloads, the structure of the cgroup tree etc, a customizable
>> bpf-based implementation is preferable over a in-kernel implementation
>> with a dozen on sysctls.
>
> s/on/of ?

Fixed, thanks.