This patchset adds an ability to customize the out of memory handling using bpf.

It focuses on two parts:
1) OOM handling policy,
2) PSI-based OOM invocation.

The idea to use bpf for customizing the OOM handling is not new, but unlike the previous proposal [1], which augmented the existing task ranking-based policy, this one tries to be as generic as possible and leverage the full power of modern bpf.

It provides a generic hook which is called before the existing OOM killer code and allows implementing any policy, e.g. picking a victim task or memory cgroup, or potentially even releasing memory in other ways, e.g. deleting tmpfs files (the last one might require some additional but relatively simple changes).

The past attempt to implement a memory-cgroup-aware policy [2] showed that there are multiple opinions on what the best policy is. As it's highly workload-dependent and specific to a concrete way of organizing workloads, the structure of the cgroup tree etc., a customizable bpf-based implementation is preferable over an in-kernel implementation with a dozen of sysctls.

The second part is related to the fundamental question of when to declare the OOM event. It's a trade-off between the risk of unnecessary OOM kills and the associated loss of work, and the risk of infinite thrashing and effective soft lockups. In the last few years several PSI-based userspace solutions were developed (e.g. oomd [3] or systemd-oomd [4]). The common idea was to use userspace daemons to implement custom OOM logic as well as rely on PSI monitoring to avoid stalls. In this scenario the userspace daemon was supposed to handle the majority of OOMs, while the in-kernel OOM killer worked as the last-resort measure to guarantee that the system would never deadlock on memory. But this approach creates additional infrastructure churn: a userspace OOM daemon is a separate entity which needs to be deployed, updated and monitored. A completely different pipeline needs to be built to monitor both types of OOM events and collect the associated logs. A userspace daemon is also more restricted in terms of what data is available to it. Implementing a daemon which can work reliably under heavy memory pressure is tricky as well.

[1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
[2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
[3]: https://github.com/facebookincubator/oomd
[4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html

----

This is an RFC version, which is not intended to be merged in the current form.
Open questions/TODOs:
1) Program type/attachment type for the bpf_handle_out_of_memory() hook.
   It has to be able to return a value, to be sleepable (to use cgroup iterators)
   and to have trusted arguments to pass oom_control down to bpf_oom_kill_process().
   The current patchset has a workaround (patch "bpf: treat fmodret tracing program's
   arguments as trusted"), which is not safe. One option is to fake acquire/release
   semantics for the oom_control pointer. Another option is to introduce a completely
   new attachment or program type, similar to lsm hooks.
2) Currently lockdep complains about a potential circular dependency because the
   sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock.
   One way to fix it is to make it non-sleepable, but then it will require some
   additional work to allow it to use cgroup iterators. It's intertwined with 1).
3) What kind of hierarchical features are required? Do we want to nest oom policies?
   Do we want to attach oom policies to cgroups? I think it's too complicated,
   but if we want full hierarchical support, it might be required.
   The patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true
   root memcg, which is potentially outside of the ns of the loading process.
   Does it require some additional capabilities checks? Should it be removed?
4) Documentation is lacking and will be added in the next version.

Roman Gushchin (12):
  mm: introduce a bpf hook for OOM handling
  bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
  bpf: treat fmodret tracing program's arguments as trusted
  mm: introduce bpf_oom_kill_process() bpf kfunc
  mm: introduce bpf kfuncs to deal with memcg pointers
  mm: introduce bpf_get_root_mem_cgroup() bpf kfunc
  bpf: selftests: introduce read_cgroup_file() helper
  bpf: selftests: bpf OOM handler test
  sched: psi: bpf hook to handle psi events
  mm: introduce bpf_out_of_memory() bpf kfunc
  bpf: selftests: introduce open_cgroup_file() helper
  bpf: selftests: psi handler test

 include/linux/memcontrol.h                   |   2 +
 include/linux/oom.h                          |   5 +
 kernel/bpf/btf.c                             |   9 +-
 kernel/bpf/verifier.c                        |   5 +
 kernel/sched/psi.c                           |  36 ++-
 mm/Makefile                                  |   3 +
 mm/bpf_memcontrol.c                          | 108 +++++++++
 mm/oom_kill.c                                | 140 +++++++++++
 tools/testing/selftests/bpf/cgroup_helpers.c |  67 ++++++
 tools/testing/selftests/bpf/cgroup_helpers.h |   3 +
 tools/testing/selftests/bpf/prog_tests/oom.c | 227 ++++++++++++++++++
 tools/testing/selftests/bpf/prog_tests/psi.c | 234 +++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_oom.c | 103 ++++++++
 tools/testing/selftests/bpf/progs/test_psi.c |  43 ++++
 14 files changed, 983 insertions(+), 2 deletions(-)
 create mode 100644 mm/bpf_memcontrol.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/oom.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/psi.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c

--
2.49.0.901.g37484f566f-goog
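
To make the proposed interface a bit more concrete, here is a rough sketch of a minimal bpf OOM policy built on top of this series. The attach point ("fmod_ret.s/bpf_handle_out_of_memory"), the bpf_oom_kill_process() prototype and the return-value convention are assumptions derived from the patch titles and the TODO list above, not copied from the patches; the policy itself just mirrors the existing oom_kill_allocating_task sysctl.

/* Illustrative sketch only: the hook name, the kfunc prototype and the
 * return convention are assumed from the patch titles, not taken from
 * the actual patches. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Assumed prototype of the kfunc added by
 * "mm: introduce bpf_oom_kill_process() bpf kfunc". */
int bpf_oom_kill_process(struct oom_control *oc,
			 struct task_struct *task) __ksym;

SEC("fmod_ret.s/bpf_handle_out_of_memory")
int BPF_PROG(kill_allocating_task, struct oom_control *oc)
{
	/* current is the task which triggered this OOM */
	struct task_struct *task = bpf_get_current_task_btf();

	if (!oc->memcg)
		return 0;	/* global OOM: defer to the kernel OOM killer */

	bpf_oom_kill_process(oc, task);
	return 1;		/* report the OOM as handled (assumed convention) */
}

char LICENSE[] SEC("license") = "GPL";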
On Sun, Apr 27, 2025 at 8:36 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > This patchset adds an ability to customize the out of memory > handling using bpf. > > It focuses on two parts: > 1) OOM handling policy, > 2) PSI-based OOM invocation. > > The idea to use bpf for customizing the OOM handling is not new, but > unlike the previous proposal [1], which augmented the existing task > ranking-based policy, this one tries to be as generic as possible and > leverage the full power of the modern bpf. > > It provides a generic hook which is called before the existing OOM > killer code and allows implementing any policy, e.g. picking a victim > task or memory cgroup or potentially even releasing memory in other > ways, e.g. deleting tmpfs files (the last one might require some > additional but relatively simple changes). > > The past attempt to implement memory-cgroup aware policy [2] showed > that there are multiple opinions on what the best policy is. As it's > highly workload-dependent and specific to a concrete way of organizing > workloads, the structure of the cgroup tree etc, a customizable > bpf-based implementation is preferable over a in-kernel implementation > with a dozen on sysctls. > > The second part is related to the fundamental question on when to > declare the OOM event. It's a trade-off between the risk of > unnecessary OOM kills and associated work losses and the risk of > infinite trashing and effective soft lockups. In the last few years > several PSI-based userspace solutions were developed (e.g. OOMd [3] or > systemd-OOMd [4]). The common idea was to use userspace daemons to > implement custom OOM logic as well as rely on PSI monitoring to avoid > stalls. In this scenario the userspace daemon was supposed to handle > the majority of OOMs, while the in-kernel OOM killer worked as the > last resort measure to guarantee that the system would never deadlock > on the memory. But this approach creates additional infrastructure > churn: userspace OOM daemon is a separate entity which needs to be > deployed, updated, monitored. A completely different pipeline needs to > be built to monitor both types of OOM events and collect associated > logs. A userspace daemon is more restricted in terms on what data is > available to it. Implementing a daemon which can work reliably under a > heavy memory pressure in the system is also tricky. I didn't read the whole patchset yet but want to mention couple features that we should not forget: - memory reaping. Maybe you already call oom_reap_task_mm() after BPF oom-handler kills a process or maybe BPF handler is expected to implement it? - kill reporting to userspace. I think BPF handler would be expected to implement it? > > [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/ > [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/ > [3]: https://github.com/facebookincubator/oomd > [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html > > ---- > > This is an RFC version, which is not intended to be merged in the current form. > Open questions/TODOs: > 1) Program type/attachment type for the bpf_handle_out_of_memory() hook. > It has to be able to return a value, to be sleepable (to use cgroup iterators) > and to have trusted arguments to pass oom_control down to bpf_oom_kill_process(). > Current patchset has a workaround (patch "bpf: treat fmodret tracing program's > arguments as trusted"), which is not safe. 
One option is to fake acquire/release > semantics for the oom_control pointer. Other option is to introduce a completely > new attachment or program type, similar to lsm hooks. > 2) Currently lockdep complaints about a potential circular dependency because > sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock. > One way to fix it is to make it non-sleepable, but then it will require some > additional work to allow it using cgroup iterators. It's intervened with 1). > 3) What kind of hierarchical features are required? Do we want to nest oom policies? > Do we want to attach oom policies to cgroups? I think it's too complicated, > but if we want a full hierarchical support, it might be required. > Patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root > memcg, which is potentially outside of the ns of the loading process. Does > it require some additional capabilities checks? Should it be removed? > 4) Documentation is lacking and will be added in the next version. > > > Roman Gushchin (12): > mm: introduce a bpf hook for OOM handling > bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL > bpf: treat fmodret tracing program's arguments as trusted > mm: introduce bpf_oom_kill_process() bpf kfunc > mm: introduce bpf kfuncs to deal with memcg pointers > mm: introduce bpf_get_root_mem_cgroup() bpf kfunc > bpf: selftests: introduce read_cgroup_file() helper > bpf: selftests: bpf OOM handler test > sched: psi: bpf hook to handle psi events > mm: introduce bpf_out_of_memory() bpf kfunc > bpf: selftests: introduce open_cgroup_file() helper > bpf: selftests: psi handler test > > include/linux/memcontrol.h | 2 + > include/linux/oom.h | 5 + > kernel/bpf/btf.c | 9 +- > kernel/bpf/verifier.c | 5 + > kernel/sched/psi.c | 36 ++- > mm/Makefile | 3 + > mm/bpf_memcontrol.c | 108 +++++++++ > mm/oom_kill.c | 140 +++++++++++ > tools/testing/selftests/bpf/cgroup_helpers.c | 67 ++++++ > tools/testing/selftests/bpf/cgroup_helpers.h | 3 + > tools/testing/selftests/bpf/prog_tests/oom.c | 227 ++++++++++++++++++ > tools/testing/selftests/bpf/prog_tests/psi.c | 234 +++++++++++++++++++ > tools/testing/selftests/bpf/progs/test_oom.c | 103 ++++++++ > tools/testing/selftests/bpf/progs/test_psi.c | 43 ++++ > 14 files changed, 983 insertions(+), 2 deletions(-) > create mode 100644 mm/bpf_memcontrol.c > create mode 100644 tools/testing/selftests/bpf/prog_tests/oom.c > create mode 100644 tools/testing/selftests/bpf/prog_tests/psi.c > create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c > create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c > > -- > 2.49.0.901.g37484f566f-goog >
On Tue, Apr 29, 2025 at 03:44:14PM -0700, Suren Baghdasaryan wrote: > On Sun, Apr 27, 2025 at 8:36 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > This patchset adds an ability to customize the out of memory > > handling using bpf. > > > > It focuses on two parts: > > 1) OOM handling policy, > > 2) PSI-based OOM invocation. > > > > The idea to use bpf for customizing the OOM handling is not new, but > > unlike the previous proposal [1], which augmented the existing task > > ranking-based policy, this one tries to be as generic as possible and > > leverage the full power of the modern bpf. > > > > It provides a generic hook which is called before the existing OOM > > killer code and allows implementing any policy, e.g. picking a victim > > task or memory cgroup or potentially even releasing memory in other > > ways, e.g. deleting tmpfs files (the last one might require some > > additional but relatively simple changes). > > > > The past attempt to implement memory-cgroup aware policy [2] showed > > that there are multiple opinions on what the best policy is. As it's > > highly workload-dependent and specific to a concrete way of organizing > > workloads, the structure of the cgroup tree etc, a customizable > > bpf-based implementation is preferable over a in-kernel implementation > > with a dozen on sysctls. > > > > The second part is related to the fundamental question on when to > > declare the OOM event. It's a trade-off between the risk of > > unnecessary OOM kills and associated work losses and the risk of > > infinite trashing and effective soft lockups. In the last few years > > several PSI-based userspace solutions were developed (e.g. OOMd [3] or > > systemd-OOMd [4]). The common idea was to use userspace daemons to > > implement custom OOM logic as well as rely on PSI monitoring to avoid > > stalls. In this scenario the userspace daemon was supposed to handle > > the majority of OOMs, while the in-kernel OOM killer worked as the > > last resort measure to guarantee that the system would never deadlock > > on the memory. But this approach creates additional infrastructure > > churn: userspace OOM daemon is a separate entity which needs to be > > deployed, updated, monitored. A completely different pipeline needs to > > be built to monitor both types of OOM events and collect associated > > logs. A userspace daemon is more restricted in terms on what data is > > available to it. Implementing a daemon which can work reliably under a > > heavy memory pressure in the system is also tricky. > > I didn't read the whole patchset yet but want to mention couple > features that we should not forget: > - memory reaping. Maybe you already call oom_reap_task_mm() after BPF > oom-handler kills a process or maybe BPF handler is expected to > implement it? > - kill reporting to userspace. I think BPF handler would be expected > to implement it? The patchset implements the bpf_oom_kill_process() helper, which kills the desired process the same way as the kernel oom killer: with the help of the oom reaper, dumping corresponding stats into dmesg and bumping corresponding memcg- and system-level stats. For additional reporting generic bpf<->userspace interaction mechanims can be used, e.g. ringbuffer. Thanks!
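
For the reporting part, a conventional BPF ring buffer is enough; a minimal sketch of such kill reporting is below. The event layout and the report_kill() helper are made up for illustration, only the ringbuf API itself is standard.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct oom_kill_event {
	int pid;
	char comm[16];
};

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} oom_events SEC(".maps");

/* Called by the OOM handler after it has chosen and killed a victim. */
static __always_inline void report_kill(struct task_struct *task)
{
	struct oom_kill_event *e;

	e = bpf_ringbuf_reserve(&oom_events, sizeof(*e), 0);
	if (!e)
		return;

	e->pid = task->pid;
	bpf_probe_read_kernel_str(e->comm, sizeof(e->comm), task->comm);
	bpf_ringbuf_submit(e, 0);
}

Userspace then consumes the events with the usual libbpf ring_buffer__new()/ring_buffer__poll() pair, so no OOM-specific reporting channel is needed.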
On Mon 28-04-25 03:36:05, Roman Gushchin wrote: > This patchset adds an ability to customize the out of memory > handling using bpf. > > It focuses on two parts: > 1) OOM handling policy, > 2) PSI-based OOM invocation. > > The idea to use bpf for customizing the OOM handling is not new, but > unlike the previous proposal [1], which augmented the existing task > ranking-based policy, this one tries to be as generic as possible and > leverage the full power of the modern bpf. > > It provides a generic hook which is called before the existing OOM > killer code and allows implementing any policy, e.g. picking a victim > task or memory cgroup or potentially even releasing memory in other > ways, e.g. deleting tmpfs files (the last one might require some > additional but relatively simple changes). Makes sense to me. I still have a slight concern though. We have 3 different oom handlers smashed into a single one with special casing involved. This is manageable (although not great) for the in kernel code but I am wondering whether we should do better for BPF based OOM implementations. Would it make sense to have different callbacks for cpuset, memcg and global oom killer handlers? I can see you have already added some helper functions to deal with memcgs but I do not see anything to iterate processes or find a process to kill etc. Is that functionality generally available (sorry I am not really familiar with BPF all that much so please bear with me)? I like the way how you naturalely hooked into existing OOM primitives like oom_kill_process but I do not see tsk_is_oom_victim exposed. Are you waiting for a first user that needs to implement oom victim synchronization or do you plan to integrate that into tasks iterators? I am mostly asking because it is exactly these kind of details that make the current in kernel oom handler quite complex and it would be great if custom ones do not have to reproduce that complexity and only focus on the high level policy. > The second part is related to the fundamental question on when to > declare the OOM event. It's a trade-off between the risk of > unnecessary OOM kills and associated work losses and the risk of > infinite trashing and effective soft lockups. In the last few years > several PSI-based userspace solutions were developed (e.g. OOMd [3] or > systemd-OOMd [4]). The common idea was to use userspace daemons to > implement custom OOM logic as well as rely on PSI monitoring to avoid > stalls. This makes sense to me as well. I have to admit I am not fully familiar with PSI integration into sched code but from what I can see the evaluation is done on regular bases from the worker context kicked off from the scheduler code. There shouldn't be any locking constrains which is good. Is there any risk if the oom handler took too long though? Also an important question. I can see selftests which are using the infrastructure. But have you tried to implement a real OOM handler with this proposed infrastructure? > [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/ > [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/ > [3]: https://github.com/facebookincubator/oomd > [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html > > ---- > > This is an RFC version, which is not intended to be merged in the current form. > Open questions/TODOs: > 1) Program type/attachment type for the bpf_handle_out_of_memory() hook. 
> It has to be able to return a value, to be sleepable (to use cgroup iterators) > and to have trusted arguments to pass oom_control down to bpf_oom_kill_process(). > Current patchset has a workaround (patch "bpf: treat fmodret tracing program's > arguments as trusted"), which is not safe. One option is to fake acquire/release > semantics for the oom_control pointer. Other option is to introduce a completely > new attachment or program type, similar to lsm hooks. > 2) Currently lockdep complaints about a potential circular dependency because > sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock. > One way to fix it is to make it non-sleepable, but then it will require some > additional work to allow it using cgroup iterators. It's intervened with 1). I cannot see this in the code. Could you be more specific please? Where is this might_fault coming from? Is this BPF constrain? > 3) What kind of hierarchical features are required? Do we want to nest oom policies? > Do we want to attach oom policies to cgroups? I think it's too complicated, > but if we want a full hierarchical support, it might be required. > Patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root > memcg, which is potentially outside of the ns of the loading process. Does > it require some additional capabilities checks? Should it be removed? Yes, let's start simple and see where we get from there. > 4) Documentation is lacking and will be added in the next version. +1 Thanks! -- Michal Hocko SUSE Labs
Michal Hocko <mhocko@suse.com> writes: > On Mon 28-04-25 03:36:05, Roman Gushchin wrote: >> This patchset adds an ability to customize the out of memory >> handling using bpf. >> >> It focuses on two parts: >> 1) OOM handling policy, >> 2) PSI-based OOM invocation. >> >> The idea to use bpf for customizing the OOM handling is not new, but >> unlike the previous proposal [1], which augmented the existing task >> ranking-based policy, this one tries to be as generic as possible and >> leverage the full power of the modern bpf. >> >> It provides a generic hook which is called before the existing OOM >> killer code and allows implementing any policy, e.g. picking a victim >> task or memory cgroup or potentially even releasing memory in other >> ways, e.g. deleting tmpfs files (the last one might require some >> additional but relatively simple changes). > > Makes sense to me. I still have a slight concern though. We have 3 > different oom handlers smashed into a single one with special casing > involved. This is manageable (although not great) for the in kernel > code but I am wondering whether we should do better for BPF based OOM > implementations. Would it make sense to have different callbacks for > cpuset, memcg and global oom killer handlers? Yes, it's certainly possible. If we go struct_ops path, we can even have both the common hook which handles all types of OOM's and separate hooks for each type. The user then can choose what's more convenient. Good point. > > I can see you have already added some helper functions to deal with > memcgs but I do not see anything to iterate processes or find a process to > kill etc. Is that functionality generally available (sorry I am not > really familiar with BPF all that much so please bear with me)? Yes, task iterator is available since v6.7: https://docs.ebpf.io/linux/kfuncs/bpf_iter_task_new/ > > I like the way how you naturalely hooked into existing OOM primitives > like oom_kill_process but I do not see tsk_is_oom_victim exposed. Are > you waiting for a first user that needs to implement oom victim > synchronization or do you plan to integrate that into tasks iterators? It can be implemented in bpf directly, but I agree that it probably deserves at least an example in the test or a separate in-kernel helper. In-kernel helper is probably a better idea. > I am mostly asking because it is exactly these kind of details that > make the current in kernel oom handler quite complex and it would be > great if custom ones do not have to reproduce that complexity and only > focus on the high level policy. Totally agree. > >> The second part is related to the fundamental question on when to >> declare the OOM event. It's a trade-off between the risk of >> unnecessary OOM kills and associated work losses and the risk of >> infinite trashing and effective soft lockups. In the last few years >> several PSI-based userspace solutions were developed (e.g. OOMd [3] or >> systemd-OOMd [4]). The common idea was to use userspace daemons to >> implement custom OOM logic as well as rely on PSI monitoring to avoid >> stalls. > > This makes sense to me as well. I have to admit I am not fully familiar > with PSI integration into sched code but from what I can see the > evaluation is done on regular bases from the worker context kicked off > from the scheduler code. There shouldn't be any locking constrains which > is good. Is there any risk if the oom handler took too long though? It's a good question. In theory yes, it can affect the timing of other PSI events. 
An option here is to move it into a separate work, however I'm not sure if it worth the added complexity. I actually tried this approach in an earlier version of this patchset, but the problem was that the code for scheduling this work should be dynamically turned on/off when a bpf program is attached/detached, otherwise it's an obvious cpu overhead. It's doable, but Idk if it's justified. > > Also an important question. I can see selftests which are using the > infrastructure. But have you tried to implement a real OOM handler with > this proposed infrastructure? Not yet. Given the size and complexity of the infrastructure of my current employer, it's not a short process. But we're working on it. > >> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/ >> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/ >> [3]: https://github.com/facebookincubator/oomd >> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html >> >> ---- >> >> This is an RFC version, which is not intended to be merged in the current form. >> Open questions/TODOs: >> 1) Program type/attachment type for the bpf_handle_out_of_memory() hook. >> It has to be able to return a value, to be sleepable (to use cgroup iterators) >> and to have trusted arguments to pass oom_control down to bpf_oom_kill_process(). >> Current patchset has a workaround (patch "bpf: treat fmodret tracing program's >> arguments as trusted"), which is not safe. One option is to fake acquire/release >> semantics for the oom_control pointer. Other option is to introduce a completely >> new attachment or program type, similar to lsm hooks. >> 2) Currently lockdep complaints about a potential circular dependency because >> sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock. >> One way to fix it is to make it non-sleepable, but then it will require some >> additional work to allow it using cgroup iterators. It's intervened with 1). > > I cannot see this in the code. Could you be more specific please? Where > is this might_fault coming from? Is this BPF constrain? It's in __bpf_prog_enter_sleepable(). But I hope I can make this hook non-sleepable (by going struct_ops path) and the problem will go away. > >> 3) What kind of hierarchical features are required? Do we want to nest oom policies? >> Do we want to attach oom policies to cgroups? I think it's too complicated, >> but if we want a full hierarchical support, it might be required. >> Patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root >> memcg, which is potentially outside of the ns of the loading process. Does >> it require some additional capabilities checks? Should it be removed? > > Yes, let's start simple and see where we get from there. Agree. Thank you for taking a look and your comments/ideas!
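
For reference, the open-coded task iterator mentioned above makes "walk all processes and pick a victim" fairly compact on the bpf side. A rough sketch follows; it assumes a recent vmlinux.h, the kfunc prototypes as exported since v6.7, reads via BPF_CORE_READ() to stay program-type agnostic, and a placeholder selection policy. The exact verifier constraints (RCU protection vs. sleepable context) depend on the attach point.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

extern int bpf_iter_task_new(struct bpf_iter_task *it,
			     struct task_struct *task,
			     unsigned int flags) __ksym;
extern struct task_struct *bpf_iter_task_next(struct bpf_iter_task *it) __ksym;
extern void bpf_iter_task_destroy(struct bpf_iter_task *it) __ksym;
extern void bpf_rcu_read_lock(void) __ksym;
extern void bpf_rcu_read_unlock(void) __ksym;

/* Placeholder policy: pick the process with the highest oom_score_adj. */
static int find_victim_pid(void)
{
	struct bpf_iter_task it;
	struct task_struct *pos;
	int victim_pid = -1, max_adj = -1001;

	bpf_rcu_read_lock();
	bpf_iter_task_new(&it, NULL, BPF_TASK_ITER_ALL_PROCS);
	while ((pos = bpf_iter_task_next(&it))) {
		int adj = BPF_CORE_READ(pos, signal, oom_score_adj);

		if (adj > max_adj) {
			max_adj = adj;
			victim_pid = BPF_CORE_READ(pos, pid);
		}
	}
	bpf_iter_task_destroy(&it);
	bpf_rcu_read_unlock();

	return victim_pid;
}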
On Tue, Apr 29, 2025 at 7:45 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > Michal Hocko <mhocko@suse.com> writes: > > > On Mon 28-04-25 03:36:05, Roman Gushchin wrote: > >> This patchset adds an ability to customize the out of memory > >> handling using bpf. > >> > >> It focuses on two parts: > >> 1) OOM handling policy, > >> 2) PSI-based OOM invocation. > >> > >> The idea to use bpf for customizing the OOM handling is not new, but > >> unlike the previous proposal [1], which augmented the existing task > >> ranking-based policy, this one tries to be as generic as possible and > >> leverage the full power of the modern bpf. > >> > >> It provides a generic hook which is called before the existing OOM > >> killer code and allows implementing any policy, e.g. picking a victim > >> task or memory cgroup or potentially even releasing memory in other > >> ways, e.g. deleting tmpfs files (the last one might require some > >> additional but relatively simple changes). > > > > Makes sense to me. I still have a slight concern though. We have 3 > > different oom handlers smashed into a single one with special casing > > involved. This is manageable (although not great) for the in kernel > > code but I am wondering whether we should do better for BPF based OOM > > implementations. Would it make sense to have different callbacks for > > cpuset, memcg and global oom killer handlers? > > Yes, it's certainly possible. If we go struct_ops path, we can even > have both the common hook which handles all types of OOM's and separate > hooks for each type. The user then can choose what's more convenient. > Good point. > > > > > I can see you have already added some helper functions to deal with > > memcgs but I do not see anything to iterate processes or find a process to > > kill etc. Is that functionality generally available (sorry I am not > > really familiar with BPF all that much so please bear with me)? > > Yes, task iterator is available since v6.7: > https://docs.ebpf.io/linux/kfuncs/bpf_iter_task_new/ > > > > > I like the way how you naturalely hooked into existing OOM primitives > > like oom_kill_process but I do not see tsk_is_oom_victim exposed. Are > > you waiting for a first user that needs to implement oom victim > > synchronization or do you plan to integrate that into tasks iterators? > > It can be implemented in bpf directly, but I agree that it probably > deserves at least an example in the test or a separate in-kernel helper. > In-kernel helper is probably a better idea. > > > I am mostly asking because it is exactly these kind of details that > > make the current in kernel oom handler quite complex and it would be > > great if custom ones do not have to reproduce that complexity and only > > focus on the high level policy. > > Totally agree. > > > > >> The second part is related to the fundamental question on when to > >> declare the OOM event. It's a trade-off between the risk of > >> unnecessary OOM kills and associated work losses and the risk of > >> infinite trashing and effective soft lockups. In the last few years > >> several PSI-based userspace solutions were developed (e.g. OOMd [3] or > >> systemd-OOMd [4]). The common idea was to use userspace daemons to > >> implement custom OOM logic as well as rely on PSI monitoring to avoid > >> stalls. > > > > This makes sense to me as well. 
I have to admit I am not fully familiar > > with PSI integration into sched code but from what I can see the > > evaluation is done on regular bases from the worker context kicked off > > from the scheduler code. There shouldn't be any locking constrains which > > is good. Is there any risk if the oom handler took too long though? > > It's a good question. In theory yes, it can affect the timing of other > PSI events. An option here is to move it into a separate work, however > I'm not sure if it worth the added complexity. I actually tried this > approach in an earlier version of this patchset, but the problem was > that the code for scheduling this work should be dynamically turned > on/off when a bpf program is attached/detached, otherwise it's an > obvious cpu overhead. > It's doable, but Idk if it's justified. I think this is a legitimate concern. bpf_handle_psi_event() can block update_triggers() and delay other PSI triggers. > > > > > Also an important question. I can see selftests which are using the > > infrastructure. But have you tried to implement a real OOM handler with > > this proposed infrastructure? > > Not yet. Given the size and complexity of the infrastructure of my > current employer, it's not a short process. But we're working on it. > > > > >> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/ > >> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/ > >> [3]: https://github.com/facebookincubator/oomd > >> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html > >> > >> ---- > >> > >> This is an RFC version, which is not intended to be merged in the current form. > >> Open questions/TODOs: > >> 1) Program type/attachment type for the bpf_handle_out_of_memory() hook. > >> It has to be able to return a value, to be sleepable (to use cgroup iterators) > >> and to have trusted arguments to pass oom_control down to bpf_oom_kill_process(). > >> Current patchset has a workaround (patch "bpf: treat fmodret tracing program's > >> arguments as trusted"), which is not safe. One option is to fake acquire/release > >> semantics for the oom_control pointer. Other option is to introduce a completely > >> new attachment or program type, similar to lsm hooks. > >> 2) Currently lockdep complaints about a potential circular dependency because > >> sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock. > >> One way to fix it is to make it non-sleepable, but then it will require some > >> additional work to allow it using cgroup iterators. It's intervened with 1). > > > > I cannot see this in the code. Could you be more specific please? Where > > is this might_fault coming from? Is this BPF constrain? > > It's in __bpf_prog_enter_sleepable(). But I hope I can make this hook > non-sleepable (by going struct_ops path) and the problem will go away. > > > > >> 3) What kind of hierarchical features are required? Do we want to nest oom policies? > >> Do we want to attach oom policies to cgroups? I think it's too complicated, > >> but if we want a full hierarchical support, it might be required. > >> Patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root > >> memcg, which is potentially outside of the ns of the loading process. Does > >> it require some additional capabilities checks? Should it be removed? > > > > Yes, let's start simple and see where we get from there. > > Agree. > > Thank you for taking a look and your comments/ideas! >
On Tue, Apr 29, 2025 at 7:45 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > Michal Hocko <mhocko@suse.com> writes: > > > On Mon 28-04-25 03:36:05, Roman Gushchin wrote: > >> This patchset adds an ability to customize the out of memory > >> handling using bpf. > >> > >> It focuses on two parts: > >> 1) OOM handling policy, > >> 2) PSI-based OOM invocation. > >> > >> The idea to use bpf for customizing the OOM handling is not new, but > >> unlike the previous proposal [1], which augmented the existing task > >> ranking-based policy, this one tries to be as generic as possible and > >> leverage the full power of the modern bpf. > >> > >> It provides a generic hook which is called before the existing OOM > >> killer code and allows implementing any policy, e.g. picking a victim > >> task or memory cgroup or potentially even releasing memory in other > >> ways, e.g. deleting tmpfs files (the last one might require some > >> additional but relatively simple changes). > > > > Makes sense to me. I still have a slight concern though. We have 3 > > different oom handlers smashed into a single one with special casing > > involved. This is manageable (although not great) for the in kernel > > code but I am wondering whether we should do better for BPF based OOM > > implementations. Would it make sense to have different callbacks for > > cpuset, memcg and global oom killer handlers? > > Yes, it's certainly possible. If we go struct_ops path, we can even > have both the common hook which handles all types of OOM's and separate > hooks for each type. The user then can choose what's more convenient. > Good point. > > > > > I can see you have already added some helper functions to deal with > > memcgs but I do not see anything to iterate processes or find a process to > > kill etc. Is that functionality generally available (sorry I am not > > really familiar with BPF all that much so please bear with me)? > > Yes, task iterator is available since v6.7: > https://docs.ebpf.io/linux/kfuncs/bpf_iter_task_new/ > > > > > I like the way how you naturalely hooked into existing OOM primitives > > like oom_kill_process but I do not see tsk_is_oom_victim exposed. Are > > you waiting for a first user that needs to implement oom victim > > synchronization or do you plan to integrate that into tasks iterators? > > It can be implemented in bpf directly, but I agree that it probably > deserves at least an example in the test or a separate in-kernel helper. > In-kernel helper is probably a better idea. > > > I am mostly asking because it is exactly these kind of details that > > make the current in kernel oom handler quite complex and it would be > > great if custom ones do not have to reproduce that complexity and only > > focus on the high level policy. > > Totally agree. > > > > >> The second part is related to the fundamental question on when to > >> declare the OOM event. It's a trade-off between the risk of > >> unnecessary OOM kills and associated work losses and the risk of > >> infinite trashing and effective soft lockups. In the last few years > >> several PSI-based userspace solutions were developed (e.g. OOMd [3] or > >> systemd-OOMd [4]). The common idea was to use userspace daemons to > >> implement custom OOM logic as well as rely on PSI monitoring to avoid > >> stalls. > > > > This makes sense to me as well. 
I have to admit I am not fully familiar > > with PSI integration into sched code but from what I can see the > > evaluation is done on regular bases from the worker context kicked off > > from the scheduler code. There shouldn't be any locking constrains which > > is good. Is there any risk if the oom handler took too long though? > > It's a good question. In theory yes, it can affect the timing of other > PSI events. An option here is to move it into a separate work, however > I'm not sure if it worth the added complexity. I actually tried this > approach in an earlier version of this patchset, but the problem was > that the code for scheduling this work should be dynamically turned > on/off when a bpf program is attached/detached, otherwise it's an > obvious cpu overhead. > It's doable, but Idk if it's justified. > > > > > Also an important question. I can see selftests which are using the > > infrastructure. But have you tried to implement a real OOM handler with > > this proposed infrastructure? > > Not yet. Given the size and complexity of the infrastructure of my > current employer, it's not a short process. But we're working on it. Hi Roman, This might end up being very useful for Android. Since we have a shared current employer, we might be able to provide an earlier test environment for this concept on Android and speed up development of a real OOM handler. I'll be following the development of this patchset and will see if we can come up with an early prototype for testing. > > > > >> [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/ > >> [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/ > >> [3]: https://github.com/facebookincubator/oomd > >> [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html > >> > >> ---- > >> > >> This is an RFC version, which is not intended to be merged in the current form. > >> Open questions/TODOs: > >> 1) Program type/attachment type for the bpf_handle_out_of_memory() hook. > >> It has to be able to return a value, to be sleepable (to use cgroup iterators) > >> and to have trusted arguments to pass oom_control down to bpf_oom_kill_process(). > >> Current patchset has a workaround (patch "bpf: treat fmodret tracing program's > >> arguments as trusted"), which is not safe. One option is to fake acquire/release > >> semantics for the oom_control pointer. Other option is to introduce a completely > >> new attachment or program type, similar to lsm hooks. > >> 2) Currently lockdep complaints about a potential circular dependency because > >> sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock. > >> One way to fix it is to make it non-sleepable, but then it will require some > >> additional work to allow it using cgroup iterators. It's intervened with 1). > > > > I cannot see this in the code. Could you be more specific please? Where > > is this might_fault coming from? Is this BPF constrain? > > It's in __bpf_prog_enter_sleepable(). But I hope I can make this hook > non-sleepable (by going struct_ops path) and the problem will go away. > > > > >> 3) What kind of hierarchical features are required? Do we want to nest oom policies? > >> Do we want to attach oom policies to cgroups? I think it's too complicated, > >> but if we want a full hierarchical support, it might be required. > >> Patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root > >> memcg, which is potentially outside of the ns of the loading process. 
Does > >> it require some additional capabilities checks? Should it be removed? > > > > Yes, let's start simple and see where we get from there. > > Agree. > > Thank you for taking a look and your comments/ideas! >
On Tue, Apr 29, 2025 at 02:56:31PM -0700, Suren Baghdasaryan wrote: > On Tue, Apr 29, 2025 at 7:45 AM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > Michal Hocko <mhocko@suse.com> writes: > > > > > On Mon 28-04-25 03:36:05, Roman Gushchin wrote: > > >> This patchset adds an ability to customize the out of memory > > >> handling using bpf. > > >> > > >> It focuses on two parts: > > >> 1) OOM handling policy, > > >> 2) PSI-based OOM invocation. > > >> > > >> The idea to use bpf for customizing the OOM handling is not new, but > > >> unlike the previous proposal [1], which augmented the existing task > > >> ranking-based policy, this one tries to be as generic as possible and > > >> leverage the full power of the modern bpf. > > >> > > >> It provides a generic hook which is called before the existing OOM > > >> killer code and allows implementing any policy, e.g. picking a victim > > >> task or memory cgroup or potentially even releasing memory in other > > >> ways, e.g. deleting tmpfs files (the last one might require some > > >> additional but relatively simple changes). > > > > > > Makes sense to me. I still have a slight concern though. We have 3 > > > different oom handlers smashed into a single one with special casing > > > involved. This is manageable (although not great) for the in kernel > > > code but I am wondering whether we should do better for BPF based OOM > > > implementations. Would it make sense to have different callbacks for > > > cpuset, memcg and global oom killer handlers? > > > > Yes, it's certainly possible. If we go struct_ops path, we can even > > have both the common hook which handles all types of OOM's and separate > > hooks for each type. The user then can choose what's more convenient. > > Good point. > > > > > > > > I can see you have already added some helper functions to deal with > > > memcgs but I do not see anything to iterate processes or find a process to > > > kill etc. Is that functionality generally available (sorry I am not > > > really familiar with BPF all that much so please bear with me)? > > > > Yes, task iterator is available since v6.7: > > https://docs.ebpf.io/linux/kfuncs/bpf_iter_task_new/ > > > > > > > > I like the way how you naturalely hooked into existing OOM primitives > > > like oom_kill_process but I do not see tsk_is_oom_victim exposed. Are > > > you waiting for a first user that needs to implement oom victim > > > synchronization or do you plan to integrate that into tasks iterators? > > > > It can be implemented in bpf directly, but I agree that it probably > > deserves at least an example in the test or a separate in-kernel helper. > > In-kernel helper is probably a better idea. > > > > > I am mostly asking because it is exactly these kind of details that > > > make the current in kernel oom handler quite complex and it would be > > > great if custom ones do not have to reproduce that complexity and only > > > focus on the high level policy. > > > > Totally agree. > > > > > > > >> The second part is related to the fundamental question on when to > > >> declare the OOM event. It's a trade-off between the risk of > > >> unnecessary OOM kills and associated work losses and the risk of > > >> infinite trashing and effective soft lockups. In the last few years > > >> several PSI-based userspace solutions were developed (e.g. OOMd [3] or > > >> systemd-OOMd [4]). The common idea was to use userspace daemons to > > >> implement custom OOM logic as well as rely on PSI monitoring to avoid > > >> stalls. 
> > > > > > This makes sense to me as well. I have to admit I am not fully familiar > > > with PSI integration into sched code but from what I can see the > > > evaluation is done on regular bases from the worker context kicked off > > > from the scheduler code. There shouldn't be any locking constrains which > > > is good. Is there any risk if the oom handler took too long though? > > > > It's a good question. In theory yes, it can affect the timing of other > > PSI events. An option here is to move it into a separate work, however > > I'm not sure if it worth the added complexity. I actually tried this > > approach in an earlier version of this patchset, but the problem was > > that the code for scheduling this work should be dynamically turned > > on/off when a bpf program is attached/detached, otherwise it's an > > obvious cpu overhead. > > It's doable, but Idk if it's justified. > > > > > > > > Also an important question. I can see selftests which are using the > > > infrastructure. But have you tried to implement a real OOM handler with > > > this proposed infrastructure? > > > > Not yet. Given the size and complexity of the infrastructure of my > > current employer, it's not a short process. But we're working on it. > > Hi Roman, > This might end up being very useful for Android. Since we have a > shared current employer, we might be able to provide an earlier test > environment for this concept on Android and speed up development of a > real OOM handler. I'll be following the development of this patchset > and will see if we can come up with an early prototype for testing. Hi Suren, Sounds great, thank you!
On Mon, Apr 28, 2025 at 03:36:05AM +0000, Roman Gushchin wrote: > This patchset adds an ability to customize the out of memory > handling using bpf. > > It focuses on two parts: > 1) OOM handling policy, > 2) PSI-based OOM invocation. > > The idea to use bpf for customizing the OOM handling is not new, but > unlike the previous proposal [1], which augmented the existing task > ranking-based policy, this one tries to be as generic as possible and > leverage the full power of the modern bpf. > > It provides a generic hook which is called before the existing OOM > killer code and allows implementing any policy, e.g. picking a victim > task or memory cgroup or potentially even releasing memory in other > ways, e.g. deleting tmpfs files (the last one might require some > additional but relatively simple changes). > > The past attempt to implement memory-cgroup aware policy [2] showed > that there are multiple opinions on what the best policy is. As it's > highly workload-dependent and specific to a concrete way of organizing > workloads, the structure of the cgroup tree etc, a customizable > bpf-based implementation is preferable over a in-kernel implementation > with a dozen on sysctls. > > The second part is related to the fundamental question on when to > declare the OOM event. It's a trade-off between the risk of > unnecessary OOM kills and associated work losses and the risk of > infinite trashing and effective soft lockups. In the last few years > several PSI-based userspace solutions were developed (e.g. OOMd [3] or > systemd-OOMd [4]). The common idea was to use userspace daemons to > implement custom OOM logic as well as rely on PSI monitoring to avoid > stalls. In this scenario the userspace daemon was supposed to handle > the majority of OOMs, while the in-kernel OOM killer worked as the > last resort measure to guarantee that the system would never deadlock > on the memory. But this approach creates additional infrastructure > churn: userspace OOM daemon is a separate entity which needs to be > deployed, updated, monitored. A completely different pipeline needs to > be built to monitor both types of OOM events and collect associated > logs. A userspace daemon is more restricted in terms on what data is > available to it. Implementing a daemon which can work reliably under a > heavy memory pressure in the system is also tricky. > > [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/ > [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/ > [3]: https://github.com/facebookincubator/oomd > [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html > > ---- > > This is an RFC version, which is not intended to be merged in the current form. > Open questions/TODOs: > 1) Program type/attachment type for the bpf_handle_out_of_memory() hook. > It has to be able to return a value, to be sleepable (to use cgroup iterators) > and to have trusted arguments to pass oom_control down to bpf_oom_kill_process(). > Current patchset has a workaround (patch "bpf: treat fmodret tracing program's > arguments as trusted"), which is not safe. One option is to fake acquire/release > semantics for the oom_control pointer. Other option is to introduce a completely > new attachment or program type, similar to lsm hooks. 
Thinking out loud now, but rather than introducing and having a single BPF-specific function/interface, and BPF program for that matter, which can effectively be used to short-circuit steps from within out_of_memory(), why not introduce a tcp_congestion_ops/sched_ext_ops-like interface which essentially provides a multifaceted interface for controlling OOM killing (->select_bad_process, ->oom_kill_process, etc), optionally also from the context of a BPF program (BPF_PROG_TYPE_STRUCT_OPS)? I don't know whether that's what you meant by introducing a new attachment, or program type, but an approach like this is what immediately comes to mind when wanting to provide more than a single implementation for a set of operations within the Linux kernel, particularly also from the context of a BPF program. > 2) Currently lockdep complaints about a potential circular dependency because > sleepable bpf_handle_out_of_memory() hook calls might_fault() under oom_lock. > One way to fix it is to make it non-sleepable, but then it will require some > additional work to allow it using cgroup iterators. It's intervened with 1). > 3) What kind of hierarchical features are required? Do we want to nest oom policies? > Do we want to attach oom policies to cgroups? I think it's too complicated, > but if we want a full hierarchical support, it might be required. > Patch "mm: introduce bpf_get_root_mem_cgroup() bpf kfunc" exposes the true root > memcg, which is potentially outside of the ns of the loading process. Does > it require some additional capabilities checks? Should it be removed? > 4) Documentation is lacking and will be added in the next version. > > > Roman Gushchin (12): > mm: introduce a bpf hook for OOM handling > bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL > bpf: treat fmodret tracing program's arguments as trusted > mm: introduce bpf_oom_kill_process() bpf kfunc > mm: introduce bpf kfuncs to deal with memcg pointers > mm: introduce bpf_get_root_mem_cgroup() bpf kfunc > bpf: selftests: introduce read_cgroup_file() helper > bpf: selftests: bpf OOM handler test > sched: psi: bpf hook to handle psi events > mm: introduce bpf_out_of_memory() bpf kfunc > bpf: selftests: introduce open_cgroup_file() helper > bpf: selftests: psi handler test > > include/linux/memcontrol.h | 2 + > include/linux/oom.h | 5 + > kernel/bpf/btf.c | 9 +- > kernel/bpf/verifier.c | 5 + > kernel/sched/psi.c | 36 ++- > mm/Makefile | 3 + > mm/bpf_memcontrol.c | 108 +++++++++ > mm/oom_kill.c | 140 +++++++++++ > tools/testing/selftests/bpf/cgroup_helpers.c | 67 ++++++ > tools/testing/selftests/bpf/cgroup_helpers.h | 3 + > tools/testing/selftests/bpf/prog_tests/oom.c | 227 ++++++++++++++++++ > tools/testing/selftests/bpf/prog_tests/psi.c | 234 +++++++++++++++++++ > tools/testing/selftests/bpf/progs/test_oom.c | 103 ++++++++ > tools/testing/selftests/bpf/progs/test_psi.c | 43 ++++ > 14 files changed, 983 insertions(+), 2 deletions(-) > create mode 100644 mm/bpf_memcontrol.c > create mode 100644 tools/testing/selftests/bpf/prog_tests/oom.c > create mode 100644 tools/testing/selftests/bpf/prog_tests/psi.c > create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c > create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c > > -- > 2.49.0.901.g37484f566f-goog >
On Mon, Apr 28, 2025 at 10:43:07AM +0000, Matt Bobrowski wrote: > On Mon, Apr 28, 2025 at 03:36:05AM +0000, Roman Gushchin wrote: > > This patchset adds an ability to customize the out of memory > > handling using bpf. > > > > It focuses on two parts: > > 1) OOM handling policy, > > 2) PSI-based OOM invocation. > > > > The idea to use bpf for customizing the OOM handling is not new, but > > unlike the previous proposal [1], which augmented the existing task > > ranking-based policy, this one tries to be as generic as possible and > > leverage the full power of the modern bpf. > > > > It provides a generic hook which is called before the existing OOM > > killer code and allows implementing any policy, e.g. picking a victim > > task or memory cgroup or potentially even releasing memory in other > > ways, e.g. deleting tmpfs files (the last one might require some > > additional but relatively simple changes). > > > > The past attempt to implement memory-cgroup aware policy [2] showed > > that there are multiple opinions on what the best policy is. As it's > > highly workload-dependent and specific to a concrete way of organizing > > workloads, the structure of the cgroup tree etc, a customizable > > bpf-based implementation is preferable over a in-kernel implementation > > with a dozen on sysctls. > > > > The second part is related to the fundamental question on when to > > declare the OOM event. It's a trade-off between the risk of > > unnecessary OOM kills and associated work losses and the risk of > > infinite trashing and effective soft lockups. In the last few years > > several PSI-based userspace solutions were developed (e.g. OOMd [3] or > > systemd-OOMd [4]). The common idea was to use userspace daemons to > > implement custom OOM logic as well as rely on PSI monitoring to avoid > > stalls. In this scenario the userspace daemon was supposed to handle > > the majority of OOMs, while the in-kernel OOM killer worked as the > > last resort measure to guarantee that the system would never deadlock > > on the memory. But this approach creates additional infrastructure > > churn: userspace OOM daemon is a separate entity which needs to be > > deployed, updated, monitored. A completely different pipeline needs to > > be built to monitor both types of OOM events and collect associated > > logs. A userspace daemon is more restricted in terms on what data is > > available to it. Implementing a daemon which can work reliably under a > > heavy memory pressure in the system is also tricky. > > > > [1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/ > > [2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/ > > [3]: https://github.com/facebookincubator/oomd > > [4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html > > > > ---- > > > > This is an RFC version, which is not intended to be merged in the current form. > > Open questions/TODOs: > > 1) Program type/attachment type for the bpf_handle_out_of_memory() hook. > > It has to be able to return a value, to be sleepable (to use cgroup iterators) > > and to have trusted arguments to pass oom_control down to bpf_oom_kill_process(). > > Current patchset has a workaround (patch "bpf: treat fmodret tracing program's > > arguments as trusted"), which is not safe. One option is to fake acquire/release > > semantics for the oom_control pointer. Other option is to introduce a completely > > new attachment or program type, similar to lsm hooks. 
> > Thinking out loud now, but rather than introducing and having a single > BPF-specific function/interface, and BPF program for that matter, > which can effectively be used to short-circuit steps from within > out_of_memory(), why not introduce a > tcp_congestion_ops/sched_ext_ops-like interface which essentially > provides a multifaceted interface for controlling OOM killing > (->select_bad_process, ->oom_kill_process, etc), optionally also from > the context of a BPF program (BPF_PROG_TYPE_STRUCT_OPS)? It's certainly an option and I thought about it. I don't think we need a bunch of hooks though. This patchset adds 2 and they belong to completely different subsystems (mm and sched/psi), so Idk how well they can be gathered into a single struct ops. But maybe it's fine. The only potentially new hook I can envision now is one to customize the oom reporting. Thanks for the suggestion!
On Mon, 28 Apr 2025 at 19:24, Roman Gushchin <roman.gushchin@linux.dev> wrote:
[...]
> It's certainly an option and I thought about it. I don't think we need a bunch of hooks though. This patchset adds 2 and they belong to completely different subsystems (mm and sched/psi), so Idk how well they can be gathered into a single struct ops. But maybe it's fine.
>
> The only potentially new hook I can envision now is one to customize the oom reporting.

If you're considering scoping it down to a particular cgroup (as you allude to in the TODO), or building a hierarchical interface, using struct_ops will be much better than fmod_ret etc., which is global in nature. Even if you don't support it now. I don't think a struct_ops is warranted only when you have more than a few callbacks. As an illustration, sched_ext started out without supporting hierarchical attachment, but will piggy-back on the struct_ops interface to do so in the near future.
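To make the attachment-model argument concrete, here is what the BPF-program side could look like, assuming the hypothetical bpf_oom_ops sketched earlier were actually exported by the kernel (and therefore visible in vmlinux.h). Again, this is only a sketch of the struct_ops conventions used by bpf_dctcp and sched_ext; because bpf_oom_ops is invented, this would not build against a real kernel, and the program and policy names are made up.

/*
 * BPF-program side of the hypothetical bpf_oom_ops sketched earlier,
 * following the usual libbpf struct_ops conventions. Illustrative only.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("struct_ops/handle_out_of_memory")
int BPF_PROG(my_handle_oom, struct oom_control *oc)
{
        /*
         * The actual policy would go here: iterate memory cgroups,
         * pick a victim, call a kfunc such as bpf_oom_kill_process(),
         * etc. Returning 0 defers to the in-kernel OOM killer.
         */
        return 0;
}

SEC(".struct_ops.link")
struct bpf_oom_ops my_oom_policy = {
        .handle_out_of_memory = (void *)my_handle_oom,
        .name = "my_oom_policy",
};

Loading and attaching would then go through the regular struct_ops path (e.g. bpf_map__attach_struct_ops() in libbpf), and that per-map attachment point is where a cgroup scope or hierarchical nesting could later be threaded through, as sched_ext intends to do.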
On Mon, Apr 28, 2025 at 6:57 PM Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote:
[...]

+1 for using struct_ops, which is the best way to enable BPF in existing use cases.

Song
On Tue, Apr 29, 2025 at 03:56:54AM +0200, Kumar Kartikeya Dwivedi wrote:
> On Mon, 28 Apr 2025 at 19:24, Roman Gushchin <roman.gushchin@linux.dev> wrote:
[...]
> If you're considering scoping it down to a particular cgroup (as you allude to in the TODO), or building a hierarchical interface, using struct_ops will be much better than fmod_ret etc., which is global in nature. Even if you don't support it now. I don't think a struct_ops is warranted only when you have more than a few callbacks. As an illustration, sched_ext started out without supporting hierarchical attachment, but will piggy-back on the struct_ops interface to do so in the near future.

Good point! I agree.

Thanks