[PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc

Roman Gushchin posted 12 patches 9 months, 2 weeks ago
Introduce the bpf_out_of_memory() bpf kfunc, which allows a bpf program
to declare an out of memory event and trigger the corresponding kernel
OOM handling mechanism.

It takes a trusted memcg pointer (or NULL for system-wide OOMs)
as an argument, as well as the page order.

Only one OOM can be declared and handled in the system at a time,
so if the function is called while another OOM is being handled,
it bails out with -EBUSY.

The function is declared as sleepable, which guarantees that it won't
be called from an atomic context. This is required by the OOM handling
code, which is not guaranteed to work in a non-blocking context.

Handling a memcg OOM almost always requires taking the css_set_lock
spinlock. The fact that bpf_out_of_memory() is sleepable also
guarantees that it can't be called with css_set_lock held, so the
kernel can't deadlock on it.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/oom_kill.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2e922e75a9df..246510572e34 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1333,6 +1333,27 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
 	return 0;
 }
 
+__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable, int order)
+{
+	struct oom_control oc = {
+		.memcg = memcg__nullable,
+		.order = order,
+	};
+	int ret = -EINVAL;
+
+	if (oc.order < 0 || oc.order > MAX_PAGE_ORDER)
+		goto out;
+
+	ret = -EBUSY;
+	if (mutex_trylock(&oom_lock)) {
+		ret = out_of_memory(&oc);
+		mutex_unlock(&oom_lock);
+	}
+
+out:
+	return ret;
+}
+
 __bpf_kfunc_end_defs();
 
 __bpf_hook_start();
@@ -1358,6 +1379,7 @@ static const struct btf_kfunc_id_set bpf_oom_hook_set = {
 
 BTF_KFUNCS_START(bpf_oom_kfuncs)
 BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE | KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_out_of_memory, KF_SLEEPABLE | KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(bpf_oom_kfuncs)
 
 static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
-- 
2.49.0.901.g37484f566f-goog
Re: [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc
Posted by Michal Hocko 9 months, 2 weeks ago
On Mon 28-04-25 03:36:15, Roman Gushchin wrote:
> Introduce bpf_out_of_memory() bpf kfunc, which allows to declare
> an out of memory events and trigger the corresponding kernel OOM
> handling mechanism.
> 
> It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> as an argument, as well as the page order.
> 
> Only one OOM can be declared and handled in the system at once,
> so if the function is called in parallel to another OOM handling,
> it bails out with -EBUSY.

This makes sense for the global OOM handler because concurrent handlers
are cooperative. But is this really correct for memcg ooms which could
happen for different hierarchies? Currently we do block on oom_lock in
that case to make sure one oom doesn't starve others. Do we want the
same behavior for custom OOM handlers?

-- 
Michal Hocko
SUSE Labs
Re: [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc
Posted by Roman Gushchin 9 months, 2 weeks ago
On Tue, Apr 29, 2025 at 01:46:07PM +0200, Michal Hocko wrote:
> On Mon 28-04-25 03:36:15, Roman Gushchin wrote:
> > Introduce bpf_out_of_memory() bpf kfunc, which allows to declare
> > an out of memory events and trigger the corresponding kernel OOM
> > handling mechanism.
> > 
> > It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> > as an argument, as well as the page order.
> > 
> > Only one OOM can be declared and handled in the system at once,
> > so if the function is called in parallel to another OOM handling,
> > it bails out with -EBUSY.
> 
> This makes sense for the global OOM handler because concurrent handlers
> are cooperative. But is this really correct for memcg ooms which could
> happen for different hierarchies? Currently we do block on oom_lock in
> that case to make sure one oom doesn't starve others. Do we want the
> same behavior for custom OOM handlers?

It's a good point and I had similar thoughts when I was working on it.
But I think it's orthogonal to the customization of the oom handling.
Even for the existing oom killer it makes no sense to serialize memcg ooms
in independent memcg subtrees. But I'm worried about the dmesg reporting:
it can become really messy with 2+ concurrent OOMs.

Also, some memory can be shared, so one OOM can eliminate a need for another
OOM, even if they look independent.

So my conclusion here is to leave things as they are until we see signs
of real-world problems with the (lack of) concurrency between ooms.

Thanks
Re: [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc
Posted by Michal Hocko 9 months, 2 weeks ago
On Tue 29-04-25 21:31:35, Roman Gushchin wrote:
> On Tue, Apr 29, 2025 at 01:46:07PM +0200, Michal Hocko wrote:
> > On Mon 28-04-25 03:36:15, Roman Gushchin wrote:
> > > Introduce bpf_out_of_memory() bpf kfunc, which allows to declare
> > > an out of memory events and trigger the corresponding kernel OOM
> > > handling mechanism.
> > > 
> > > It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> > > as an argument, as well as the page order.
> > > 
> > > Only one OOM can be declared and handled in the system at once,
> > > so if the function is called in parallel to another OOM handling,
> > > it bails out with -EBUSY.
> > 
> > This makes sense for the global OOM handler because concurrent handlers
> > are cooperative. But is this really correct for memcg ooms which could
> > happen for different hierarchies? Currently we do block on oom_lock in
> > that case to make sure one oom doesn't starve others. Do we want the
> > same behavior for custom OOM handlers?
> 
> It's a good point and I had similar thoughts when I was working on it.
> But I think it's orthogonal to the customization of the oom handling.
> Even for the existing oom killer it makes no sense to serialize memcg ooms
> in independent memcg subtrees. But I'm worried about the dmesg reporting,
> it can become really messy for 2+ concurrent OOMs.
> 
> Also, some memory can be shared, so one OOM can eliminate a need for another
> OOM, even if they look independent.
> 
> So my conclusion here is to leave things as they are until we'll get signs
> of real world problems with the (lack of) concurrency between ooms.

How do we learn about that happening though? I do not think we have any
counters to watch to suspect that some oom handlers cannot run.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc
Posted by Roman Gushchin 9 months, 2 weeks ago
On Wed, Apr 30, 2025 at 09:27:39AM +0200, Michal Hocko wrote:
> On Tue 29-04-25 21:31:35, Roman Gushchin wrote:
> > On Tue, Apr 29, 2025 at 01:46:07PM +0200, Michal Hocko wrote:
> > > On Mon 28-04-25 03:36:15, Roman Gushchin wrote:
> > > > Introduce bpf_out_of_memory() bpf kfunc, which allows to declare
> > > > an out of memory events and trigger the corresponding kernel OOM
> > > > handling mechanism.
> > > > 
> > > > It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> > > > as an argument, as well as the page order.
> > > > 
> > > > Only one OOM can be declared and handled in the system at once,
> > > > so if the function is called in parallel to another OOM handling,
> > > > it bails out with -EBUSY.
> > > 
> > > This makes sense for the global OOM handler because concurrent handlers
> > > are cooperative. But is this really correct for memcg ooms which could
> > > happen for different hierarchies? Currently we do block on oom_lock in
> > > that case to make sure one oom doesn't starve others. Do we want the
> > > same behavior for custom OOM handlers?
> > 
> > It's a good point and I had similar thoughts when I was working on it.
> > But I think it's orthogonal to the customization of the oom handling.
> > Even for the existing oom killer it makes no sense to serialize memcg ooms
> > in independent memcg subtrees. But I'm worried about the dmesg reporting,
> > it can become really messy for 2+ concurrent OOMs.
> > 
> > Also, some memory can be shared, so one OOM can eliminate a need for another
> > OOM, even if they look independent.
> > 
> > So my conclusion here is to leave things as they are until we'll get signs
> > of real world problems with the (lack of) concurrency between ooms.
> 
> How do we learn about that happening though? I do not think we have any
> counters to watch to suspect that some oom handlers cannot run.

The bpf program which declares an OOM can handle this: e.g. retry, wait
and retry, etc. We can also try to mimic the existing behavior and wait
on oom_lock (potentially splitting it into multiple locks to support
concurrent ooms in various memcgs). Do you think that would be preferable?
Re: [PATCH rfc 10/12] mm: introduce bpf_out_of_memory() bpf kfunc
Posted by Michal Hocko 9 months, 1 week ago
On Wed 30-04-25 14:53:50, Roman Gushchin wrote:
> On Wed, Apr 30, 2025 at 09:27:39AM +0200, Michal Hocko wrote:
> > On Tue 29-04-25 21:31:35, Roman Gushchin wrote:
> > > On Tue, Apr 29, 2025 at 01:46:07PM +0200, Michal Hocko wrote:
> > > > On Mon 28-04-25 03:36:15, Roman Gushchin wrote:
> > > > > Introduce bpf_out_of_memory() bpf kfunc, which allows to declare
> > > > > an out of memory events and trigger the corresponding kernel OOM
> > > > > handling mechanism.
> > > > > 
> > > > > It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> > > > > as an argument, as well as the page order.
> > > > > 
> > > > > Only one OOM can be declared and handled in the system at once,
> > > > > so if the function is called in parallel to another OOM handling,
> > > > > it bails out with -EBUSY.
> > > > 
> > > > This makes sense for the global OOM handler because concurrent handlers
> > > > are cooperative. But is this really correct for memcg ooms which could
> > > > happen for different hierarchies? Currently we do block on oom_lock in
> > > > that case to make sure one oom doesn't starve others. Do we want the
> > > > same behavior for custom OOM handlers?
> > > 
> > > It's a good point and I had similar thoughts when I was working on it.
> > > But I think it's orthogonal to the customization of the oom handling.
> > > Even for the existing oom killer it makes no sense to serialize memcg ooms
> > > in independent memcg subtrees. But I'm worried about the dmesg reporting,
> > > it can become really messy for 2+ concurrent OOMs.
> > > 
> > > Also, some memory can be shared, so one OOM can eliminate a need for another
> > > OOM, even if they look independent.
> > > 
> > > So my conclusion here is to leave things as they are until we'll get signs
> > > of real world problems with the (lack of) concurrency between ooms.
> > 
> > How do we learn about that happening though? I do not think we have any
> > counters to watch to suspect that some oom handlers cannot run.
> 
> The bpf program which declares an OOM can handle this: e.g. retry, wait
> and retry, etc. We can also try to mimick the existing behavior and wait
> on oom_lock (potentially splitting it into multiple locks to support
> concurrent ooms in various memcgs). Do you think it's preferable?

Yes, I would just provide different callbacks for global and memcg ooms
and do the blocking for the latter. It will be consistent with the
in-kernel implementation (and therefore less surprising behavior).

-- 
Michal Hocko
SUSE Labs