[PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops

Posted by Roman Gushchin 1 week, 5 days ago
Introduce a bpf struct ops for implementing custom OOM handling
policies.

It's possible to load one bpf_oom_ops for the system and one
bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
cgroup tree is traversed from the OOM'ing memcg up to the root and
corresponding BPF OOM handlers are executed until some memory is
freed. If no memory is freed, the kernel OOM killer is invoked.

The struct ops provides the bpf_handle_out_of_memory() callback, which
is expected to return 1 if it was able to free some memory and 0
otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
field of the oom_control structure, which is expected to be set by
kfuncs suitable for releasing memory (introduced later in the patch
series). If both are set, the OOM is considered handled; otherwise the
next OOM handler in the chain is executed, e.g. a BPF OOM handler
attached to the parent cgroup or the kernel OOM killer.

The bpf_handle_out_of_memory() callback program is sleepable to allow
using iterators, e.g. cgroup iterators. The callback receives struct
oom_control as an argument, so it can determine the scope of the OOM
event: whether it is a memcg-wide or system-wide OOM. It also receives
bpf_struct_ops_link as the second argument, so it can detect the
cgroup level at which this specific instance is attached.

The bpf_handle_out_of_memory() callback is executed just before the
kernel victim task selection algorithm, so all heuristics and sysctls
like panic_on_oom and sysctl_oom_kill_allocating_task are respected.

The struct ops has a name field, which allows defining a custom name
for the implemented policy. It's printed in the OOM report
("oom bpf handler: <name>"), but only if a bpf handler was invoked.
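
For illustration, a minimal BPF-side policy could look roughly like the
sketch below. This is not part of the patch: it assumes the usual libbpf
struct_ops conventions ("struct_ops.s" section for a sleepable program,
a ".struct_ops.link" map), and the memory-releasing kfuncs only appear
later in the series.

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* Sleepable OOM handler: return 1 only if some memory was freed */
SEC("struct_ops.s/handle_out_of_memory")
int BPF_PROG(handle_oom, struct oom_control *oc,
	     struct bpf_struct_ops_link *st_link)
{
	/*
	 * Inspect oc (oc->memcg, oc->order, ...) and try to release
	 * memory via one of the kfuncs added later in the series.
	 * Returning 0 defers to the next handler in the chain or to
	 * the kernel OOM killer.
	 */
	return 0;
}

SEC(".struct_ops.link")
struct bpf_oom_ops my_oom_ops = {
	.name			= "my_oom_policy",
	.handle_out_of_memory	= (void *)handle_oom,
};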

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 MAINTAINERS                     |   2 +
 include/linux/bpf-cgroup-defs.h |   3 +
 include/linux/bpf.h             |   1 +
 include/linux/bpf_oom.h         |  46 ++++++++
 include/linux/oom.h             |   8 ++
 kernel/bpf/bpf_struct_ops.c     |  12 +-
 mm/Makefile                     |   2 +-
 mm/bpf_oom.c                    | 192 ++++++++++++++++++++++++++++++++
 mm/oom_kill.c                   |  19 ++++
 9 files changed, 282 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/bpf_oom.h
 create mode 100644 mm/bpf_oom.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 491d567f7dc8..53465570c1e5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4807,7 +4807,9 @@ M:	Shakeel Butt <shakeel.butt@linux.dev>
 L:	bpf@vger.kernel.org
 L:	linux-mm@kvack.org
 S:	Maintained
+F:	include/linux/bpf_oom.h
 F:	mm/bpf_memcontrol.c
+F:	mm/bpf_oom.c
 
 BPF [MISC]
 L:	bpf@vger.kernel.org
diff --git a/include/linux/bpf-cgroup-defs.h b/include/linux/bpf-cgroup-defs.h
index 6c5e37190dad..52395834ce13 100644
--- a/include/linux/bpf-cgroup-defs.h
+++ b/include/linux/bpf-cgroup-defs.h
@@ -74,6 +74,9 @@ struct cgroup_bpf {
 	/* list of bpf struct ops links */
 	struct list_head struct_ops_links;
 
+	/* BPF OOM struct ops link */
+	struct bpf_struct_ops_link __rcu *bpf_oom_link;
+
 	/* reference counter used to detach bpf programs after cgroup removal */
 	struct percpu_ref refcnt;
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 391888eb257c..a5cee5a657b0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3944,6 +3944,7 @@ static inline bool bpf_is_subprog(const struct bpf_prog *prog)
 int bpf_prog_get_file_line(struct bpf_prog *prog, unsigned long ip, const char **filep,
 			   const char **linep, int *nump);
 struct bpf_prog *bpf_prog_find_from_stack(void);
+void *bpf_struct_ops_data(struct bpf_map *map);
 
 int bpf_insn_array_init(struct bpf_map *map, const struct bpf_prog *prog);
 int bpf_insn_array_ready(struct bpf_map *map);
diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
new file mode 100644
index 000000000000..c81133145c50
--- /dev/null
+++ b/include/linux/bpf_oom.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef __BPF_OOM_H
+#define __BPF_OOM_H
+
+struct oom_control;
+
+#define BPF_OOM_NAME_MAX_LEN 64
+
+struct bpf_oom_ops {
+	/**
+	 * @handle_out_of_memory: Out of memory bpf handler, called before
+	 * the in-kernel OOM killer.
+	 * @oc: OOM control structure
+	 * @st_link: struct ops link
+	 *
+	 * Should return 1 if some memory was freed up, otherwise
+	 * the in-kernel OOM killer is invoked.
+	 */
+	int (*handle_out_of_memory)(struct oom_control *oc,
+				    struct bpf_struct_ops_link *st_link);
+
+	/**
+	 * @name: BPF OOM policy name
+	 */
+	char name[BPF_OOM_NAME_MAX_LEN];
+};
+
+#ifdef CONFIG_BPF_SYSCALL
+/**
+ * @bpf_handle_oom: handle out of memory condition using bpf
+ * @oc: OOM control structure
+ *
+ * Returns true if some memory was freed.
+ */
+bool bpf_handle_oom(struct oom_control *oc);
+
+#else /* CONFIG_BPF_SYSCALL */
+static inline bool bpf_handle_oom(struct oom_control *oc)
+{
+	return false;
+}
+
+#endif /* CONFIG_BPF_SYSCALL */
+
+#endif /* __BPF_OOM_H */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7b02bc1d0a7e..c2dce336bcb4 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -51,6 +51,14 @@ struct oom_control {
 
 	/* Used to print the constraint info. */
 	enum oom_constraint constraint;
+
+#ifdef CONFIG_BPF_SYSCALL
+	/* Used by the bpf oom implementation to mark the forward progress */
+	bool bpf_memory_freed;
+
+	/* Handler name */
+	const char *bpf_handler_name;
+#endif
 };
 
 extern struct mutex oom_lock;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 2e361e22cfa0..6285a6d56b98 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -1009,7 +1009,7 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
 	 * in the tramopline image to finish before releasing
 	 * the trampoline image.
 	 */
-	synchronize_rcu_mult(call_rcu, call_rcu_tasks);
+	synchronize_rcu_mult(call_rcu, call_rcu_tasks, call_rcu_tasks_trace);
 
 	__bpf_struct_ops_map_free(map);
 }
@@ -1226,7 +1226,8 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
 	if (st_link->cgroup)
 		cgroup_bpf_detach_struct_ops(st_link->cgroup, st_link);
 
-	kfree(st_link);
+	synchronize_rcu_tasks_trace();
+	kfree_rcu(st_link, link.rcu);
 }
 
 static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
@@ -1535,3 +1536,10 @@ void bpf_map_struct_ops_info_fill(struct bpf_map_info *info, struct bpf_map *map
 
 	info->btf_vmlinux_id = btf_obj_id(st_map->btf);
 }
+
+void *bpf_struct_ops_data(struct bpf_map *map)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+
+	return &st_map->kvalue.data;
+}
diff --git a/mm/Makefile b/mm/Makefile
index bf46fe31dc14..e939525ba01b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -107,7 +107,7 @@ ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
 ifdef CONFIG_BPF_SYSCALL
-obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
+obj-$(CONFIG_MEMCG) += bpf_memcontrol.o bpf_oom.o
 endif
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_GUP_TEST) += gup_test.o
diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
new file mode 100644
index 000000000000..ea70be6e2c26
--- /dev/null
+++ b/mm/bpf_oom.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * BPF-driven OOM killer customization
+ *
+ * Author: Roman Gushchin <roman.gushchin@linux.dev>
+ */
+
+#include <linux/bpf.h>
+#include <linux/oom.h>
+#include <linux/bpf_oom.h>
+#include <linux/bpf-cgroup.h>
+#include <linux/cgroup.h>
+#include <linux/memcontrol.h>
+#include <linux/uaccess.h>
+
+static int bpf_ops_handle_oom(struct bpf_oom_ops *bpf_oom_ops,
+			      struct bpf_struct_ops_link *st_link,
+			      struct oom_control *oc)
+{
+	int ret;
+
+	oc->bpf_handler_name = &bpf_oom_ops->name[0];
+	oc->bpf_memory_freed = false;
+	pagefault_disable();
+	ret = bpf_oom_ops->handle_out_of_memory(oc, st_link);
+	pagefault_enable();
+	oc->bpf_handler_name = NULL;
+
+	return ret;
+}
+
+bool bpf_handle_oom(struct oom_control *oc)
+{
+	struct bpf_struct_ops_link *st_link;
+	struct bpf_oom_ops *bpf_oom_ops;
+	struct mem_cgroup *memcg;
+	struct bpf_map *map;
+	int ret = 0;
+
+	/*
+	 * System-wide OOMs are handled by the struct ops attached
+	 * to the root memory cgroup
+	 */
+	memcg = oc->memcg ? oc->memcg : root_mem_cgroup;
+
+	rcu_read_lock_trace();
+
+	/* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		st_link = rcu_dereference_check(memcg->css.cgroup->bpf.bpf_oom_link,
+						rcu_read_lock_trace_held());
+		if (!st_link)
+			continue;
+
+		map = rcu_dereference_check((st_link->map),
+					    rcu_read_lock_trace_held());
+		if (!map)
+			continue;
+
+		/* Call BPF OOM handler */
+		bpf_oom_ops = bpf_struct_ops_data(map);
+		ret = bpf_ops_handle_oom(bpf_oom_ops, st_link, oc);
+		if (ret && oc->bpf_memory_freed)
+			break;
+		ret = 0;
+	}
+
+	rcu_read_unlock_trace();
+
+	return ret && oc->bpf_memory_freed;
+}
+
+static int __handle_out_of_memory(struct oom_control *oc,
+				  struct bpf_struct_ops_link *st_link)
+{
+	return 0;
+}
+
+static struct bpf_oom_ops __bpf_oom_ops = {
+	.handle_out_of_memory = __handle_out_of_memory,
+};
+
+static const struct bpf_func_proto *
+bpf_oom_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return tracing_prog_func_proto(func_id, prog);
+}
+
+static bool bpf_oom_ops_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_verifier_ops bpf_oom_verifier_ops = {
+	.get_func_proto = bpf_oom_func_proto,
+	.is_valid_access = bpf_oom_ops_is_valid_access,
+};
+
+static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
+	struct cgroup *cgrp;
+
+	/* The link is not yet fully initialized, but cgroup should be set */
+	if (!link)
+		return -EOPNOTSUPP;
+
+	cgrp = st_link->cgroup;
+	if (!cgrp)
+		return -EINVAL;
+
+	if (cmpxchg(&cgrp->bpf.bpf_oom_link, NULL, st_link))
+		return -EEXIST;
+
+	return 0;
+}
+
+static void bpf_oom_ops_unreg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
+	struct cgroup *cgrp;
+
+	if (!link)
+		return;
+
+	cgrp = st_link->cgroup;
+	if (!cgrp)
+		return;
+
+	WARN_ON(cmpxchg(&cgrp->bpf.bpf_oom_link, st_link, NULL) != st_link);
+}
+
+static int bpf_oom_ops_check_member(const struct btf_type *t,
+				    const struct btf_member *member,
+				    const struct bpf_prog *prog)
+{
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+	switch (moff) {
+	case offsetof(struct bpf_oom_ops, handle_out_of_memory):
+		if (!prog)
+			return -EINVAL;
+		break;
+	}
+
+	return 0;
+}
+
+static int bpf_oom_ops_init_member(const struct btf_type *t,
+				   const struct btf_member *member,
+				   void *kdata, const void *udata)
+{
+	const struct bpf_oom_ops *uops = udata;
+	struct bpf_oom_ops *ops = kdata;
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+	switch (moff) {
+	case offsetof(struct bpf_oom_ops, name):
+		if (uops->name[0])
+			strscpy_pad(ops->name, uops->name, sizeof(ops->name));
+		else
+			strscpy_pad(ops->name, "bpf_defined_policy");
+		return 1;
+	}
+	return 0;
+}
+
+static int bpf_oom_ops_init(struct btf *btf)
+{
+	return 0;
+}
+
+static struct bpf_struct_ops bpf_oom_bpf_ops = {
+	.verifier_ops = &bpf_oom_verifier_ops,
+	.reg = bpf_oom_ops_reg,
+	.unreg = bpf_oom_ops_unreg,
+	.check_member = bpf_oom_ops_check_member,
+	.init_member = bpf_oom_ops_init_member,
+	.init = bpf_oom_ops_init,
+	.name = "bpf_oom_ops",
+	.owner = THIS_MODULE,
+	.cfi_stubs = &__bpf_oom_ops
+};
+
+static int __init bpf_oom_struct_ops_init(void)
+{
+	return register_bpf_struct_ops(&bpf_oom_bpf_ops, bpf_oom_ops);
+}
+late_initcall(bpf_oom_struct_ops_init);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5eb11fbba704..44bbcf033804 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -45,6 +45,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/cred.h>
 #include <linux/nmi.h>
+#include <linux/bpf_oom.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -246,6 +247,15 @@ static const char * const oom_constraint_text[] = {
 	[CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG",
 };
 
+static const char *oom_handler_name(struct oom_control *oc)
+{
+#ifdef CONFIG_BPF_SYSCALL
+	if (oc->bpf_handler_name)
+		return oc->bpf_handler_name;
+#endif
+	return NULL;
+}
+
 /*
  * Determine the type of allocation constraint.
  */
@@ -461,6 +471,8 @@ static void dump_header(struct oom_control *oc)
 	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
 		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
 			current->signal->oom_score_adj);
+	if (oom_handler_name(oc))
+		pr_warn("oom bpf handler: %s\n", oom_handler_name(oc));
 	if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
 		pr_warn("COMPACTION is disabled!!!\n");
 
@@ -1168,6 +1180,13 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	/*
+	 * Let bpf handle the OOM first. If it was able to free up some memory,
+	 * bail out. Otherwise fall back to the kernel OOM killer.
+	 */
+	if (bpf_handle_oom(oc))
+		return true;
+
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
-- 
2.52.0
Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
Posted by Martin KaFai Lau 1 week, 2 days ago
On 1/26/26 6:44 PM, Roman Gushchin wrote:
> +bool bpf_handle_oom(struct oom_control *oc)
> +{
> +	struct bpf_struct_ops_link *st_link;
> +	struct bpf_oom_ops *bpf_oom_ops;
> +	struct mem_cgroup *memcg;
> +	struct bpf_map *map;
> +	int ret = 0;
> +
> +	/*
> +	 * System-wide OOMs are handled by the struct ops attached
> +	 * to the root memory cgroup
> +	 */
> +	memcg = oc->memcg ? oc->memcg : root_mem_cgroup;
> +
> +	rcu_read_lock_trace();
> +
> +	/* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> +		st_link = rcu_dereference_check(memcg->css.cgroup->bpf.bpf_oom_link,
> +						rcu_read_lock_trace_held());
> +		if (!st_link)
> +			continue;
> +
> +		map = rcu_dereference_check((st_link->map),
> +					    rcu_read_lock_trace_held());
> +		if (!map)
> +			continue;
> +
> +		/* Call BPF OOM handler */
> +		bpf_oom_ops = bpf_struct_ops_data(map);
> +		ret = bpf_ops_handle_oom(bpf_oom_ops, st_link, oc);
> +		if (ret && oc->bpf_memory_freed)
> +			break;
> +		ret = 0;
> +	}
> +
> +	rcu_read_unlock_trace();
> +
> +	return ret && oc->bpf_memory_freed;
> +}
> +

[ ... ]

> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
> +	struct cgroup *cgrp;
> +
> +	/* The link is not yet fully initialized, but cgroup should be set */
> +	if (!link)
> +		return -EOPNOTSUPP;
> +
> +	cgrp = st_link->cgroup;
> +	if (!cgrp)
> +		return -EINVAL;
> +
> +	if (cmpxchg(&cgrp->bpf.bpf_oom_link, NULL, st_link))
> +		return -EEXIST;
iiuc, this will allow only one oom_ops to be attached to a cgroup. 
Considering oom_ops is the only user of the cgrp->bpf.struct_ops_links 
(added in patch 2), the list should have only one element for now.

Copy some context from the patch 2 commit log.

 > This change doesn't answer the question how bpf programs belonging
 > to these struct ops'es will be executed. It will be done individually
 > for every bpf struct ops which supports this.
 >
 > Please, note that unlike "normal" bpf programs, struct ops'es
 > are not propagated to cgroup sub-trees.

There are NONE, BPF_F_ALLOW_OVERRIDE, and BPF_F_ALLOW_MULTI; which one 
may be closer to the bpf_handle_oom() semantic? If it needs to change 
the ordering (or allow multi) in the future, does it need a new flag or 
can the existing BPF_F_xxx flags be used?
Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
Posted by Roman Gushchin 1 week, 1 day ago
Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 1/26/26 6:44 PM, Roman Gushchin wrote:
>> +bool bpf_handle_oom(struct oom_control *oc)
>> +{
>> +	struct bpf_struct_ops_link *st_link;
>> +	struct bpf_oom_ops *bpf_oom_ops;
>> +	struct mem_cgroup *memcg;
>> +	struct bpf_map *map;
>> +	int ret = 0;
>> +
>> +	/*
>> +	 * System-wide OOMs are handled by the struct ops attached
>> +	 * to the root memory cgroup
>> +	 */
>> +	memcg = oc->memcg ? oc->memcg : root_mem_cgroup;
>> +
>> +	rcu_read_lock_trace();
>> +
>> +	/* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
>> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> +		st_link = rcu_dereference_check(memcg->css.cgroup->bpf.bpf_oom_link,
>> +						rcu_read_lock_trace_held());
>> +		if (!st_link)
>> +			continue;
>> +
>> +		map = rcu_dereference_check((st_link->map),
>> +					    rcu_read_lock_trace_held());
>> +		if (!map)
>> +			continue;
>> +
>> +		/* Call BPF OOM handler */
>> +		bpf_oom_ops = bpf_struct_ops_data(map);
>> +		ret = bpf_ops_handle_oom(bpf_oom_ops, st_link, oc);
>> +		if (ret && oc->bpf_memory_freed)
>> +			break;
>> +		ret = 0;
>> +	}
>> +
>> +	rcu_read_unlock_trace();
>> +
>> +	return ret && oc->bpf_memory_freed;
>> +}
>> +
>
> [ ... ]
>
>> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
>> +{
>> +	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
>> +	struct cgroup *cgrp;
>> +
>> +	/* The link is not yet fully initialized, but cgroup should be set */
>> +	if (!link)
>> +		return -EOPNOTSUPP;
>> +
>> +	cgrp = st_link->cgroup;
>> +	if (!cgrp)
>> +		return -EINVAL;
>> +
>> +	if (cmpxchg(&cgrp->bpf.bpf_oom_link, NULL, st_link))
>> +		return -EEXIST;
> iiuc, this will allow only one oom_ops to be attached to a
> cgroup. Considering oom_ops is the only user of the
> cgrp->bpf.struct_ops_links (added in patch 2), the list should have
> only one element for now.
>
> Copy some context from the patch 2 commit log.

Hi Martin!

Sorry, I'm not quite sure what you mean, can you please elaborate
more?

We decided (in conversations at LPC) that 1 bpf oom policy for
memcg is good for now (with a potential to extend in the future, if
there will be use cases). But it seems like there is a lot of interest
to attach struct ops'es to cgroups (there are already a couple of
patchsets posted based on my earlier v2 patches), so I tried to make the
bpf link mechanics suitable for multiple use cases from scratch.

Did I answer your question?

>
>> This change doesn't answer the question how bpf programs belonging
>> to these struct ops'es will be executed. It will be done individually
>> for every bpf struct ops which supports this.
>>
>> Please, note that unlike "normal" bpf programs, struct ops'es
>> are not propagated to cgroup sub-trees.
>
> There are NONE, BPF_F_ALLOW_OVERRIDE, and BPF_F_ALLOW_MULTI, which one
> may be closer to the bpf_handle_oom() semantic. If it needs to change
> the ordering (or allow multi) in the future, does it need a new flag
> or the existing BPF_F_xxx flags can be used.

I hope that existing flags can be used, but also I'm not sure we ever
would need multiple oom handlers per cgroup. Do you have any specific
concerns here?

Thanks!
Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
Posted by Martin KaFai Lau 5 days, 20 hours ago

On 1/30/26 3:29 PM, Roman Gushchin wrote:
> Martin KaFai Lau <martin.lau@linux.dev> writes:
> 
>> On 1/26/26 6:44 PM, Roman Gushchin wrote:
>>> +bool bpf_handle_oom(struct oom_control *oc)
>>> +{
>>> +	struct bpf_struct_ops_link *st_link;
>>> +	struct bpf_oom_ops *bpf_oom_ops;
>>> +	struct mem_cgroup *memcg;
>>> +	struct bpf_map *map;
>>> +	int ret = 0;
>>> +
>>> +	/*
>>> +	 * System-wide OOMs are handled by the struct ops attached
>>> +	 * to the root memory cgroup
>>> +	 */
>>> +	memcg = oc->memcg ? oc->memcg : root_mem_cgroup;
>>> +
>>> +	rcu_read_lock_trace();
>>> +
>>> +	/* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
>>> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>>> +		st_link = rcu_dereference_check(memcg->css.cgroup->bpf.bpf_oom_link,
>>> +						rcu_read_lock_trace_held());
>>> +		if (!st_link)
>>> +			continue;
>>> +
>>> +		map = rcu_dereference_check((st_link->map),
>>> +					    rcu_read_lock_trace_held());
>>> +		if (!map)
>>> +			continue;
>>> +
>>> +		/* Call BPF OOM handler */
>>> +		bpf_oom_ops = bpf_struct_ops_data(map);
>>> +		ret = bpf_ops_handle_oom(bpf_oom_ops, st_link, oc);
>>> +		if (ret && oc->bpf_memory_freed)
>>> +			break;
>>> +		ret = 0;
>>> +	}
>>> +
>>> +	rcu_read_unlock_trace();
>>> +
>>> +	return ret && oc->bpf_memory_freed;
>>> +}
>>> +
>>
>> [ ... ]
>>
>>> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
>>> +{
>>> +	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
>>> +	struct cgroup *cgrp;
>>> +
>>> +	/* The link is not yet fully initialized, but cgroup should be set */
>>> +	if (!link)
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	cgrp = st_link->cgroup;
>>> +	if (!cgrp)
>>> +		return -EINVAL;
>>> +
>>> +	if (cmpxchg(&cgrp->bpf.bpf_oom_link, NULL, st_link))
>>> +		return -EEXIST;
>> iiuc, this will allow only one oom_ops to be attached to a
>> cgroup. Considering oom_ops is the only user of the
>> cgrp->bpf.struct_ops_links (added in patch 2), the list should have
>> only one element for now.
>>
>> Copy some context from the patch 2 commit log.
> 
> Hi Martin!
> 
> Sorry, I'm not quite sure what do you mean, can you please elaborate
> more?
> 
> We decided (in conversations at LPC) that 1 bpf oom policy for
> memcg is good for now (with a potential to extend in the future, if
> there will be use cases). But it seems like there is a lot of interest
> to attach struct ops'es to cgroups (there are already a couple of
> patchsets posted based on my earlier v2 patches), so I tried to make the
> bpf link mechanics suitable for multiple use cases from scratch.
> 
> Did I answer your question?

Got it. The link list is for the future struct_ops implementations to 
attach to a cgroup.

I should have mentioned the context. My bad.

BPF_PROG_TYPE_SOCK_OPS is currently a cgroup BPF prog. I am thinking of 
adding bpf_struct_ops support to provide similar hooks to the ones in 
BPF_PROG_TYPE_SOCK_OPS. There are some issues that need to be worked 
out. A major one is that the current cgroup progs have expectations on 
the ordering and override behavior based on the BPF_F_* and the runtime 
cgroup hierarchy. I was trying to see if there are pieces in this set 
that can be built upon. The linked list is a start but will need more 
work to make it performant for networking use.

> 
>>
>>> This change doesn't answer the question how bpf programs belonging
>>> to these struct ops'es will be executed. It will be done individually
>>> for every bpf struct ops which supports this.
>>>
>>> Please, note that unlike "normal" bpf programs, struct ops'es
>>> are not propagated to cgroup sub-trees.
>>
>> There are NONE, BPF_F_ALLOW_OVERRIDE, and BPF_F_ALLOW_MULTI, which one
>> may be closer to the bpf_handle_oom() semantic. If it needs to change
>> the ordering (or allow multi) in the future, does it need a new flag
>> or the existing BPF_F_xxx flags can be used.
> 
> I hope that existing flags can be used, but also I'm not sure we ever
> would need multiple oom handlers per cgroup. Do you have any specific
> concerns here?

Another question that I have is the default behavior when none of the 
BPF_F_* is specified when attaching a struct_ops to a cgroup.

From uapi/bpf.h:

* NONE (default): No further BPF programs allowed in the subtree

iiuc, the bpf_handle_oom() is not the same as NONE. Should each 
struct_ops implementation have its own default policy? For the 
BPF_PROG_TYPE_SOCK_OPS work, I am thinking the default policy should be 
BPF_F_ALLOW_MULTI which is always on/set now in the 
cgroup_bpf_link_attach().
Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
Posted by Michal Hocko 1 week, 4 days ago
One additional point I forgot to mention previously

On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
> @@ -1168,6 +1180,13 @@ bool out_of_memory(struct oom_control *oc)
>  		return true;
>  	}
>  
> +	/*
> +	 * Let bpf handle the OOM first. If it was able to free up some memory,
> +	 * bail out. Otherwise fall back to the kernel OOM killer.
> +	 */
> +	if (bpf_handle_oom(oc))
> +		return true;
> +
>  	select_bad_process(oc);
>  	/* Found nothing?!?! */
>  	if (!oc->chosen) {

Should this check for is_sysrq_oom and always use the in kernel OOM
handling for Sysrq triggered ooms as a failsafe measure?
-- 
Michal Hocko
SUSE Labs
Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
Posted by Roman Gushchin 1 week, 3 days ago
Michal Hocko <mhocko@suse.com> writes:

> Once additional point I forgot to mention previously
>
> On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
>> @@ -1168,6 +1180,13 @@ bool out_of_memory(struct oom_control *oc)
>>  		return true;
>>  	}
>>  
>> +	/*
>> +	 * Let bpf handle the OOM first. If it was able to free up some memory,
>> +	 * bail out. Otherwise fall back to the kernel OOM killer.
>> +	 */
>> +	if (bpf_handle_oom(oc))
>> +		return true;
>> +
>>  	select_bad_process(oc);
>>  	/* Found nothing?!?! */
>>  	if (!oc->chosen) {
>
> Should this check for is_sysrq_oom and always use the in kernel OOM
> handling for Sysrq triggered ooms as a failsafe measure?

Yep, good point. Will implement in v4.
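
Probably something along these lines (untested sketch on top of the
hunk above; is_sysrq_oom() is the existing helper in mm/oom_kill.c):

	/*
	 * Let bpf handle the OOM first, except for sysrq-triggered OOMs,
	 * which always go straight to the in-kernel OOM killer.
	 */
	if (!is_sysrq_oom(oc) && bpf_handle_oom(oc))
		return true;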

Thanks!
Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
Posted by Josh Don 1 week, 4 days ago
Thanks Roman!

On Mon, Jan 26, 2026 at 6:51 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Introduce a bpf struct ops for implementing custom OOM handling
> policies.
>
> +bool bpf_handle_oom(struct oom_control *oc)
> +{
> +       struct bpf_struct_ops_link *st_link;
> +       struct bpf_oom_ops *bpf_oom_ops;
> +       struct mem_cgroup *memcg;
> +       struct bpf_map *map;
> +       int ret = 0;
> +
> +       /*
> +        * System-wide OOMs are handled by the struct ops attached
> +        * to the root memory cgroup
> +        */
> +       memcg = oc->memcg ? oc->memcg : root_mem_cgroup;
> +
> +       rcu_read_lock_trace();
> +
> +       /* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
> +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> +               st_link = rcu_dereference_check(memcg->css.cgroup->bpf.bpf_oom_link,
> +                                               rcu_read_lock_trace_held());
> +               if (!st_link)
> +                       continue;
> +
> +               map = rcu_dereference_check((st_link->map),
> +                                           rcu_read_lock_trace_held());
> +               if (!map)
> +                       continue;
> +
> +               /* Call BPF OOM handler */
> +               bpf_oom_ops = bpf_struct_ops_data(map);
> +               ret = bpf_ops_handle_oom(bpf_oom_ops, st_link, oc);
> +               if (ret && oc->bpf_memory_freed)
> +                       break;
> +               ret = 0;
> +       }
> +
> +       rcu_read_unlock_trace();
> +
> +       return ret && oc->bpf_memory_freed;

If bpf claims to have freed memory but didn't actually do so, that
seems like something potentially worth alerting to. Perhaps something
to add to the oom header output?
Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
Posted by Roman Gushchin 1 week, 3 days ago
Josh Don <joshdon@google.com> writes:

> Thanks Roman!
>
> On Mon, Jan 26, 2026 at 6:51 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> Introduce a bpf struct ops for implementing custom OOM handling
>> policies.
>>
>> +bool bpf_handle_oom(struct oom_control *oc)
>> +{
>> +       struct bpf_struct_ops_link *st_link;
>> +       struct bpf_oom_ops *bpf_oom_ops;
>> +       struct mem_cgroup *memcg;
>> +       struct bpf_map *map;
>> +       int ret = 0;
>> +
>> +       /*
>> +        * System-wide OOMs are handled by the struct ops attached
>> +        * to the root memory cgroup
>> +        */
>> +       memcg = oc->memcg ? oc->memcg : root_mem_cgroup;
>> +
>> +       rcu_read_lock_trace();
>> +
>> +       /* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
>> +       for (; memcg; memcg = parent_mem_cgroup(memcg)) {
>> +               st_link = rcu_dereference_check(memcg->css.cgroup->bpf.bpf_oom_link,
>> +                                               rcu_read_lock_trace_held());
>> +               if (!st_link)
>> +                       continue;
>> +
>> +               map = rcu_dereference_check((st_link->map),
>> +                                           rcu_read_lock_trace_held());
>> +               if (!map)
>> +                       continue;
>> +
>> +               /* Call BPF OOM handler */
>> +               bpf_oom_ops = bpf_struct_ops_data(map);
>> +               ret = bpf_ops_handle_oom(bpf_oom_ops, st_link, oc);
>> +               if (ret && oc->bpf_memory_freed)
>> +                       break;
>> +               ret = 0;
>> +       }
>> +
>> +       rcu_read_unlock_trace();
>> +
>> +       return ret && oc->bpf_memory_freed;
>
> If bpf claims to have freed memory but didn't actually do so, that
> seems like something potentially worth alerting to. Perhaps something
> to add to the oom header output?

Michal pointed at a more fundamental problem: if a bpf handler performed
some action (e.g. killed a process), how do we safely allow other
handlers to bail out without performing redundant destructive operations?
Right now it works by marking victim processes, so that subsequent kernel
oom handlers just bail out if they see a marked process.

I don't know how to extend this to generic actions. E.g. we could have an
atomic counter attached to the bpf oom instance (link) and bump it on
performing a destructive operation, but it's not clear when to clear it.

So maybe it's not worth it at all and it's better to drop this
protection mechanism altogether.

Thanks!
Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
Posted by Michal Hocko 1 week, 5 days ago
On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
> Introduce a bpf struct ops for implementing custom OOM handling
> policies.
> 
> It's possible to load one bpf_oom_ops for the system and one
> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
> cgroup tree is traversed from the OOM'ing memcg up to the root and
> corresponding BPF OOM handlers are executed until some memory is
> freed. If no memory is freed, the kernel OOM killer is invoked.
> 
> The struct ops provides the bpf_handle_out_of_memory() callback,
> which expected to return 1 if it was able to free some memory and 0
> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
> field of the oom_control structure, which is expected to be set by
> kfuncs suitable for releasing memory (which will be introduced later
> in the patch series). If both are set, OOM is considered handled,
> otherwise the next OOM handler in the chain is executed: e.g. BPF OOM
> attached to the parent cgroup or the kernel OOM killer.

I still find this dual reporting a bit confusing. I can see your
intention in having a pre-defined "releasers" of the memory to trust BPF
handlers more but they do have access to oc->bpf_memory_freed so they
can manipulate it. Therefore an additional level of protection is rather
weak. 

It is also not really clear to me how this works while there is OOM
victim on the way out. (i.e. tsk_is_oom_victim() -> abort case). This
will result in no killing therefore no bpf_memory_freed, right? Handler
itself should consider its work done. How exactly is this handled.

Also is there any way to handle the oom by increasing the memcg limit?
I do not see a callback for that.
-- 
Michal Hocko
SUSE Labs
Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
Posted by Roman Gushchin 1 week, 4 days ago
Michal Hocko <mhocko@suse.com> writes:

> On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
>> Introduce a bpf struct ops for implementing custom OOM handling
>> policies.
>> 
>> It's possible to load one bpf_oom_ops for the system and one
>> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
>> cgroup tree is traversed from the OOM'ing memcg up to the root and
>> corresponding BPF OOM handlers are executed until some memory is
>> freed. If no memory is freed, the kernel OOM killer is invoked.
>> 
>> The struct ops provides the bpf_handle_out_of_memory() callback,
>> which expected to return 1 if it was able to free some memory and 0
>> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
>> field of the oom_control structure, which is expected to be set by
>> kfuncs suitable for releasing memory (which will be introduced later
>> in the patch series). If both are set, OOM is considered handled,
>> otherwise the next OOM handler in the chain is executed: e.g. BPF OOM
>> attached to the parent cgroup or the kernel OOM killer.
>
> I still find this dual reporting a bit confusing. I can see your
> intention in having a pre-defined "releasers" of the memory to trust BPF
> handlers more but they do have access to oc->bpf_memory_freed so they
> can manipulate it. Therefore an additional level of protection is rather
> weak.

No, they can't. They have only a read-only access.

> It is also not really clear to me how this works while there is OOM
> victim on the way out. (i.e. tsk_is_oom_victim() -> abort case). This
> will result in no killing therefore no bpf_memory_freed, right? Handler
> itself should consider its work done. How exactly is this handled.

It's a good question, I see your point...
Basically we want to give a handler an option to exit with "I promise,
some memory will be freed soon" without doing anything destructive,
while keeping it safe at the same time.

I don't have a perfect answer off the top of my head, maybe some sort of a
rate-limiter/counter might work? E.g. a handler can promise this N times
before the kernel kicks in? Any ideas?

> Also is there any way to handle the oom by increasing the memcg limit?
> I do not see a callback for that.

There is no kfunc yet, but it's a good idea (which we accidentally
discussed few days ago). I'll implement it.

Thank you!
Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
Posted by Matt Bobrowski 6 days, 12 hours ago
On Tue, Jan 27, 2026 at 09:12:56PM +0000, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
> >> Introduce a bpf struct ops for implementing custom OOM handling
> >> policies.
> >> 
> >> It's possible to load one bpf_oom_ops for the system and one
> >> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
> >> cgroup tree is traversed from the OOM'ing memcg up to the root and
> >> corresponding BPF OOM handlers are executed until some memory is
> >> freed. If no memory is freed, the kernel OOM killer is invoked.
> >> 
> >> The struct ops provides the bpf_handle_out_of_memory() callback,
> >> which expected to return 1 if it was able to free some memory and 0
> >> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
> >> field of the oom_control structure, which is expected to be set by
> >> kfuncs suitable for releasing memory (which will be introduced later
> >> in the patch series). If both are set, OOM is considered handled,
> >> otherwise the next OOM handler in the chain is executed: e.g. BPF OOM
> >> attached to the parent cgroup or the kernel OOM killer.
> >
> > I still find this dual reporting a bit confusing. I can see your
> > intention in having a pre-defined "releasers" of the memory to trust BPF
> > handlers more but they do have access to oc->bpf_memory_freed so they
> > can manipulate it. Therefore an additional level of protection is rather
> > weak.
> 
> No, they can't. They have only a read-only access.
> 
> > It is also not really clear to me how this works while there is OOM
> > victim on the way out. (i.e. tsk_is_oom_victim() -> abort case). This
> > will result in no killing therefore no bpf_memory_freed, right? Handler
> > itself should consider its work done. How exactly is this handled.
> 
> It's a good question, I see your point...
> Basically we want to give a handler an option to exit with "I promise,
> some memory will be freed soon" without doing anything destructive.
> But keeping it save at the same time.
> 
> I don't have a perfect answer out of my head, maybe some sort of a
> rate-limiter/counter might work? E.g. a handler can promise this N times
> before the kernel kicks in? Any ideas?
> 
> > Also is there any way to handle the oom by increasing the memcg limit?
> > I do not see a callback for that.
> 
> There is no kfunc yet, but it's a good idea (which we accidentally
> discussed few days ago). I'll implement it.

Yes, please, this is something that I had mentioned to you the other
day too. With this kind of BPF kfunc, we'll basically be able to
handle memcg scoped OOM events inline without necessarily being forced
to kill off anything.
Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
Posted by Michal Hocko 1 week, 4 days ago
On Tue 27-01-26 21:12:56, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
> >> Introduce a bpf struct ops for implementing custom OOM handling
> >> policies.
> >> 
> >> It's possible to load one bpf_oom_ops for the system and one
> >> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
> >> cgroup tree is traversed from the OOM'ing memcg up to the root and
> >> corresponding BPF OOM handlers are executed until some memory is
> >> freed. If no memory is freed, the kernel OOM killer is invoked.
> >> 
> >> The struct ops provides the bpf_handle_out_of_memory() callback,
> >> which expected to return 1 if it was able to free some memory and 0
> >> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
> >> field of the oom_control structure, which is expected to be set by
> >> kfuncs suitable for releasing memory (which will be introduced later
> >> in the patch series). If both are set, OOM is considered handled,
> >> otherwise the next OOM handler in the chain is executed: e.g. BPF OOM
> >> attached to the parent cgroup or the kernel OOM killer.
> >
> > I still find this dual reporting a bit confusing. I can see your
> > intention in having a pre-defined "releasers" of the memory to trust BPF
> > handlers more but they do have access to oc->bpf_memory_freed so they
> > can manipulate it. Therefore an additional level of protection is rather
> > weak.
> 
> No, they can't. They have only a read-only access.

Could you explain this a bit more. This must be some BPF magic because
they are getting a standard pointer to oom_control.
 
> > It is also not really clear to me how this works while there is OOM
> > victim on the way out. (i.e. tsk_is_oom_victim() -> abort case). This
> > will result in no killing therefore no bpf_memory_freed, right? Handler
> > itself should consider its work done. How exactly is this handled.
> 
> It's a good question, I see your point...
> Basically we want to give a handler an option to exit with "I promise,
> some memory will be freed soon" without doing anything destructive.
> But keeping it save at the same time.

Yes, something like OOM_BACKOFF, OOM_PROCESSED, OOM_FAILED.

> I don't have a perfect answer out of my head, maybe some sort of a
> rate-limiter/counter might work? E.g. a handler can promise this N times
> before the kernel kicks in? Any ideas?

Counters usually do not work very well for async operations. In this
case there is the oom_reaper and/or task exit to finish the oom operation.
The former is bound and guaranteed to make forward progress but there
is no time frame to assume when that happens as it depends on how many
tasks might be queued (usually a single one but this is not something to
rely on because of concurrent ooms in memcgs and also multiple tasks
could be killed at the same time).

Another complication is that there are multiple levels of OOM to track
(global, NUMA, memcg) so any watchdog would have to be aware of that as
well. I am really wondering whether we really need to be so careful with
handlers. It is not like you would allow any random oom handler to be
loaded, right? Would it make sense to start without this protection and
converge to something as we see how this evolves? Maybe this will raise
the bar for oom handlers as the price for bugs is going to be really
high.

> > Also is there any way to handle the oom by increasing the memcg limit?
> > I do not see a callback for that.
> 
> There is no kfunc yet, but it's a good idea (which we accidentally
> discussed few days ago). I'll implement it.

Cool!
-- 
Michal Hocko
SUSE Labs
Re: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
Posted by Roman Gushchin 1 week, 3 days ago
Michal Hocko <mhocko@suse.com> writes:

> On Tue 27-01-26 21:12:56, Roman Gushchin wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Mon 26-01-26 18:44:10, Roman Gushchin wrote:
>> >> Introduce a bpf struct ops for implementing custom OOM handling
>> >> policies.
>> >> 
>> >> It's possible to load one bpf_oom_ops for the system and one
>> >> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
>> >> cgroup tree is traversed from the OOM'ing memcg up to the root and
>> >> corresponding BPF OOM handlers are executed until some memory is
>> >> freed. If no memory is freed, the kernel OOM killer is invoked.
>> >> 
>> >> The struct ops provides the bpf_handle_out_of_memory() callback,
>> >> which expected to return 1 if it was able to free some memory and 0
>> >> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
>> >> field of the oom_control structure, which is expected to be set by
>> >> kfuncs suitable for releasing memory (which will be introduced later
>> >> in the patch series). If both are set, OOM is considered handled,
>> >> otherwise the next OOM handler in the chain is executed: e.g. BPF OOM
>> >> attached to the parent cgroup or the kernel OOM killer.
>> >
>> > I still find this dual reporting a bit confusing. I can see your
>> > intention in having a pre-defined "releasers" of the memory to trust BPF
>> > handlers more but they do have access to oc->bpf_memory_freed so they
>> > can manipulate it. Therefore an additional level of protection is rather
>> > weak.
>> 
>> No, they can't. They have only a read-only access.
>
> Could you explain this a bit more. This must be some BPF magic because
> they are getting a standard pointer to oom_control.

Yes, but bpf programs (unlike kernel modules) go through the verifier
when being loaded into the kernel. The verifier ensures that programs
are safe: e.g. they can't access memory outside of safe areas, can't
have infinite loops, can't dereference a NULL pointer, etc.

So even though it looks like a normal argument, it's read-only. And the
program can't even read memory outside of the structure itself, e.g. a
program doing something like (oc + 1)->bpf_memory_freed won't be allowed
to load.
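
For example (illustrative only; writes to the oom_control argument are
rejected because nothing grants write access to it):

	oc->bpf_memory_freed = true;		/* rejected at load time */
	bool freed = oc->bpf_memory_freed;	/* read access is fine */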

>> > It is also not really clear to me how this works while there is OOM
>> > victim on the way out. (i.e. tsk_is_oom_victim() -> abort case). This
>> > will result in no killing therefore no bpf_memory_freed, right? Handler
>> > itself should consider its work done. How exactly is this handled.
>> 
>> It's a good question, I see your point...
>> Basically we want to give a handler an option to exit with "I promise,
>> some memory will be freed soon" without doing anything destructive.
>> But keeping it save at the same time.
>
> Yes, something like OOM_BACKOFF, OOM_PROCESSED, OOM_FAILED.
>
>> I don't have a perfect answer out of my head, maybe some sort of a
>> rate-limiter/counter might work? E.g. a handler can promise this N times
>> before the kernel kicks in? Any ideas?
>
> Counters usually do not work very well for async operations. In this
> case there is oom_repaer and/or task exit to finish the oom operation.
> The former is bound and guaranteed to make a forward progress but there
> is no time frame to assume when that happens as it depends on how many
> tasks might be queued (usually a single one but this is not something to
> rely on because of concurrent ooms in memcgs and also multiple tasks
> could be killed at the same time).
> Another complication is that there are multiple levels of OOM to track
> (global, NUMA, memcg) so any watchdog would have to be aware of that as
> well.

Yeah, it has to be an atomic counter attached to the bpf oom "instance":
a policy attached to a specific cgroup or system-wide.

> I am really wondering whether we really need to be so careful with
> handlers. It is not like you would allow any random oom handler to be
> loaded, right? Would it make sense to start without this protection and
> converge to something as we see how this evolves? Maybe this will raise
> the bar for oom handlers as the price for bugs is going to be really
> high.

Right, bpf programs require CAP_SYS_ADMIN to be loaded.
I'd still prefer to keep it 100% safe, but the more I think about it
the more I agree with you: the limitations of the protection mechanism
will likely create more issues than the protection itself is worth.

>> > Also is there any way to handle the oom by increasing the memcg limit?
>> > I do not see a callback for that.
>> 
>> There is no kfunc yet, but it's a good idea (which we accidentally
>> discussed few days ago). I'll implement it.
>
> Cool!

Thank you!