From: Roman Gushchin
To: bpf@vger.kernel.org
Cc: Michal Hocko, Alexei Starovoitov, Matt Bobrowski, Shakeel Butt,
 JP Kobryn, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 Suren Baghdasaryan, Johannes Weiner, Andrew Morton, Roman Gushchin
Subject: [PATCH bpf-next v3 07/17] mm: introduce BPF OOM struct ops
Date: Mon, 26 Jan 2026 18:44:10 -0800
Message-ID: <20260127024421.494929-8-roman.gushchin@linux.dev>
In-Reply-To: <20260127024421.494929-1-roman.gushchin@linux.dev>
References: <20260127024421.494929-1-roman.gushchin@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org

Introduce a bpf struct ops for implementing custom OOM handling
policies.

It's possible to load one bpf_oom_ops for the system and one
bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
cgroup tree is traversed from the OOM'ing memcg up to the root and
the corresponding BPF OOM handlers are executed until some memory is
freed. If no memory is freed, the kernel OOM killer is invoked.

The struct ops provides the handle_out_of_memory() callback, which is
expected to return 1 if it was able to free some memory and 0
otherwise. If 1 is returned, the kernel also checks the
bpf_memory_freed field of the oom_control structure, which is expected
to be set by kfuncs suitable for releasing memory (which will be
introduced later in the patch series). If both are set, the OOM is
considered handled; otherwise the next OOM handler in the chain is
executed, e.g. a BPF OOM handler attached to the parent cgroup or the
kernel OOM killer.

The handle_out_of_memory() callback program is sleepable to allow
using iterators, e.g. cgroup iterators. The callback receives struct
oom_control as an argument, so it can determine the scope of the OOM
event: whether it is a memcg-wide or a system-wide OOM. It also
receives bpf_struct_ops_link as the second argument, so it can detect
the cgroup level at which this specific instance is attached.

The handle_out_of_memory() callback is executed just before the
kernel victim task selection algorithm, so heuristics and sysctls such
as panic on OOM and sysctl_oom_kill_allocating_task are respected.

The struct ops has a name field, which allows defining a custom name
for the implemented policy. It's printed in the OOM report in the
oom_handler= format only if a bpf handler is invoked.

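The handler chain described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: every model_* type and handler_* function below is a hypothetical stand-in (for struct oom_control, a memory cgroup with an optional attached policy, and the handle_out_of_memory() callback), and locking, RCU and the struct ops link are left out. It shows the one rule the description relies on: a handler's return value of 1 counts only if the progress flag was also set, otherwise the walk continues towards the root.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel types. */
struct model_oom_control {
	bool memory_freed;	/* plays the role of oc->bpf_memory_freed */
};

struct model_cgroup;

typedef int (*model_handler_t)(struct model_oom_control *oc,
			       struct model_cgroup *attached_at);

struct model_cgroup {
	struct model_cgroup *parent;	/* NULL for the root */
	model_handler_t handler;	/* NULL if no policy is attached */
	int calls;			/* how many times the handler ran */
};

/* Returns 1 and marks progress: the OOM counts as handled. */
static int handler_frees(struct model_oom_control *oc, struct model_cgroup *cg)
{
	cg->calls++;
	oc->memory_freed = true;
	return 1;
}

/* Returns 1 without marking progress: the claim is not trusted. */
static int handler_claims_only(struct model_oom_control *oc,
			       struct model_cgroup *cg)
{
	(void)oc;
	cg->calls++;
	return 1;
}

/* Returns 0: the next handler up the tree gets a chance. */
static int handler_gives_up(struct model_oom_control *oc,
			    struct model_cgroup *cg)
{
	(void)oc;
	cg->calls++;
	return 0;
}

/*
 * Walk from the OOM'ing cgroup up to the root and run each attached
 * handler until one both returns non-zero and sets the progress flag.
 * A false result means the caller would fall back to the kernel OOM killer.
 */
static bool model_handle_oom(struct model_oom_control *oc,
			     struct model_cgroup *start)
{
	assert(oc != NULL);

	for (struct model_cgroup *cg = start; cg; cg = cg->parent) {
		if (!cg->handler)
			continue;

		oc->memory_freed = false;
		if (cg->handler(oc, cg) && oc->memory_freed)
			return true;
	}

	return false;
}
```

With a leaf cgroup using handler_claims_only and a root using handler_frees, model_handle_oom() runs both handlers and reports success from the root; swapping the root to handler_gives_up makes it return false.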
Signed-off-by: Roman Gushchin
---
 MAINTAINERS                     |   2 +
 include/linux/bpf-cgroup-defs.h |   3 +
 include/linux/bpf.h             |   1 +
 include/linux/bpf_oom.h         |  46 ++++++++
 include/linux/oom.h             |   8 ++
 kernel/bpf/bpf_struct_ops.c     |  12 +-
 mm/Makefile                     |   2 +-
 mm/bpf_oom.c                    | 192 ++++++++++++++++++++++++++++++++
 mm/oom_kill.c                   |  19 ++++
 9 files changed, 282 insertions(+), 3 deletions(-)
 create mode 100644 include/linux/bpf_oom.h
 create mode 100644 mm/bpf_oom.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 491d567f7dc8..53465570c1e5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4807,7 +4807,9 @@ M:	Shakeel Butt
 L:	bpf@vger.kernel.org
 L:	linux-mm@kvack.org
 S:	Maintained
+F:	include/linux/bpf_oom.h
 F:	mm/bpf_memcontrol.c
+F:	mm/bpf_oom.c
 
 BPF [MISC]
 L:	bpf@vger.kernel.org
diff --git a/include/linux/bpf-cgroup-defs.h b/include/linux/bpf-cgroup-defs.h
index 6c5e37190dad..52395834ce13 100644
--- a/include/linux/bpf-cgroup-defs.h
+++ b/include/linux/bpf-cgroup-defs.h
@@ -74,6 +74,9 @@ struct cgroup_bpf {
 	/* list of bpf struct ops links */
 	struct list_head struct_ops_links;
 
+	/* BPF OOM struct ops link */
+	struct bpf_struct_ops_link __rcu *bpf_oom_link;
+
 	/* reference counter used to detach bpf programs after cgroup removal */
 	struct percpu_ref refcnt;
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 391888eb257c..a5cee5a657b0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -3944,6 +3944,7 @@ static inline bool bpf_is_subprog(const struct bpf_prog *prog)
 int bpf_prog_get_file_line(struct bpf_prog *prog, unsigned long ip, const char **filep,
 			   const char **linep, int *nump);
 struct bpf_prog *bpf_prog_find_from_stack(void);
+void *bpf_struct_ops_data(struct bpf_map *map);
 
 int bpf_insn_array_init(struct bpf_map *map, const struct bpf_prog *prog);
 int bpf_insn_array_ready(struct bpf_map *map);
diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
new file mode 100644
index 000000000000..c81133145c50
--- /dev/null
+++ b/include/linux/bpf_oom.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef __BPF_OOM_H
+#define __BPF_OOM_H
+
+struct oom_control;
+
+#define BPF_OOM_NAME_MAX_LEN 64
+
+struct bpf_oom_ops {
+	/**
+	 * @handle_out_of_memory: Out of memory bpf handler, called before
+	 * the in-kernel OOM killer.
+	 * @oc: OOM control structure
+	 * @st_link: struct ops link
+	 *
+	 * Should return 1 if some memory was freed up, otherwise
+	 * the in-kernel OOM killer is invoked.
+	 */
+	int (*handle_out_of_memory)(struct oom_control *oc,
+				    struct bpf_struct_ops_link *st_link);
+
+	/**
+	 * @name: BPF OOM policy name
+	 */
+	char name[BPF_OOM_NAME_MAX_LEN];
+};
+
+#ifdef CONFIG_BPF_SYSCALL
+/**
+ * @bpf_handle_oom: handle out of memory condition using bpf
+ * @oc: OOM control structure
+ *
+ * Returns true if some memory was freed.
+ */
+bool bpf_handle_oom(struct oom_control *oc);
+
+#else /* CONFIG_BPF_SYSCALL */
+static inline bool bpf_handle_oom(struct oom_control *oc)
+{
+	return false;
+}
+
+#endif /* CONFIG_BPF_SYSCALL */
+
+#endif /* __BPF_OOM_H */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7b02bc1d0a7e..c2dce336bcb4 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -51,6 +51,14 @@ struct oom_control {
 
 	/* Used to print the constraint info. */
 	enum oom_constraint constraint;
+
+#ifdef CONFIG_BPF_SYSCALL
+	/* Used by the bpf oom implementation to mark the forward progress */
+	bool bpf_memory_freed;
+
+	/* Handler name */
+	const char *bpf_handler_name;
+#endif
 };
 
 extern struct mutex oom_lock;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 2e361e22cfa0..6285a6d56b98 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -1009,7 +1009,7 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
 	 * in the tramopline image to finish before releasing
 	 * the trampoline image.
 	 */
-	synchronize_rcu_mult(call_rcu, call_rcu_tasks);
+	synchronize_rcu_mult(call_rcu, call_rcu_tasks, call_rcu_tasks_trace);
 
 	__bpf_struct_ops_map_free(map);
 }
@@ -1226,7 +1226,8 @@ static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
 	if (st_link->cgroup)
 		cgroup_bpf_detach_struct_ops(st_link->cgroup, st_link);
 
-	kfree(st_link);
+	synchronize_rcu_tasks_trace();
+	kfree_rcu(st_link, link.rcu);
 }
 
 static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
@@ -1535,3 +1536,10 @@ void bpf_map_struct_ops_info_fill(struct bpf_map_info *info, struct bpf_map *map
 
 	info->btf_vmlinux_id = btf_obj_id(st_map->btf);
 }
+
+void *bpf_struct_ops_data(struct bpf_map *map)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+
+	return &st_map->kvalue.data;
+}
diff --git a/mm/Makefile b/mm/Makefile
index bf46fe31dc14..e939525ba01b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -107,7 +107,7 @@ ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
 ifdef CONFIG_BPF_SYSCALL
-obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
+obj-$(CONFIG_MEMCG) += bpf_memcontrol.o bpf_oom.o
 endif
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_GUP_TEST) += gup_test.o
diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
new file mode 100644
index 000000000000..ea70be6e2c26
--- /dev/null
+++ b/mm/bpf_oom.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * BPF-driven OOM killer customization
+ *
+ * Author: Roman Gushchin
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+static int bpf_ops_handle_oom(struct bpf_oom_ops *bpf_oom_ops,
+			      struct bpf_struct_ops_link *st_link,
+			      struct oom_control *oc)
+{
+	int ret;
+
+	oc->bpf_handler_name = &bpf_oom_ops->name[0];
+	oc->bpf_memory_freed = false;
+	pagefault_disable();
+	ret = bpf_oom_ops->handle_out_of_memory(oc, st_link);
+	pagefault_enable();
+	oc->bpf_handler_name = NULL;
+
+	return ret;
+}
+
+bool bpf_handle_oom(struct oom_control *oc)
+{
+	struct bpf_struct_ops_link *st_link;
+	struct bpf_oom_ops *bpf_oom_ops;
+	struct mem_cgroup *memcg;
+	struct bpf_map *map;
+	int ret = 0;
+
+	/*
+	 * System-wide OOMs are handled by the struct ops attached
+	 * to the root memory cgroup
+	 */
+	memcg = oc->memcg ? oc->memcg : root_mem_cgroup;
+
+	rcu_read_lock_trace();
+
+	/* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		st_link = rcu_dereference_check(memcg->css.cgroup->bpf.bpf_oom_link,
+						rcu_read_lock_trace_held());
+		if (!st_link)
+			continue;
+
+		map = rcu_dereference_check((st_link->map),
+					    rcu_read_lock_trace_held());
+		if (!map)
+			continue;
+
+		/* Call BPF OOM handler */
+		bpf_oom_ops = bpf_struct_ops_data(map);
+		ret = bpf_ops_handle_oom(bpf_oom_ops, st_link, oc);
+		if (ret && oc->bpf_memory_freed)
+			break;
+		ret = 0;
+	}
+
+	rcu_read_unlock_trace();
+
+	return ret && oc->bpf_memory_freed;
+}
+
+static int __handle_out_of_memory(struct oom_control *oc,
+				  struct bpf_struct_ops_link *st_link)
+{
+	return 0;
+}
+
+static struct bpf_oom_ops __bpf_oom_ops = {
+	.handle_out_of_memory = __handle_out_of_memory,
+};
+
+static const struct bpf_func_proto *
+bpf_oom_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return tracing_prog_func_proto(func_id, prog);
+}
+
+static bool bpf_oom_ops_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_verifier_ops bpf_oom_verifier_ops = {
+	.get_func_proto = bpf_oom_func_proto,
+	.is_valid_access = bpf_oom_ops_is_valid_access,
+};
+
+static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
+	struct cgroup *cgrp;
+
+	/* The link is not yet fully initialized, but cgroup should be set */
+	if (!link)
+		return -EOPNOTSUPP;
+
+	cgrp = st_link->cgroup;
+	if (!cgrp)
+		return -EINVAL;
+
+	if (cmpxchg(&cgrp->bpf.bpf_oom_link, NULL, st_link))
+		return -EEXIST;
+
+	return 0;
+}
+
+static void bpf_oom_ops_unreg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
+	struct cgroup *cgrp;
+
+	if (!link)
+		return;
+
+	cgrp = st_link->cgroup;
+	if (!cgrp)
+		return;
+
+	WARN_ON(cmpxchg(&cgrp->bpf.bpf_oom_link, st_link, NULL) != st_link);
+}
+
+static int bpf_oom_ops_check_member(const struct btf_type *t,
+				    const struct btf_member *member,
+				    const struct bpf_prog *prog)
+{
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+	switch (moff) {
+	case offsetof(struct bpf_oom_ops, handle_out_of_memory):
+		if (!prog)
+			return -EINVAL;
+		break;
+	}
+
+	return 0;
+}
+
+static int bpf_oom_ops_init_member(const struct btf_type *t,
+				   const struct btf_member *member,
+				   void *kdata, const void *udata)
+{
+	const struct bpf_oom_ops *uops = udata;
+	struct bpf_oom_ops *ops = kdata;
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+	switch (moff) {
+	case offsetof(struct bpf_oom_ops, name):
+		if (uops->name[0])
+			strscpy_pad(ops->name, uops->name, sizeof(ops->name));
+		else
+			strscpy_pad(ops->name, "bpf_defined_policy");
+		return 1;
+	}
+	return 0;
+}
+
+static int bpf_oom_ops_init(struct btf *btf)
+{
+	return 0;
+}
+
+static struct bpf_struct_ops bpf_oom_bpf_ops = {
+	.verifier_ops = &bpf_oom_verifier_ops,
+	.reg = bpf_oom_ops_reg,
+	.unreg = bpf_oom_ops_unreg,
+	.check_member = bpf_oom_ops_check_member,
+	.init_member = bpf_oom_ops_init_member,
+	.init = bpf_oom_ops_init,
+	.name = "bpf_oom_ops",
+	.owner = THIS_MODULE,
+	.cfi_stubs = &__bpf_oom_ops
+};
+
+static int __init bpf_oom_struct_ops_init(void)
+{
+	return register_bpf_struct_ops(&bpf_oom_bpf_ops, bpf_oom_ops);
+}
+late_initcall(bpf_oom_struct_ops_init);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5eb11fbba704..44bbcf033804 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -45,6 +45,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include "internal.h"
@@ -246,6 +247,15 @@ static const char * const oom_constraint_text[] = {
 	[CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG",
 };
 
+static const char *oom_handler_name(struct oom_control *oc)
+{
+#ifdef CONFIG_BPF_SYSCALL
+	if (oc->bpf_handler_name)
+		return oc->bpf_handler_name;
+#endif
+	return NULL;
+}
+
 /*
  * Determine the type of allocation constraint.
  */
@@ -461,6 +471,8 @@ static void dump_header(struct oom_control *oc)
 	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
 		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
 			current->signal->oom_score_adj);
+	if (oom_handler_name(oc))
+		pr_warn("oom bpf handler: %s\n", oom_handler_name(oc));
 	if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
 		pr_warn("COMPACTION is disabled!!!\n");
 
@@ -1168,6 +1180,13 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	/*
+	 * Let bpf handle the OOM first. If it was able to free up some memory,
+	 * bail out. Otherwise fall back to the kernel OOM killer.
+	 */
+	if (bpf_handle_oom(oc))
+		return true;
+
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
-- 
2.52.0
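
The attach rule enforced by bpf_oom_ops_reg() and bpf_oom_ops_unreg() in the patch (at most one OOM link per cgroup, claimed and released with a compare-and-swap on the link pointer) can be modelled in userspace with C11 atomics. This is a rough sketch under stated assumptions, not the kernel implementation: model_link, model_slot and the model_* functions are hypothetical names, atomic_compare_exchange_strong() stands in for the kernel's cmpxchg(), and the unregister helper returns false where the kernel code would trigger WARN_ON().

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the link and the per-cgroup slot. */
struct model_link {
	int id;
};

struct model_slot {
	_Atomic(struct model_link *) oom_link;	/* NULL when nothing is attached */
};

static void model_slot_init(struct model_slot *slot)
{
	struct model_link *none = NULL;

	assert(slot != NULL);
	atomic_init(&slot->oom_link, none);
}

/* Claim the slot only if it is empty; a second attach gets -EEXIST. */
static int model_reg(struct model_slot *slot, struct model_link *link)
{
	struct model_link *expected = NULL;

	if (!link)
		return -EOPNOTSUPP;
	if (!slot)
		return -EINVAL;

	if (!atomic_compare_exchange_strong(&slot->oom_link, &expected, link))
		return -EEXIST;

	return 0;
}

/* Clear the slot only if it still holds this exact link. */
static bool model_unreg(struct model_slot *slot, struct model_link *link)
{
	struct model_link *expected = link;
	struct model_link *none = NULL;

	if (!slot || !link)
		return false;

	return atomic_compare_exchange_strong(&slot->oom_link, &expected, none);
}
```

Attaching link A succeeds, attaching link B to the same slot then fails with -EEXIST, and B can be attached only after A has been detached. Using a single compare-and-swap keeps the check and the update atomic, so two concurrent attach attempts cannot both succeed.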