From nobody Sat Oct 4 09:40:55 2025 Received: from out-173.mta1.migadu.com (out-173.mta1.migadu.com [95.215.58.173]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 1AB7F343D95 for ; Mon, 18 Aug 2025 17:01:52 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.173 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536515; cv=none; b=AGvxxcTRrdegb7nC53s5vysFgwGPYJ83RFj+6aNF066AFVqnYgwVSHbebhVq932LvmQvZDsPIRcp7Zeoshzetz6Bd9PeM+dm3gEZYzGwRhwFah9xyOyyGZ7CVVaFTycAShCBWxncA1vzKfYPn9KLL1Vl+0/mFKYYjCydCEErGbE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536515; c=relaxed/simple; bh=nk4cg1zfeP+aFL6Ac5QA/xUBA8OJihACsC6uBn47WzM=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=RGPDcXxrUwctdpMpUz6dWi97kPa4dCDiGcjFd7pnlMouIaxUmkHHqMiRFVS2zK6v9CiunY1A6Ik14AoNwo15Rz/ghLE5kFo5mM00udmT4CTti6GbRDcH3BVtCXaYuoiy3w50TouQfyxMQCvPbG3z2u4xQB/Yq1Jpfg4AcFpfztI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=YJQDKt0c; arc=none smtp.client-ip=95.215.58.173 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="YJQDKt0c" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1755536511; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=JNLqPQvVnvWTwD+rLbtBhy1f6P7SG3IVoIVxszQU1Bw=; b=YJQDKt0cJsjXZZ9sUygYWpuDbmyDiOjE4s2t6gSApi0XBw6W8MPhjjKlpy6QxHC2bfXram P7hGTgH0zN0T46at7gjGacKdiLd/ADpgIW8zq4OXHo3ZCbo38vL8c3IYpqR1PyM/sC/YAw fX92oeD5jUIm1LhZ09Tkckli3JpasOg=
From: Roman Gushchin
To: linux-mm@kvack.org, bpf@vger.kernel.org
Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes, Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi, Alexei Starovoitov, Andrew Morton, linux-kernel@vger.kernel.org, Roman Gushchin
Subject: [PATCH v1 01/14] mm: introduce bpf struct ops for OOM handling
Date: Mon, 18 Aug 2025 10:01:23 -0700
Message-ID: <20250818170136.209169-2-roman.gushchin@linux.dev>
In-Reply-To: <20250818170136.209169-1-roman.gushchin@linux.dev>
References: <20250818170136.209169-1-roman.gushchin@linux.dev>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Migadu-Flow: FLOW_OUT
Content-Type: text/plain; charset="utf-8"

Introduce a bpf struct ops for implementing custom OOM handling policies.

The struct ops provides the handle_out_of_memory() callback, which is
expected to return 1 if it was able to free some memory and 0 otherwise.

In the latter case it's guaranteed that the in-kernel OOM killer will
be invoked. Otherwise the kernel also checks the bpf_memory_freed field
of the oom_control structure, which is expected to be set by kfuncs
suitable for releasing memory. It's a safety mechanism which prevents
a bpf program from claiming forward progress without actually releasing
memory. The callback program is sleepable to enable using iterators,
e.g. cgroup iterators.
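
The forward-progress rule described above can be illustrated with a small
userspace sketch. This is not kernel code: struct oom_control_model and the
three handlers are hypothetical stand-ins, and only the "return value AND
bpf_memory_freed" check mirrors what the patch describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct oom_control;
 * only the field discussed above is modeled. */
struct oom_control_model {
	bool bpf_memory_freed;
};

typedef int (*oom_handler_fn)(struct oom_control_model *oc);

/* Claims progress (returns 1) but never sets the flag. */
int handler_lies(struct oom_control_model *oc)
{
	(void)oc;
	return 1;
}

/* Stand-in for a handler that used a memory-releasing kfunc,
 * which is what sets the flag in the real design. */
int handler_frees(struct oom_control_model *oc)
{
	oc->bpf_memory_freed = true;
	return 1;
}

/* Handler that declines to act on this OOM. */
int handler_declines(struct oom_control_model *oc)
{
	(void)oc;
	return 0;
}

/* True only if the handler returned non-zero AND the flag was set.
 * False means the in-kernel OOM killer would still be invoked. */
bool bpf_oom_handled(oom_handler_fn handler)
{
	struct oom_control_model oc = { .bpf_memory_freed = false };
	int ret = handler(&oc);

	return ret && oc.bpf_memory_freed;
}
```

In this model a handler that only returns 1 is not trusted: the fallback
to the in-kernel OOM killer is skipped only when the flag was set as well.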

The callback receives struct oom_control as an argument, so it can
easily filter out OOM's it doesn't want to handle, e.g. global vs
memcg OOM's.

The callback is executed just before the kernel victim task selection
algorithm, so all heuristics and sysctls like panic on oom and
sysctl_oom_kill_allocating_task are respected.

The struct ops also has the name field, which allows defining a
custom name for the implemented policy. It's printed in the OOM report
in the oom_policy= format. "default" is printed if bpf is not used or
policy name is not specified.

[ 112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
oom_policy=bpf_test_policy
[ 112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
[ 112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
[ 112.698167] Call Trace:
[ 112.698177]
[ 112.698182] dump_stack_lvl+0x4d/0x70
[ 112.698192] dump_header+0x59/0x1c6
[ 112.698199] oom_kill_process.cold+0x8/0xef
[ 112.698206] bpf_oom_kill_process+0x59/0xb0
[ 112.698216] bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
[ 112.698229] bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
[ 112.698236] ? srso_alias_return_thunk+0x5/0xfbef5
[ 112.698240] bpf_handle_oom+0x11a/0x1e0
[ 112.698250] out_of_memory+0xab/0x5c0
[ 112.698258] mem_cgroup_out_of_memory+0xbc/0x110
[ 112.698274] try_charge_memcg+0x4b5/0x7e0
[ 112.698288] charge_memcg+0x2f/0xc0
[ 112.698293] __mem_cgroup_charge+0x30/0xc0
[ 112.698299] do_anonymous_page+0x40f/0xa50
[ 112.698311] __handle_mm_fault+0xbba/0x1140
[ 112.698317] ? srso_alias_return_thunk+0x5/0xfbef5
[ 112.698335] handle_mm_fault+0xe6/0x370
[ 112.698343] do_user_addr_fault+0x211/0x6a0
[ 112.698354] exc_page_fault+0x75/0x1d0
[ 112.698363] asm_exc_page_fault+0x26/0x30
[ 112.698366] RIP: 0033:0x7fa97236db00

It's possible to load multiple bpf struct ops programs.
In the case of oom, they will be executed one by one in the same order
they were loaded until one of them returns 1 and bpf_memory_freed is
set to 1 - an indication that the memory was freed. This allows
multiple bpf programs to focus on different types of OOM's - e.g.
one program can only handle memcg OOM's in one memory cgroup.
But the filtering is done in bpf - so it's fully flexible.

Signed-off-by: Roman Gushchin
---
 include/linux/bpf_oom.h |  49 +++++++++++++
 include/linux/oom.h     |   8 ++
 mm/Makefile             |   3 +
 mm/bpf_oom.c            | 157 ++++++++++++++++++++++++++++++++++++++++
 mm/oom_kill.c           |  22 +++++-
 5 files changed, 237 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/bpf_oom.h
 create mode 100644 mm/bpf_oom.c

diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
new file mode 100644
index 000000000000..29cb5ea41d97
--- /dev/null
+++ b/include/linux/bpf_oom.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef __BPF_OOM_H
+#define __BPF_OOM_H
+
+struct bpf_oom;
+struct oom_control;
+
+#define BPF_OOM_NAME_MAX_LEN 64
+
+struct bpf_oom_ops {
+	/**
+	 * @handle_out_of_memory: Out of memory bpf handler, called before
+	 * the in-kernel OOM killer.
+	 * @oc: OOM control structure
+	 *
+	 * Should return 1 if some memory was freed up, otherwise
+	 * the in-kernel OOM killer is invoked.
+	 */
+	int (*handle_out_of_memory)(struct oom_control *oc);
+
+	/**
+	 * @name: BPF OOM policy name
+	 */
+	char name[BPF_OOM_NAME_MAX_LEN];
+
+	/* Private */
+	struct bpf_oom *bpf_oom;
+};
+
+#ifdef CONFIG_BPF_SYSCALL
+/**
+ * @bpf_handle_oom: handle out of memory using bpf programs
+ * @oc: OOM control structure
+ *
+ * Returns true if a bpf oom program was executed, returned 1
+ * and some memory was actually freed.
+ */
+bool bpf_handle_oom(struct oom_control *oc);
+
+#else /* CONFIG_BPF_SYSCALL */
+static inline bool bpf_handle_oom(struct oom_control *oc)
+{
+	return false;
+}
+
+#endif /* CONFIG_BPF_SYSCALL */
+
+#endif /* __BPF_OOM_H */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 1e0fc6931ce9..ef453309b7ea 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -51,6 +51,14 @@ struct oom_control {
 
 	/* Used to print the constraint info. */
 	enum oom_constraint constraint;
+
+#ifdef CONFIG_BPF_SYSCALL
+	/* Used by the bpf oom implementation to mark the forward progress */
+	bool bpf_memory_freed;
+
+	/* Policy name */
+	const char *bpf_policy_name;
+#endif
 };
 
 extern struct mutex oom_lock;
diff --git a/mm/Makefile b/mm/Makefile
index 1a7a11d4933d..a714aba03759 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -105,6 +105,9 @@ obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
 ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
+ifdef CONFIG_BPF_SYSCALL
+obj-y += bpf_oom.o
+endif
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_GUP_TEST) += gup_test.o
 obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
new file mode 100644
index 000000000000..47633046819c
--- /dev/null
+++ b/mm/bpf_oom.c
@@ -0,0 +1,157 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * BPF-driven OOM killer customization
+ *
+ * Author: Roman Gushchin
+ */
+
+#include
+#include
+#include
+#include
+
+DEFINE_STATIC_SRCU(bpf_oom_srcu);
+static DEFINE_SPINLOCK(bpf_oom_lock);
+static LIST_HEAD(bpf_oom_handlers);
+
+struct bpf_oom {
+	struct bpf_oom_ops *ops;
+	struct list_head node;
+	struct srcu_struct srcu;
+};
+
+bool bpf_handle_oom(struct oom_control *oc)
+{
+	struct bpf_oom_ops *ops;
+	struct bpf_oom *bpf_oom;
+	int list_idx, idx, ret = 0;
+
+	oc->bpf_memory_freed = false;
+
+	list_idx = srcu_read_lock(&bpf_oom_srcu);
+	list_for_each_entry_srcu(bpf_oom, &bpf_oom_handlers, node, false) {
+		ops = READ_ONCE(bpf_oom->ops);
+		if (!ops || !ops->handle_out_of_memory)
+			continue;
+		idx = srcu_read_lock(&bpf_oom->srcu);
+		oc->bpf_policy_name = ops->name[0] ? &ops->name[0] :
+			"bpf_defined_policy";
+		ret = ops->handle_out_of_memory(oc);
+		oc->bpf_policy_name = NULL;
+		srcu_read_unlock(&bpf_oom->srcu, idx);
+
+		if (ret && oc->bpf_memory_freed)
+			break;
+	}
+	srcu_read_unlock(&bpf_oom_srcu, list_idx);
+
+	return ret && oc->bpf_memory_freed;
+}
+
+static int __handle_out_of_memory(struct oom_control *oc)
+{
+	return 0;
+}
+
+static struct bpf_oom_ops __bpf_oom_ops = {
+	.handle_out_of_memory = __handle_out_of_memory,
+};
+
+static const struct bpf_func_proto *
+bpf_oom_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return tracing_prog_func_proto(func_id, prog);
+}
+
+static bool bpf_oom_ops_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_verifier_ops bpf_oom_verifier_ops = {
+	.get_func_proto = bpf_oom_func_proto,
+	.is_valid_access = bpf_oom_ops_is_valid_access,
+};
+
+static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_oom_ops *ops = kdata;
+	struct bpf_oom *bpf_oom;
+	int ret;
+
+	bpf_oom = kmalloc(sizeof(*bpf_oom), GFP_KERNEL_ACCOUNT);
+	if (!bpf_oom)
+		return -ENOMEM;
+
+	ret = init_srcu_struct(&bpf_oom->srcu);
+	if (ret) {
+		kfree(bpf_oom);
+		return ret;
+	}
+
+	WRITE_ONCE(bpf_oom->ops, ops);
+	ops->bpf_oom = bpf_oom;
+
+	spin_lock(&bpf_oom_lock);
+	list_add_rcu(&bpf_oom->node, &bpf_oom_handlers);
+	spin_unlock(&bpf_oom_lock);
+
+	return 0;
+}
+
+static void bpf_oom_ops_unreg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_oom_ops *ops = kdata;
+	struct bpf_oom *bpf_oom = ops->bpf_oom;
+
+	WRITE_ONCE(bpf_oom->ops, NULL);
+
+	spin_lock(&bpf_oom_lock);
+	list_del_rcu(&bpf_oom->node);
+	spin_unlock(&bpf_oom_lock);
+
+	synchronize_srcu(&bpf_oom->srcu);
+
+	kfree(bpf_oom);
+}
+
+static int bpf_oom_ops_init_member(const struct btf_type *t,
+				   const struct btf_member *member,
+				   void *kdata, const void *udata)
+{
+	const struct bpf_oom_ops *uops = (const struct bpf_oom_ops *)udata;
+	struct bpf_oom_ops *ops = (struct bpf_oom_ops *)kdata;
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+	switch (moff) {
+	case offsetof(struct bpf_oom_ops, name):
+		strscpy_pad(ops->name, uops->name, sizeof(ops->name));
+		return 1;
+	}
+	return 0;
+}
+
+static int bpf_oom_ops_init(struct btf *btf)
+{
+	return 0;
+}
+
+static struct bpf_struct_ops bpf_oom_bpf_ops = {
+	.verifier_ops = &bpf_oom_verifier_ops,
+	.reg = bpf_oom_ops_reg,
+	.unreg = bpf_oom_ops_unreg,
+	.init_member = bpf_oom_ops_init_member,
+	.init = bpf_oom_ops_init,
+	.name = "bpf_oom_ops",
+	.owner = THIS_MODULE,
+	.cfi_stubs = &__bpf_oom_ops
+};
+
+static int __init bpf_oom_struct_ops_init(void)
+{
+	return register_bpf_struct_ops(&bpf_oom_bpf_ops, bpf_oom_ops);
+}
+late_initcall(bpf_oom_struct_ops_init);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 25923cfec9c6..ad7bd65061d6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -45,6 +45,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include "internal.h"
@@ -246,6 +247,15 @@ static const char * const oom_constraint_text[] = {
 	[CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG",
 };
 
+static const char *oom_policy_name(struct oom_control *oc)
+{
+#ifdef CONFIG_BPF_SYSCALL
+	if (oc->bpf_policy_name)
+		return oc->bpf_policy_name;
+#endif
+	return "default";
+}
+
 /*
  * Determine the type of allocation constraint.
  */
@@ -458,9 +468,10 @@ static void dump_oom_victim(struct oom_control *oc, struct task_struct *victim)
 
 static void dump_header(struct oom_control *oc)
 {
-	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
+	pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\noom_policy=%s\n",
 		current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
-			current->signal->oom_score_adj);
+		current->signal->oom_score_adj,
+		oom_policy_name(oc));
 	if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
 		pr_warn("COMPACTION is disabled!!!\n");
 
@@ -1161,6 +1172,13 @@ bool out_of_memory(struct oom_control *oc)
 		return true;
 	}
 
+	/*
+	 * Let bpf handle the OOM first. If it was able to free up some memory,
+	 * bail out. Otherwise fall back to the kernel OOM killer.
+	 */
+	if (bpf_handle_oom(oc))
+		return true;
+
 	select_bad_process(oc);
 	/* Found nothing?!?! */
 	if (!oc->chosen) {
-- 
2.50.1

From nobody Sat Oct 4 09:40:55 2025
Received: from out-176.mta1.migadu.com (out-176.mta1.migadu.com [95.215.58.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 58152345743 for ; Mon, 18 Aug 2025 17:01:56 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.176
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536518; cv=none; b=qpW77ByTcyjxS1DsU52/Ur4sotl0iuTeDuci6YSiqWu9hLSbhisHlorZH0BA3kAXSzk9CGcBTbcUAKCBqdLL3DUUgCfiFA9l5xBXdFQ9aO36O6YSluNSWvmdRnDaYoSoJRcKPUDVY+k0r2LDBKmDXWnBVewhqac79X59ibSxHpo=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536518; c=relaxed/simple; bh=/dEZnLiGaz7HAgtMNSqJ6aIkswSJwGKaAbEHmWJ87j4=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version;
b=EJxuAA47I4wRpyj+h5ABBvsjNTeOR/6QxFHGbWPKjFMxYnx9MM/Dld2ZQUfspOS5/RpeYbPhO5faLgjv93KKDrbowgWOdmVPpYcN21k5m5FfYXcznq9kbZKK5dBbEQbSdEMfEPHyTbNIr/p0v2S7dbQojfBOSnFEkXxeVwwb01s= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=stwTv26k; arc=none smtp.client-ip=95.215.58.176 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="stwTv26k" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1755536514; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=spKMQn7nIVpT0/mjP9As0UlCNDUJOgYkELPXiq5D32U=; b=stwTv26ksBnQzC/fOPKoIoZj8nFwNHuCsH2a3/p+NiLavlZyC/iWksVguM9b1cmfmHwuht 4qxDw5kejSx5NB86vwRw7QmLxeji0zoV/T7OPU/5t5WVbx/yXBGUQMmmory7rKcWcW7Hah GM6PJtse+KnlXgK7IcEtdM0FKn5dcm8= From: Roman Gushchin To: linux-mm@kvack.org, bpf@vger.kernel.org Cc: Suren Baghdasaryan , Johannes Weiner , Michal Hocko , David Rientjes , Matt Bobrowski , Song Liu , Kumar Kartikeya Dwivedi , Alexei Starovoitov , Andrew Morton , linux-kernel@vger.kernel.org, Roman Gushchin Subject: [PATCH v1 02/14] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Date: Mon, 18 Aug 2025 10:01:24 -0700 Message-ID: <20250818170136.209169-3-roman.gushchin@linux.dev> In-Reply-To: <20250818170136.209169-1-roman.gushchin@linux.dev> References: <20250818170136.209169-1-roman.gushchin@linux.dev> Precedence: bulk 
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Migadu-Flow: FLOW_OUT
Content-Type: text/plain; charset="utf-8"

Struct oom_control is used to describe the OOM context.
Its memcg field defines the scope of OOM: it's NULL for global
OOMs and a valid memcg pointer for memcg-scoped OOMs.
Teach the bpf verifier to recognize it as a trusted or NULL pointer.
It will provide the bpf OOM handler a trusted memcg pointer,
which for example is required for iterating the memcg's subtree.

Signed-off-by: Roman Gushchin
Acked-by: Kumar Kartikeya Dwivedi
---
 kernel/bpf/verifier.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 169845710c7e..b5153c843028 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7035,6 +7035,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket) {
 	struct sock *sk;
 };
 
+BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control) {
+	struct mem_cgroup *memcg;
+};
+
 static bool type_is_rcu(struct bpf_verifier_env *env,
 			struct bpf_reg_state *reg,
 			const char *field_name, u32 btf_id)
@@ -7075,6 +7079,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
 {
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
 	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry));
+	BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control));
 
 	return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
 					  "__safe_trusted_or_null");
-- 
2.50.1

From nobody Sat Oct 4 09:40:55 2025
Received: from out-183.mta1.migadu.com (out-183.mta1.migadu.com [95.215.58.183]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 171BC3469F7 for ; Mon, 18 Aug 2025 17:01:59 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.183
ARC-Seal: i=1;
a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536522; cv=none; b=JFLeoaGJpq+lRWF9lYPa7sTxKh2vKpuiKPGpAJPm46vbA15shrCnvlUaD+4Fq10+pr7bW2nHunJLM95vMTxCb/umsDhEil+qncuF81B5s7aT3raSD6nKeAiLeh08phuUN5oUY7rIKbHqbD9e2Tul8xVWLt4GE0CDZKLeXT/EUmk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536522; c=relaxed/simple; bh=VmVEbFLEnyWhw3pWoNq/VuyY2/MD0Sg3ccZdi1IaiWs=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=jsngr0e+oNT5ZJ7IhxXem8/KCde8kchhQp7pSnuwV3coVFZauyF9roQYv4v2rI0jWTUYyzY39F1wt0xE70Ok0TpfdJ6MVw48OLTz6ghD0FGbmsmaJ2pNglQ0kMxaUw/+QchozvBcoBPKuzB5c8vrdEcZZons/21UKp+jBBYxQzU= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=Z2Iw0gpu; arc=none smtp.client-ip=95.215.58.183 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="Z2Iw0gpu" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1755536518; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=L4x/oF580ZECIilVQBke49Y0d4JocMOGaoYxAcI7yCk=; b=Z2Iw0gput+/1fflbYf2ysFeYgnLOuo3q4kzRyb0+67lL+EX7r6p97LEJYkWsfd7bCEAx5J 1hWjzQB4MXsokOE7orCwLZNINuFfdXPIqQdMjjNWV5dDpCNH8gDSOxB4hx49h8ecYAevuZ gmlq3d8JVscbFimgiJC+HXmEhEVEYp8=
From: Roman Gushchin
To: linux-mm@kvack.org, bpf@vger.kernel.org
Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes, Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi, Alexei Starovoitov, Andrew Morton, linux-kernel@vger.kernel.org, Roman Gushchin
Subject: [PATCH v1 03/14] mm: introduce bpf_oom_kill_process() bpf kfunc
Date: Mon, 18 Aug 2025 10:01:25 -0700
Message-ID: <20250818170136.209169-4-roman.gushchin@linux.dev>
In-Reply-To: <20250818170136.209169-1-roman.gushchin@linux.dev>
References: <20250818170136.209169-1-roman.gushchin@linux.dev>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Migadu-Flow: FLOW_OUT
Content-Type: text/plain; charset="utf-8"

Introduce bpf_oom_kill_process() bpf kfunc, which is supposed
to be used by bpf OOM programs. It allows killing a process
in exactly the same way the OOM killer does: using the OOM reaper,
bumping corresponding memcg and global statistics, respecting
memory.oom.group etc.

On success, it sets oom_control's bpf_memory_freed field to true,
enabling the bpf program to bypass the kernel OOM killer.
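
The contract described above (error codes for tasks that can't be killed,
the flag set only on success) can be sketched as a userspace model. The
types and the model_oom_kill_process() name are hypothetical; the real
kfunc operates on struct oom_control and struct task_struct and calls
oom_kill_process().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a task. */
struct task_model {
	bool unkillable;	/* models oom_unkillable_task() returning true */
	bool gone;		/* models tryget_task_struct() failing */
	bool killed;
};

/* Hypothetical stand-in for struct oom_control. */
struct oom_ctx_model {
	struct task_model *chosen;
	bool bpf_memory_freed;
};

/* Mirrors the order of checks in the kfunc: refuse unkillable tasks,
 * fail if no reference can be taken, otherwise kill and mark progress. */
int model_oom_kill_process(struct oom_ctx_model *oc, struct task_model *task)
{
	if (task->unkillable)
		return -EPERM;
	if (task->gone)
		return -EINVAL;

	oc->chosen = task;
	task->killed = true;	/* stands in for oom_kill_process() */
	oc->chosen = NULL;
	oc->bpf_memory_freed = true;

	return 0;
}
```

In this model the error paths leave bpf_memory_freed untouched, so a
handler that fails to kill anything cannot claim forward progress.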

Signed-off-by: Roman Gushchin
---
 mm/oom_kill.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ad7bd65061d6..25fc5e744e27 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1282,3 +1282,70 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
 	return -ENOSYS;
 #endif /* CONFIG_MMU */
 }
+
+#ifdef CONFIG_BPF_SYSCALL
+
+__bpf_kfunc_start_defs();
+/**
+ * bpf_oom_kill_process - Kill a process as OOM killer
+ * @oc: pointer to oom_control structure, describes OOM context
+ * @task: task to be killed
+ * @message__str: message to print in dmesg
+ *
+ * Kill a process in a way similar to the kernel OOM killer.
+ * This means dump the necessary information to dmesg, adjust memcg
+ * statistics, leverage the oom reaper, respect memory.oom.group etc.
+ *
+ * bpf_oom_kill_process() marks the forward progress by setting
+ * oc->bpf_memory_freed. If the progress was made, the bpf program
+ * is free to decide if the kernel oom killer should be invoked.
+ * Otherwise it's enforced, so that a bad bpf program can't
+ * deadlock the machine on memory.
+ */
+__bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
+				     struct task_struct *task,
+				     const char *message__str)
+{
+	if (oom_unkillable_task(task))
+		return -EPERM;
+
+	/* paired with put_task_struct() in oom_kill_process() */
+	task = tryget_task_struct(task);
+	if (!task)
+		return -EINVAL;
+
+	oc->chosen = task;
+
+	oom_kill_process(oc, message__str);
+
+	oc->chosen = NULL;
+	oc->bpf_memory_freed = true;
+
+	return 0;
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_oom_kfuncs)
+BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE | KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_oom_kfuncs)
+
+static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set = &bpf_oom_kfuncs,
+};
+
+static int __init bpf_oom_init(void)
+{
+	int err;
+
+	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+					&bpf_oom_kfunc_set);
+	if (err)
+		pr_warn("error while registering bpf oom kfuncs: %d", err);
+
+	return err;
+}
+late_initcall(bpf_oom_init);
+
+#endif
-- 
2.50.1

From nobody Sat Oct 4 09:40:55 2025
Received: from out-181.mta1.migadu.com (out-181.mta1.migadu.com [95.215.58.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4759D34AAE1 for ; Mon, 18 Aug 2025 17:02:04 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.181
ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536527; cv=none; b=g8WDZ0kBKNjlutolZWWkUHI/ZMODz0lSN/V2osXCbQBP23q9FAQHOTYa4SczdqpuJsQaAM2K7t3l6WpMXR+u4uD5Wue4FcVE1Q7cUt7p+uCT2lBksDNSAbRYa5nahfSu05m31j0+FXx9GaOS3Zz4e22k3VBPBYYS+wdzqtI7w8k=
ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536527; c=relaxed/simple; bh=XkDc9xaRsCenjnAav3LSkzM9k5QAYW2F9B7ruvWnRkg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version;
b=kSk8SZqdmANEdo5S1G1O/EWKaLMKFNTpZ8VMCy5Stcr69nex3+a5iKi3bsvBfMILmkET4Y1bmWSNl0QZbURCJzeVjPzHIeC4n/zi04v1lnHNc5UioW2gsEmZtomMuyjINlO7h/k5cxe7CciAqZLQbBzladIgUeEzcEmY3pcW3cM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=HmVndicI; arc=none smtp.client-ip=95.215.58.181 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="HmVndicI" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1755536522; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=pbLFrydxxFltKx71yS/4EU7bCQFvXwfOv838sVEINzk=; b=HmVndicIgd1MOovjGT3tuk+Clf7w9gQt/J5iV244rD8BMDMagEbXceif4wnMg6wqi0thEx x0uvly97MaJ68GXg7+UWvj8c2K0A8gkSJwQyXDQ+uu87+gxH8naiO3OA/PGJl5XwAEq2uz tHBa73obbnXn3rKHcxgCrsAdsE6xpQ8= From: Roman Gushchin To: linux-mm@kvack.org, bpf@vger.kernel.org Cc: Suren Baghdasaryan , Johannes Weiner , Michal Hocko , David Rientjes , Matt Bobrowski , Song Liu , Kumar Kartikeya Dwivedi , Alexei Starovoitov , Andrew Morton , linux-kernel@vger.kernel.org, Roman Gushchin Subject: [PATCH v1 04/14] mm: introduce bpf kfuncs to deal with memcg pointers Date: Mon, 18 Aug 2025 10:01:26 -0700 Message-ID: <20250818170136.209169-5-roman.gushchin@linux.dev> In-Reply-To: <20250818170136.209169-1-roman.gushchin@linux.dev> References: <20250818170136.209169-1-roman.gushchin@linux.dev> Precedence: bulk X-Mailing-List: 
linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
X-Migadu-Flow: FLOW_OUT
Content-Type: text/plain; charset="utf-8"

To effectively operate with memory cgroups in bpf there is a need
to convert css pointers to memcg pointers. A simple container_of
cast which is used in the kernel code can't be used in bpf because
from the verifier's point of view that's an out-of-bounds memory access.

Introduce helper get/put kfuncs which can be used to get
a refcounted memcg pointer from the css pointer:
  - bpf_get_mem_cgroup,
  - bpf_put_mem_cgroup.

bpf_get_mem_cgroup() can take both memcg's css and the corresponding
cgroup's "self" css. It allows it to be used with the existing cgroup
iterator which iterates over cgroup tree, not memcg tree.

Signed-off-by: Roman Gushchin
---
 include/linux/memcontrol.h |   2 +
 mm/Makefile                |   1 +
 mm/bpf_memcontrol.c        | 151 +++++++++++++++++++++++++++++++++++++
 3 files changed, 154 insertions(+)
 create mode 100644 mm/bpf_memcontrol.c

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 87b6688f124a..785a064000cd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -932,6 +932,8 @@ static inline void mod_memcg_page_state(struct page *page,
 	rcu_read_unlock();
 }
 
+unsigned long memcg_events(struct mem_cgroup *memcg, int event);
+unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
 unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
 unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
 unsigned long lruvec_page_state_local(struct lruvec *lruvec,
diff --git a/mm/Makefile b/mm/Makefile
index a714aba03759..c397af904a87 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -107,6 +107,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
 ifdef CONFIG_BPF_SYSCALL
 obj-y += bpf_oom.o
+obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
 endif
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_GUP_TEST) += gup_test.o
diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
new file mode 100644
index 000000000000..66f2a359af7e
--- /dev/null
+++ b/mm/bpf_memcontrol.c
@@ -0,0 +1,151 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Memory Controller-related BPF kfuncs and auxiliary code
+ *
+ * Author: Roman Gushchin
+ */
+
+#include
+#include
+
+__bpf_kfunc_start_defs();
+
+/**
+ * bpf_get_mem_cgroup - Get a reference to a memory cgroup
+ * @css: pointer to the css structure
+ *
+ * Returns a pointer to a mem_cgroup structure after bumping
+ * the corresponding css's reference counter.
+ *
+ * It's fine to pass a css which belongs to any cgroup controller,
+ * e.g. unified hierarchy's main css.
+ *
+ * Implements KF_ACQUIRE semantics.
+ */
+__bpf_kfunc struct mem_cgroup *
+bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
+{
+	struct mem_cgroup *memcg = NULL;
+	bool rcu_unlock = false;
+
+	if (!root_mem_cgroup)
+		return NULL;
+
+	if (root_mem_cgroup->css.ss != css->ss) {
+		struct cgroup *cgroup = css->cgroup;
+		int ssid = root_mem_cgroup->css.ss->id;
+
+		rcu_read_lock();
+		rcu_unlock = true;
+		css = rcu_dereference_raw(cgroup->subsys[ssid]);
+	}
+
+	if (css && css_tryget(css))
+		memcg = container_of(css, struct mem_cgroup, css);
+
+	if (rcu_unlock)
+		rcu_read_unlock();
+
+	return memcg;
+}
+
+/**
+ * bpf_put_mem_cgroup - Put a reference to a memory cgroup
+ * @memcg: memory cgroup to release
+ *
+ * Releases a previously acquired memcg reference.
+ * Implements KF_RELEASE semantics.
+ */
+__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
+{
+	css_put(&memcg->css);
+}
+
+/**
+ * bpf_mem_cgroup_events - Read memory cgroup's event counter
+ * @memcg: memory cgroup
+ * @event: event idx
+ *
+ * Allows to read memory cgroup event counters.
+ */
+__bpf_kfunc unsigned long bpf_mem_cgroup_events(struct mem_cgroup *memcg, int event)
+{
+
+	if (event < 0 || event >= NR_VM_EVENT_ITEMS)
+		return (unsigned long)-1;
+
+	return memcg_events(memcg, event);
+}
+
+/**
+ * bpf_mem_cgroup_usage - Read memory cgroup's usage
+ * @memcg: memory cgroup
+ *
+ * Returns current memory cgroup size in pages.
+ */
+__bpf_kfunc unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg)
+{
+	return page_counter_read(&memcg->memory);
+}
+
+/**
+ * bpf_mem_cgroup_page_state - Read memory cgroup's page state counter
+ * @memcg: memory cgroup
+ * @idx: page state idx
+ *
+ * Allows reading memory cgroup statistics.
+ */
+__bpf_kfunc unsigned long bpf_mem_cgroup_page_state(struct mem_cgroup *memcg, int idx)
+{
+	if (idx < 0 || idx >= MEMCG_NR_STAT)
+		return (unsigned long)-1;
+
+	return memcg_page_state(memcg, idx);
+}
+
+/**
+ * bpf_mem_cgroup_flush_stats - Flush memory cgroup's statistics
+ * @memcg: memory cgroup
+ *
+ * Propagate memory cgroup's statistics up the cgroup tree.
+ *
+ * Note that this function uses the rate-limited version of
+ * mem_cgroup_flush_stats() to avoid hurting the system-wide
+ * performance. So bpf_mem_cgroup_flush_stats() guarantees only
+ * that statistics are not stale beyond 2*FLUSH_TIME.
+ */ +__bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg) +{ + mem_cgroup_flush_stats_ratelimited(memcg); +} + +__bpf_kfunc_end_defs(); + +BTF_KFUNCS_START(bpf_memcontrol_kfuncs) +BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE) + +BTF_ID_FLAGS(func, bpf_mem_cgroup_events, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_mem_cgroup_usage, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_mem_cgroup_page_state, KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_mem_cgroup_flush_stats, KF_TRUSTED_ARGS) + +BTF_KFUNCS_END(bpf_memcontrol_kfuncs) + +static const struct btf_kfunc_id_set bpf_memcontrol_kfunc_set =3D { + .owner =3D THIS_MODULE, + .set =3D &bpf_memcontrol_kfuncs, +}; + +static int __init bpf_memcontrol_init(void) +{ + int err; + + err =3D register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, + &bpf_memcontrol_kfunc_set); + if (err) + pr_warn("error while registering bpf memcontrol kfuncs: %d", err); + + return err; +} +late_initcall(bpf_memcontrol_init); --=20 2.50.1 From nobody Sat Oct 4 09:40:55 2025 Received: from out-188.mta1.migadu.com (out-188.mta1.migadu.com [95.215.58.188]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E3E193469F7 for ; Mon, 18 Aug 2025 17:02:08 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.188 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536530; cv=none; b=HfemoKnROcEejwG9UKlESniEw1GkqGvkEiuFW3BFUHhCQaNeKOcEUA7q/pLognpQWOHZggyWHvYnw0Z0u4dRWBYpJsCkhwIaYyERjg04/l/3Rg9z5kigOBctqQ/gK94LbkHN3WuONyDMd75L+hrnAEZwu9ZJc5Mcv0O3jSzI3V4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536530; c=relaxed/simple; bh=E2BhppgzFUlQ5lTqeZyZg+1Z/QXYkFBhZ9Y5K0aoKpQ=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; 
b=Up49X+UOW8GC1daONpOu411relWfwMNamGxv1YLI4vrWZ4FG1zJb1A3yH/N0lFnMo2WSQVpwcYSDVCJPuaoTiljrGRIBApdTUipH1uuniTuoAppZpjj5boAEarhfalHOkt9naXoap7bwa4Sl5x+2WwDInL5PTuUt2FqAk6k6bRA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=UQf/ux/y; arc=none smtp.client-ip=95.215.58.188 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="UQf/ux/y" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1755536526; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=a6r+LPu0pSFsz4Xm/eQ1z/XPznLAX+8OLgc0s/PVh14=; b=UQf/ux/y+q4m6txXDXG2RhzT0g0AxXDkgQwJ8Af7N6Bmd0yDMCdgna4YuRuOSEiaZnRX4x VjU4YYWKAFrW3w0CP6RuXC7F4PccfTvnLX6eStoaE9l31FapcAhFtZPb+oltmQ0qGtf75b k5ugS/MCzTGdcS9vUtDiOPKu0aiQYiw= From: Roman Gushchin To: linux-mm@kvack.org, bpf@vger.kernel.org Cc: Suren Baghdasaryan , Johannes Weiner , Michal Hocko , David Rientjes , Matt Bobrowski , Song Liu , Kumar Kartikeya Dwivedi , Alexei Starovoitov , Andrew Morton , linux-kernel@vger.kernel.org, Roman Gushchin Subject: [PATCH v1 05/14] mm: introduce bpf_get_root_mem_cgroup() bpf kfunc Date: Mon, 18 Aug 2025 10:01:27 -0700 Message-ID: <20250818170136.209169-6-roman.gushchin@linux.dev> In-Reply-To: <20250818170136.209169-1-roman.gushchin@linux.dev> References: <20250818170136.209169-1-roman.gushchin@linux.dev> Precedence: bulk X-Mailing-List: 
linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Introduce a bpf kfunc to get a trusted pointer to the root memory cgroup. It's very handy for traversing the full memcg tree, e.g. for handling a system-wide OOM. It's possible to obtain this pointer by traversing the memcg tree up from any known memcg, but it's sub-optimal and makes bpf programs more complex and less efficient. bpf_get_root_mem_cgroup() has KF_ACQUIRE | KF_RET_NULL semantics; however, in reality it's not necessary to bump the corresponding reference counter: the root memory cgroup is immortal and reference counting is skipped, see css_get(). Once set, root_mem_cgroup is always a valid memcg pointer. It's safe to call bpf_put_mem_cgroup() for the pointer obtained with bpf_get_root_mem_cgroup(); it's effectively a no-op. Signed-off-by: Roman Gushchin --- mm/bpf_memcontrol.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c index 66f2a359af7e..a8faa561bcba 100644 --- a/mm/bpf_memcontrol.c +++ b/mm/bpf_memcontrol.c @@ -10,6 +10,20 @@ =20 __bpf_kfunc_start_defs(); =20 +/** + * bpf_get_root_mem_cgroup - Returns a pointer to the root memory cgroup + * + * The function has KF_ACQUIRE semantics, even though the root memory + * cgroup is never destroyed after being created and doesn't require + * reference counting.
And it's perfectly safe to pass it to + * bpf_put_mem_cgroup() + */ +__bpf_kfunc struct mem_cgroup *bpf_get_root_mem_cgroup(void) +{ + /* css_get() is not needed */ + return root_mem_cgroup; +} + /** * bpf_get_mem_cgroup - Get a reference to a memory cgroup * @css: pointer to the css structure @@ -122,6 +136,7 @@ __bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_= cgroup *memcg) __bpf_kfunc_end_defs(); =20 BTF_KFUNCS_START(bpf_memcontrol_kfuncs) +BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE) =20 --=20 2.50.1 From nobody Sat Oct 4 09:40:55 2025 Received: from out-183.mta1.migadu.com (out-183.mta1.migadu.com [95.215.58.183]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 59BBE34AB18 for ; Mon, 18 Aug 2025 17:02:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.183 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536535; cv=none; b=GY1dJPKX5v7WJZ0SO89NmS85iPx+AQjriDiwa0Z74EdDuyy3LQPJBPf1czQF5mgnCTHt0yswksbbxq8mjPrPTPyKFE0FY8G4/hRRLWyOoRRpMcjWwF267Fe443CgzCnW/rHp5NmiKgnGP2KWMqjuLrAhOciynWDxdT5irhp41PU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536535; c=relaxed/simple; bh=C1Gbqaqm73xyGfQep8dWlho+A3NrvXS97rehBZBBN6c=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=rPeV6mw5kIBnyQJxk7u1p975CkxPw2QwaZGSvTZPDkbRxMe/n155XwOv/P7DXWCuQJlin/jvebskfEDn6eaBgWETwc4sA4lttodbEmPwpozKk/hFBgXjkNLmriQFULfDidhnUgmkV5rdZ+u9IwyJFxO2UWRvXNAZGADAX9lfUqo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=cnTHBb3v; arc=none 
smtp.client-ip=95.215.58.183 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="cnTHBb3v" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1755536530; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=EJDVBZVxsGNSev7vCHYTiVf7Gz5dWdPKYQSXDL3bVeg=; b=cnTHBb3vfy+2gSumXKhXWxCrMLvGYp0NrdSLAlXmGwwmwjg9bLrvJBl/jNgJ6Iwhm7LZpM a/3hqIIZv1M0mFgr/n4N23F/a+OeyXEJ86Exhc2ysTnZOPo5s/raQHSr5UKsCJ7h8nPo59 Um5UemvfOJunakaWWZzZQkEikHuITbg= From: Roman Gushchin To: linux-mm@kvack.org, bpf@vger.kernel.org Cc: Suren Baghdasaryan , Johannes Weiner , Michal Hocko , David Rientjes , Matt Bobrowski , Song Liu , Kumar Kartikeya Dwivedi , Alexei Starovoitov , Andrew Morton , linux-kernel@vger.kernel.org, Roman Gushchin Subject: [PATCH v1 06/14] mm: introduce bpf_out_of_memory() bpf kfunc Date: Mon, 18 Aug 2025 10:01:28 -0700 Message-ID: <20250818170136.209169-7-roman.gushchin@linux.dev> In-Reply-To: <20250818170136.209169-1-roman.gushchin@linux.dev> References: <20250818170136.209169-1-roman.gushchin@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Introduce bpf_out_of_memory() bpf kfunc, which allows to declare an out of memory events and trigger the corresponding kernel OOM handling mechanism. 
It takes a trusted memcg pointer (or NULL for system-wide OOMs) as an argument, as well as the page order. Only one OOM can be declared and handled in the system at once. If the wait_on_oom_lock argument is not set and the function is called in parallel to another OOM handling, it bails out with -EBUSY. This mode is suited for global OOMs: any concurrent OOM handling will likely do the job and release some memory. In the blocking mode (which is suited for memcg OOMs) the execution waits on the oom_lock mutex. The function is declared as sleepable, which guarantees that it won't be called from an atomic context. This is required by the OOM handling code, which is not guaranteed to work in a non-blocking context. Handling a memcg OOM almost always requires taking the css_set_lock spinlock. The fact that bpf_out_of_memory() is sleepable also guarantees that it can't be called with css_set_lock held, so the kernel can't deadlock on it. Signed-off-by: Roman Gushchin --- mm/oom_kill.c | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 25fc5e744e27..df409f0fac45 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1324,10 +1324,55 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_con= trol *oc, return 0; } =20 +/** + * bpf_out_of_memory - declare Out Of Memory state and invoke OOM killer + * @memcg__nullable: memcg or NULL for system-wide OOMs + * @order: order of page which wasn't allocated + * @wait_on_oom_lock: if true, block on oom_lock + * @constraint_text__nullable: custom constraint description for the OOM r= eport + * + * Declares the Out Of Memory state and invokes the OOM killer. + * + * OOM handlers are synchronized using the oom_lock mutex. If wait_on_oom_= lock + * is true, the function will wait on it. Otherwise it bails out with -EBU= SY + * if oom_lock is contended.
+ * + * Generally it's advised to pass wait_on_oom_lock=3Dfalse for global OOMs + * and wait_on_oom_lock=3Dtrue for memcg-scoped OOMs. + * + * Returns 1 if the forward progress was achieved and some memory was free= d. + * Returns a negative value if an error has been occurred. + */ +__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable, + int order, bool wait_on_oom_lock) +{ + struct oom_control oc =3D { + .memcg =3D memcg__nullable, + .order =3D order, + }; + int ret; + + if (oc.order < 0 || oc.order > MAX_PAGE_ORDER) + return -EINVAL; + + if (wait_on_oom_lock) { + ret =3D mutex_lock_killable(&oom_lock); + if (ret) + return ret; + } else if (!mutex_trylock(&oom_lock)) + return -EBUSY; + + ret =3D out_of_memory(&oc); + + mutex_unlock(&oom_lock); + return ret; +} + __bpf_kfunc_end_defs(); =20 BTF_KFUNCS_START(bpf_oom_kfuncs) BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE | KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_out_of_memory, KF_SLEEPABLE | KF_TRUSTED_ARGS) BTF_KFUNCS_END(bpf_oom_kfuncs) =20 static const struct btf_kfunc_id_set bpf_oom_kfunc_set =3D { --=20 2.50.1 From nobody Sat Oct 4 09:40:55 2025 Received: from out-187.mta1.migadu.com (out-187.mta1.migadu.com [95.215.58.187]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5360334AB17 for ; Mon, 18 Aug 2025 17:02:15 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.187 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536537; cv=none; b=clqAAAhvOQrl2q7oZKc9pVCTCuOcHkXhpqaBKUKPY3kbkU4M2WH5gbQUKy1CTg6zHnjU78VtQhlIfycWisYC/5/sQkoTmINYY0RkXybIaTaDyOOuMLt982J+9bEizeOjOeLfB4sVGfeSGQEg2BSBqJmWBE+oIJSoX0rEzvfZVdI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536537; c=relaxed/simple; bh=iEwPyLarkYZ1SYl78PTz6f3Tq7siad3IfaAtYtYAiwk=;
h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=U9T+G4C121ZKS83ww+VW74rI7qVtNp2v330GkCCK47fRrG1i/hqagJHwiMZbQZ72ksoh4q5858YxHYfF/w1QNlw09kFIJOcHoaZrwM0V+RseciAY3gOgBMBTV+1c4EC8aD2f0ZNBjNveStWmodU2PvC/a4PBHeasmr6fYeHBuWg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=fGk1dplm; arc=none smtp.client-ip=95.215.58.187 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="fGk1dplm" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1755536533; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=+PLsLBHEH7fH2DeVbRY37aGg0nfytx5nHxzamt/J4w8=; b=fGk1dplmhX8TBaKAV1v7A3rxD9Ol2v1dVG0iJpNbN08nSv/gNpwrxCdezuKtXAgWuH7pnH 0ubEOf+1Y1+QbRYQuqvw96/rOTlUWv2I2DKvu9jhtXsjugbpPT3L17R5EjGmPe28KovF8S xmIqxrmPCQw+j3alA32yMAWl/zfhADM= From: Roman Gushchin To: linux-mm@kvack.org, bpf@vger.kernel.org Cc: Suren Baghdasaryan , Johannes Weiner , Michal Hocko , David Rientjes , Matt Bobrowski , Song Liu , Kumar Kartikeya Dwivedi , Alexei Starovoitov , Andrew Morton , linux-kernel@vger.kernel.org, Roman Gushchin Subject: [PATCH v1 07/14] mm: allow specifying custom oom constraint for bpf triggers Date: Mon, 18 Aug 2025 10:01:29 -0700 Message-ID: <20250818170136.209169-8-roman.gushchin@linux.dev> In-Reply-To: <20250818170136.209169-1-roman.gushchin@linux.dev> References: 
<20250818170136.209169-1-roman.gushchin@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Currently there is a hard-coded list of possible oom constraints: NONE, CPUSET, MEMORY_POLICY & MEMCG. Add a new one: CONSTRAINT_BPF. Also, add the ability to specify a custom constraint name when calling bpf_out_of_memory(). If NULL is passed as the argument, CONSTRAINT_BPF is displayed. The resulting output in dmesg will look like this: [ 315.224875] kworker/u17:0 invoked oom-killer: gfp_mask=3D0x0(), order=3D= 0, oom_score_adj=3D0 oom_policy=3Ddefault [ 315.226532] CPU: 1 UID: 0 PID: 74 Comm: kworker/u17:0 Not tainted 6.16.0= -00015-gf09eb0d6badc #102 PREEMPT(full) [ 315.226534] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS = 1.17.0-5.fc42 04/01/2014 [ 315.226536] Workqueue: bpf_psi_wq bpf_psi_handle_event_fn [ 315.226542] Call Trace: [ 315.226545] [ 315.226548] dump_stack_lvl+0x4d/0x70 [ 315.226555] dump_header+0x59/0x1c6 [ 315.226561] oom_kill_process.cold+0x8/0xef [ 315.226565] out_of_memory+0x111/0x5c0 [ 315.226577] bpf_out_of_memory+0x6f/0xd0 [ 315.226580] ? srso_alias_return_thunk+0x5/0xfbef5 [ 315.226589] bpf_prog_3018b0cf55d2c6bb_handle_psi_event+0x5d/0x76 [ 315.226594] bpf__bpf_psi_ops_handle_psi_event+0x47/0xa7 [ 315.226599] bpf_psi_handle_event_fn+0x63/0xb0 [ 315.226604] process_one_work+0x1fc/0x580 [ 315.226616] ? srso_alias_return_thunk+0x5/0xfbef5 [ 315.226624] worker_thread+0x1d9/0x3b0 [ 315.226629] ? __pfx_worker_thread+0x10/0x10 [ 315.226632] kthread+0x128/0x270 [ 315.226637] ? lock_release+0xd4/0x2d0 [ 315.226645] ? __pfx_kthread+0x10/0x10 [ 315.226649] ret_from_fork+0x81/0xd0 [ 315.226652] ?
__pfx_kthread+0x10/0x10 [ 315.226655] ret_from_fork_asm+0x1a/0x30 [ 315.226667] [ 315.239745] memory: usage 42240kB, limit 9007199254740988kB, failcnt 0 [ 315.240231] swap: usage 0kB, limit 0kB, failcnt 0 [ 315.240585] Memory cgroup stats for /cgroup-test-work-dir673/oom_test/cg= 2: [ 315.240603] anon 42897408 [ 315.241317] file 0 [ 315.241493] kernel 98304 ... [ 315.255946] Tasks state (memory values in pages): [ 315.256292] [ pid ] uid tgid total_vm rss rss_anon rss_file rs= s_shmem pgtables_bytes swapents oom_score_adj name [ 315.257107] [ 675] 0 675 162013 10969 10712 257 = 0 155648 0 0 test_progs [ 315.257927] oom-kill:constraint=3DCONSTRAINT_BPF_PSI_MEM,nodemask=3D(nul= l),cpuset=3D/,mems_allowed=3D0,oom_memcg=3D/cgroup-test-work-dir673/oom_tes= t/cg2,task_memcg=3D/cgroup-test-work-dir673/oom_test/cg2,task=3Dtest_progs,= pid=3D675,uid=3D0 [ 315.259371] Memory cgroup out of memory: Killed process 675 (test_progs)= total-vm:648052kB, anon-rss:42848kB, file-rss:1028kB, shmem-rss:0kB, UID:0= pgtables:152kB oom_score_adj:0 Signed-off-by: Roman Gushchin --- include/linux/oom.h | 4 ++++ mm/oom_kill.c | 38 +++++++++++++++++++++++++++++--------- 2 files changed, 33 insertions(+), 9 deletions(-) diff --git a/include/linux/oom.h b/include/linux/oom.h index ef453309b7ea..4b04944b42de 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -19,6 +19,7 @@ enum oom_constraint { CONSTRAINT_CPUSET, CONSTRAINT_MEMORY_POLICY, CONSTRAINT_MEMCG, + CONSTRAINT_BPF, }; =20 /* @@ -58,6 +59,9 @@ struct oom_control { =20 /* Policy name */ const char *bpf_policy_name; + + /* BPF-specific constraint name */ + const char *bpf_constraint; #endif }; =20 diff --git a/mm/oom_kill.c b/mm/oom_kill.c index df409f0fac45..67afcd43a5f7 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -240,13 +240,6 @@ long oom_badness(struct task_struct *p, unsigned long = totalpages) return points; } =20 -static const char * const oom_constraint_text[] =3D { - [CONSTRAINT_NONE] =3D "CONSTRAINT_NONE", - 
[CONSTRAINT_CPUSET] =3D "CONSTRAINT_CPUSET", - [CONSTRAINT_MEMORY_POLICY] =3D "CONSTRAINT_MEMORY_POLICY", - [CONSTRAINT_MEMCG] =3D "CONSTRAINT_MEMCG", -}; - static const char *oom_policy_name(struct oom_control *oc) { #ifdef CONFIG_BPF_SYSCALL @@ -256,6 +249,27 @@ static const char *oom_policy_name(struct oom_control = *oc) return "default"; } =20 +static const char *oom_constraint_text(struct oom_control *oc) +{ + switch (oc->constraint) { + case CONSTRAINT_NONE: + return "CONSTRAINT_NONE"; + case CONSTRAINT_CPUSET: + return "CONSTRAINT_CPUSET"; + case CONSTRAINT_MEMORY_POLICY: + return "CONSTRAINT_MEMORY_POLICY"; + case CONSTRAINT_MEMCG: + return "CONSTRAINT_MEMCG"; +#ifdef CONFIG_BPF_SYSCALL + case CONSTRAINT_BPF: + return oc->bpf_constraint ? : "CONSTRAINT_BPF"; +#endif + default: + WARN_ON_ONCE(1); + return ""; + } +} + /* * Determine the type of allocation constraint. */ @@ -267,6 +281,9 @@ static enum oom_constraint constrained_alloc(struct oom= _control *oc) bool cpuset_limited =3D false; int nid; =20 + if (oc->constraint =3D=3D CONSTRAINT_BPF) + return CONSTRAINT_BPF; + if (is_memcg_oom(oc)) { oc->totalpages =3D mem_cgroup_get_max(oc->memcg) ?: 1; return CONSTRAINT_MEMCG; @@ -458,7 +475,7 @@ static void dump_oom_victim(struct oom_control *oc, str= uct task_struct *victim) { /* one line summary of the oom killer context. */ pr_info("oom-kill:constraint=3D%s,nodemask=3D%*pbl", - oom_constraint_text[oc->constraint], + oom_constraint_text(oc), nodemask_pr_args(oc->nodemask)); cpuset_print_current_mems_allowed(); mem_cgroup_print_oom_context(oc->memcg, victim); @@ -1344,11 +1361,14 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_con= trol *oc, * Returns a negative value if an error has been occurred. 
*/ __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable, - int order, bool wait_on_oom_lock) + int order, bool wait_on_oom_lock, + const char *constraint_text__nullable) { struct oom_control oc =3D { .memcg =3D memcg__nullable, .order =3D order, + .constraint =3D CONSTRAINT_BPF, + .bpf_constraint =3D constraint_text__nullable, }; int ret; =20 --=20 2.50.1 From nobody Sat Oct 4 09:40:55 2025 Received: from out-188.mta1.migadu.com (out-188.mta1.migadu.com [95.215.58.188]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 84CD5343D9A for ; Mon, 18 Aug 2025 17:02:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.188 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536540; cv=none; b=e6q5SyaQSfHQOMaI8JL/Ektmb2tuDaJk3s8y09mFpsRU+cF4eR8ZwybWQYqPblrXXDRxeQdsl5+wkW+YIDlWsD46N8pBfKNDXpHQ//73faGKMfv8AZ9YehV0E3lTTGp/vmPOjKcFRngHhiJHLSlEHFilorUnzYohhFPgA+cAhj0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536540; c=relaxed/simple; bh=L76RNvsn/XDlLpgNSihLqM9GD4IV0OFBqI3VWeftGLQ=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=mvg1v2dGEiVgn1BmqO07GRYkw/D7Y7QlXW3fEVeZZEofJyebWtrqOvcNuqAI9vOXGtYa1Tm9eTuALIChAwR+IinBanERvJQ+MefLsNkJUKfzZR5/4LvQm9BDex0p4fR3DeLis9fRJSX4SCo9Xx4km4xgdRiv3T5BgxbMx/2Bhnw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=c9BU2ZxZ; arc=none smtp.client-ip=95.215.58.188 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) 
header.d=linux.dev header.i=@linux.dev header.b="c9BU2ZxZ" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1755536537; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Mwt3OHKrQ6H4MDHsSz3qoIKrO+SSIvtmYL/lFo8ussA=; b=c9BU2ZxZSDg31TJtdgdrRnY30VLA37u2N0l2k3tuhQngnqGpEuOtjGzNcqFUWpGOsUrQC5 4dgKOQdbOg7/I7V5T39IJlU3N2D5ZsIQhdJ+eAoQsA5lfTXIJTEaHMof7exLhQbd8Rb/Cv amf9TSDOqkbE68038pDGb423QpmyCEY= From: Roman Gushchin To: linux-mm@kvack.org, bpf@vger.kernel.org Cc: Suren Baghdasaryan , Johannes Weiner , Michal Hocko , David Rientjes , Matt Bobrowski , Song Liu , Kumar Kartikeya Dwivedi , Alexei Starovoitov , Andrew Morton , linux-kernel@vger.kernel.org, Roman Gushchin Subject: [PATCH v1 08/14] mm: introduce bpf_task_is_oom_victim() kfunc Date: Mon, 18 Aug 2025 10:01:30 -0700 Message-ID: <20250818170136.209169-9-roman.gushchin@linux.dev> In-Reply-To: <20250818170136.209169-1-roman.gushchin@linux.dev> References: <20250818170136.209169-1-roman.gushchin@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Export tsk_is_oom_victim() helper as a bpf kfunc. It's very useful to avoid redundant oom kills. 
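As an illustration only (not part of this patch), the intended use can be sketched in plain user-space C: a policy scans its candidates and backs off when one of them is already an OOM victim. The model_task structure and the model_* functions below are hypothetical stand-ins for struct task_struct and the kfunc; a real BPF program would call bpf_task_is_oom_victim() on a trusted task pointer instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct task_struct; not the kernel type. */
struct model_task {
	int pid;
	unsigned long rss_pages;
	bool oom_victim;	/* models what tsk_is_oom_victim() reports */
};

/* Models bpf_task_is_oom_victim() for the stand-in type above. */
static bool model_task_is_oom_victim(const struct model_task *task)
{
	return task->oom_victim;
}

/*
 * Pick the largest task as the next victim. Return NULL if any task is
 * already an OOM victim: its memory is expected to be freed soon, so
 * another kill would be redundant. Also returns NULL for an empty set.
 */
static const struct model_task *
model_select_victim(const struct model_task *tasks, size_t nr)
{
	const struct model_task *victim = NULL;
	size_t i;

	assert(tasks != NULL || nr == 0);

	for (i = 0; i < nr; i++) {
		if (model_task_is_oom_victim(&tasks[i]))
			return NULL;
		if (!victim || tasks[i].rss_pages > victim->rss_pages)
			victim = &tasks[i];
	}

	return victim;
}
```

With this shape, a second invocation of the policy while the first victim is still exiting selects nothing, so no further task is killed.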
Signed-off-by: Roman Gushchin --- mm/oom_kill.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 67afcd43a5f7..fe6e69dfbdba 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1388,11 +1388,25 @@ __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup= *memcg__nullable, return ret; } =20 +/** + * bpf_task_is_oom_victim - Check if the task has been marked as an OOM vi= ctim + * @task: task to check + * + * Returns true if the task has been previously selected by the OOM killer + * to be killed. It's expected that the task will be destroyed soon and so= me + * memory will be freed, so maybe no additional actions required. + */ +__bpf_kfunc bool bpf_task_is_oom_victim(struct task_struct *task) +{ + return tsk_is_oom_victim(task); +} + __bpf_kfunc_end_defs(); =20 BTF_KFUNCS_START(bpf_oom_kfuncs) BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE | KF_TRUSTED_ARGS) BTF_ID_FLAGS(func, bpf_out_of_memory, KF_SLEEPABLE | KF_TRUSTED_ARGS) +BTF_ID_FLAGS(func, bpf_task_is_oom_victim, KF_TRUSTED_ARGS) BTF_KFUNCS_END(bpf_oom_kfuncs) =20 static const struct btf_kfunc_id_set bpf_oom_kfunc_set =3D { --=20 2.50.1 From nobody Sat Oct 4 09:40:55 2025 Received: from out-186.mta1.migadu.com (out-186.mta1.migadu.com [95.215.58.186]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id EECA93570A4 for ; Mon, 18 Aug 2025 17:02:22 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.186 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536544; cv=none; b=OrR/nkEdMxKrMcwVlmcBE6yP00PqeAQSOHuzwFmnD6K1nle56yE6HcZwRpBXyG51JdCu9hkRr3/qGLQcuobIZj0ojB4jILkUkR4r96DzcUdXPgDML2Rzh8v3xnx4DqqodDjYUKL8g/t/PFW92JqDKXD/OxvmWMAUAIhE0UPUuMA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536544; c=relaxed/simple; 
bh=mMxj2cpuBPPJO4bgeDuIYxUpij9GGqzSYoXBNaz7/lY=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=PvwZVqaCoKm8Z370XbvRKUbcjnqOS9vT1zaclQ9FE83TmVPlFNrvxwBoyNNk7ennckkgRGZfbdo+leF4MJQCA1Xnu8LvwbP1xjqnKZbOPvpBaszUDwp7O8b3SHVjlsEQyDObnagn0tMrtDKUPy9gU8qtU2EeG2SCtEgii46BMtg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=K7chACaY; arc=none smtp.client-ip=95.215.58.186 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="K7chACaY" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1755536541; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=fyQY6WbHHR935OUk0tr0hH/Nv0XIIXTz4uXs9PTYD5s=; b=K7chACaYMDaRGExX1A6HHl1xtyW3fyGo3b4d8d7yQX+89xHyZKKYsNXUFRUAJl7ieMHKQQ 9lYnqpAC/gwQ/sUsGNIpFUZZxXBih2tznXkSby/csvimM2HtnoCeDG9Dy8Sf9jixe5XS8Q g3zmzuv1E5IK60R00CA8D1tLLixhhOo= From: Roman Gushchin To: linux-mm@kvack.org, bpf@vger.kernel.org Cc: Suren Baghdasaryan , Johannes Weiner , Michal Hocko , David Rientjes , Matt Bobrowski , Song Liu , Kumar Kartikeya Dwivedi , Alexei Starovoitov , Andrew Morton , linux-kernel@vger.kernel.org, Roman Gushchin Subject: [PATCH v1 09/14] bpf: selftests: introduce read_cgroup_file() helper Date: Mon, 18 Aug 2025 10:01:31 -0700 Message-ID: <20250818170136.209169-10-roman.gushchin@linux.dev> In-Reply-To: 
<20250818170136.209169-1-roman.gushchin@linux.dev> References: <20250818170136.209169-1-roman.gushchin@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Implement a read_cgroup_file() helper to read from cgroup control files, e.g. statistics. Signed-off-by: Roman Gushchin --- tools/testing/selftests/bpf/cgroup_helpers.c | 39 ++++++++++++++++++++ tools/testing/selftests/bpf/cgroup_helpers.h | 2 + 2 files changed, 41 insertions(+) diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/s= elftests/bpf/cgroup_helpers.c index e4535451322e..3ffd4b764f91 100644 --- a/tools/testing/selftests/bpf/cgroup_helpers.c +++ b/tools/testing/selftests/bpf/cgroup_helpers.c @@ -125,6 +125,45 @@ int enable_controllers(const char *relative_path, cons= t char *controllers) return __enable_controllers(cgroup_path, controllers); } =20 +static size_t __read_cgroup_file(const char *cgroup_path, const char *file, + char *buf, size_t size) +{ + char file_path[PATH_MAX + 1]; + size_t ret; + int fd; + + snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file); + fd =3D open(file_path, O_RDONLY); + if (fd < 0) { + log_err("Opening %s", file_path); + return -1; + } + + ret =3D read(fd, buf, size); + close(fd); + return ret; +} + +/** + * read_cgroup_file() - Read from a cgroup file + * @relative_path: The cgroup path, relative to the workdir + * @file: The name of the file in cgroupfs to read from + * @buf: Buffer to store the data read from the file + * @size: Size of the buffer + * + * Read from a file in the given cgroup's directory. + * + * If successful, the number of bytes read is returned.
+ */
+size_t read_cgroup_file(const char *relative_path, const char *file,
+			char *buf, size_t size)
+{
+	char cgroup_path[PATH_MAX - 24];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return __read_cgroup_file(cgroup_path, file, buf, size);
+}
+
 static int __write_cgroup_file(const char *cgroup_path, const char *file,
 			       const char *buf)
 {
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
index 502845160d88..821cb76db1f7 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.h
+++ b/tools/testing/selftests/bpf/cgroup_helpers.h
@@ -11,6 +11,8 @@
 
 /* cgroupv2 related */
 int enable_controllers(const char *relative_path, const char *controllers);
+size_t read_cgroup_file(const char *relative_path, const char *file,
+			char *buf, size_t size);
 int write_cgroup_file(const char *relative_path, const char *file,
 		      const char *buf);
 int write_cgroup_file_parent(const char *relative_path, const char *file,
-- 
2.50.1

From nobody Sat Oct 4 09:40:55 2025
From: Roman Gushchin
To: linux-mm@kvack.org, bpf@vger.kernel.org
Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi, Alexei Starovoitov,
	Andrew Morton, linux-kernel@vger.kernel.org, Roman Gushchin
Subject: [PATCH v1 10/14] bpf: selftests: bpf OOM handler test
Date: Mon, 18 Aug 2025 10:01:32 -0700
Message-ID: <20250818170136.209169-11-roman.gushchin@linux.dev>
In-Reply-To: <20250818170136.209169-1-roman.gushchin@linux.dev>
References: <20250818170136.209169-1-roman.gushchin@linux.dev>

Implement a pseudo-realistic test for the OOM handling
functionality.

The OOM handling policy which is implemented in bpf is to
kill all tasks belonging to the biggest leaf cgroup, which
doesn't contain unkillable tasks (tasks with oom_score_adj
set to -1000). Pagecache size is excluded from the accounting.

The test creates a hierarchy of memory cgroups, causes an
OOM at the top level, checks that the expected process will be
killed and checks memcg's oom statistics.

Signed-off-by: Roman Gushchin
---
 .../selftests/bpf/prog_tests/test_oom.c       | 229 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_oom.c  | 108 +++++++++
 2 files changed, 337 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom.c b/tools/testing/selftests/bpf/prog_tests/test_oom.c
new file mode 100644
index 000000000000..eaeb14a9d18f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_oom.c
@@ -0,0 +1,229 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include
+#include
+#include
+
+#include "cgroup_helpers.h"
+#include "test_oom.skel.h"
+
+struct cgroup_desc {
+	const char *path;
+	int fd;
+	unsigned long long id;
+	int pid;
+	size_t target;
+	size_t max;
+	int oom_score_adj;
+	bool victim;
+};
+
+#define MB (1024 * 1024)
+#define OOM_SCORE_ADJ_MIN	(-1000)
+#define OOM_SCORE_ADJ_MAX	1000
+
+static struct cgroup_desc cgroups[] = {
+	{ .path = "/oom_test", .max = 80 * MB},
+	{ .path = "/oom_test/cg1", .target = 10 * MB,
+	  .oom_score_adj = OOM_SCORE_ADJ_MAX },
+	{ .path = "/oom_test/cg2", .target = 40 * MB,
+	  .oom_score_adj = OOM_SCORE_ADJ_MIN },
+	{ .path = "/oom_test/cg3" },
+	{ .path = "/oom_test/cg3/cg4", .target = 30 * MB,
+	  .victim = true },
+	{ .path = "/oom_test/cg3/cg5", .target = 20 * MB },
+};
+
+static int spawn_task(struct cgroup_desc *desc)
+{
+	char *ptr;
+	int pid;
+
+	pid = fork();
+	if (pid < 0)
+		return pid;
+
+	if (pid > 0) {
+		/* parent */
+		desc->pid = pid;
+		return 0;
+	}
+
+	/* child */
+	if (desc->oom_score_adj) {
+		char buf[64];
+		int fd = open("/proc/self/oom_score_adj", O_WRONLY);
+
+		if (fd < 0)
+			return -1;
+
+		snprintf(buf, sizeof(buf), "%d", desc->oom_score_adj);
+		write(fd, buf, sizeof(buf));
+		close(fd);
+	}
+
+	ptr = (char *)malloc(desc->target);
+	if (!ptr)
+		return -ENOMEM;
+
+	memset(ptr, 'a', desc->target);
+
+	while (1)
+		sleep(1000);
+
+	return 0;
+}
+
+static void setup_environment(void)
+{
+	int i, err;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "setup_cgroup_environment"))
+		goto cleanup;
+
+	for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+		cgroups[i].fd = create_and_get_cgroup(cgroups[i].path);
+		if (!ASSERT_GE(cgroups[i].fd, 0, "create_and_get_cgroup"))
+			goto cleanup;
+
+		cgroups[i].id = get_cgroup_id(cgroups[i].path);
+		if (!ASSERT_GT(cgroups[i].id, 0, "get_cgroup_id"))
+			goto cleanup;
+
+		/* Freeze the top-level cgroup */
+		if (i == 0) {
+			/* Freeze the top-level cgroup */
+			err = write_cgroup_file(cgroups[i].path, "cgroup.freeze", "1");
+			if (!ASSERT_OK(err, "freeze cgroup"))
+				goto cleanup;
+		}
+
+		/* Recursively enable the memory controller */
+		if (!cgroups[i].target) {
+
+			err = write_cgroup_file(cgroups[i].path, "cgroup.subtree_control",
+						"+memory");
+			if (!ASSERT_OK(err, "enable memory controller"))
+				goto cleanup;
+		}
+
+		/* Set memory.max */
+		if (cgroups[i].max) {
+			char buf[256];
+
+			snprintf(buf, sizeof(buf), "%lu", cgroups[i].max);
+			err = write_cgroup_file(cgroups[i].path, "memory.max", buf);
+			if (!ASSERT_OK(err, "set memory.max"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "0");
+			write_cgroup_file(cgroups[i].path, "memory.swap.max", buf);
+
+		}
+
+		/* Spawn tasks creating memory pressure */
+		if (cgroups[i].target) {
+			char buf[256];
+
+			err = spawn_task(&cgroups[i]);
+			if (!ASSERT_OK(err, "spawn task"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "%d", cgroups[i].pid);
+			err = write_cgroup_file(cgroups[i].path, "cgroup.procs", buf);
+			if (!ASSERT_OK(err, "put child into a cgroup"))
+				goto cleanup;
+		}
+	}
+
+	return;
+
+cleanup:
+	cleanup_cgroup_environment();
+}
+
+static int run_and_wait_for_oom(void)
+{
+	int ret = -1;
+	bool first = true;
+	char buf[4096] = {};
+	size_t size;
+
+	/* Unfreeze the top-level cgroup */
+	ret = write_cgroup_file(cgroups[0].path, "cgroup.freeze", "0");
+	if (!ASSERT_OK(ret, "freeze cgroup"))
+		return -1;
+
+	for (;;) {
+		int i, status;
+		pid_t pid = wait(&status);
+
+		if (pid == -1) {
+			if (errno == EINTR)
+				continue;
+			/* ECHILD */
+			break;
+		}
+
+		if (!first)
+			continue;
+
+		first = false;
+
+		/* Check which process was terminated first */
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+			if (!ASSERT_OK(cgroups[i].victim !=
+				       (pid == cgroups[i].pid),
+				       "correct process was killed")) {
+				ret = -1;
+				break;
+			}
+
+			if (!cgroups[i].victim)
+				continue;
+
+			/* Check the memcg oom counter */
+			size = read_cgroup_file(cgroups[i].path,
+						"memory.events",
+						buf, sizeof(buf));
+			if (!ASSERT_OK(size <= 0, "read memory.events")) {
+				ret = -1;
+				break;
+			}
+
+			if (!ASSERT_OK(strstr(buf, "oom_kill 1") == NULL,
+				       "oom_kill count check")) {
+				ret = -1;
+				break;
+			}
+		}
+
+		/* Kill all remaining tasks */
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++)
+			if (cgroups[i].pid && cgroups[i].pid != pid)
+				kill(cgroups[i].pid, SIGKILL);
+	}
+
+	return ret;
+}
+
+void test_oom(void)
+{
+	struct test_oom *skel;
+	int err;
+
+	setup_environment();
+
+	skel = test_oom__open_and_load();
+	err = test_oom__attach(skel);
+	if (CHECK_FAIL(err))
+		goto cleanup;
+
+	/* Unfreeze all child tasks and create the memory pressure */
+	err = run_and_wait_for_oom();
+	CHECK_FAIL(err);
+
+cleanup:
+	cleanup_cgroup_environment();
+	test_oom__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_oom.c b/tools/testing/selftests/bpf/progs/test_oom.c
new file mode 100644
index 000000000000..ca83563fc9a8
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_oom.c
@@ -0,0 +1,108 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "vmlinux.h"
+#include
+#include
+
+char _license[] SEC("license") = "GPL";
+
+#define OOM_SCORE_ADJ_MIN	(-1000)
+
+void bpf_rcu_read_lock(void) __ksym;
+void bpf_rcu_read_unlock(void) __ksym;
+struct task_struct *bpf_task_acquire(struct task_struct *p) __ksym;
+void bpf_task_release(struct task_struct *p) __ksym;
+struct mem_cgroup *bpf_get_root_mem_cgroup(void) __ksym;
+struct mem_cgroup *bpf_get_mem_cgroup(struct cgroup_subsys_state *css) __ksym;
+void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym;
+int bpf_oom_kill_process(struct oom_control *oc, struct task_struct *task,
+			 const char *message__str) __ksym;
+
+static bool mem_cgroup_killable(struct mem_cgroup *memcg)
+{
+	struct task_struct *task;
+	bool ret = true;
+
+	bpf_for_each(css_task, task, &memcg->css, CSS_TASK_ITER_PROCS)
+		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			return false;
+
+	return ret;
+}
+
+/*
+ * Find the largest leaf cgroup (ignoring page cache) without unkillable tasks
+ * and kill all belonging tasks.
+ */
+SEC("struct_ops.s/handle_out_of_memory")
+int BPF_PROG(test_out_of_memory, struct oom_control *oc)
+{
+	struct task_struct *task;
+	struct mem_cgroup *root_memcg = oc->memcg;
+	struct mem_cgroup *memcg, *victim = NULL;
+	struct cgroup_subsys_state *css_pos;
+	unsigned long usage, max_usage = 0;
+	unsigned long pagecache = 0;
+	int ret = 0;
+
+	if (root_memcg)
+		root_memcg = bpf_get_mem_cgroup(&root_memcg->css);
+	else
+		root_memcg = bpf_get_root_mem_cgroup();
+
+	if (!root_memcg)
+		return 0;
+
+	bpf_rcu_read_lock();
+	bpf_for_each(css, css_pos, &root_memcg->css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
+		if (css_pos->cgroup->nr_descendants + css_pos->cgroup->nr_dying_descendants)
+			continue;
+
+		memcg = bpf_get_mem_cgroup(css_pos);
+		if (!memcg)
+			continue;
+
+		usage = bpf_mem_cgroup_usage(memcg);
+		pagecache = bpf_mem_cgroup_page_state(memcg, NR_FILE_PAGES);
+
+		if (usage > pagecache)
+			usage -= pagecache;
+		else
+			usage = 0;
+
+		if ((usage > max_usage) && mem_cgroup_killable(memcg)) {
+			max_usage = usage;
+			if (victim)
+				bpf_put_mem_cgroup(victim);
+			victim = bpf_get_mem_cgroup(&memcg->css);
+		}
+
+		bpf_put_mem_cgroup(memcg);
+	}
+	bpf_rcu_read_unlock();
+
+	if (!victim)
+		goto exit;
+
+	bpf_for_each(css_task, task, &victim->css, CSS_TASK_ITER_PROCS) {
+		struct task_struct *t = bpf_task_acquire(task);
+
+		if (t) {
+			if (!bpf_task_is_oom_victim(task))
+				bpf_oom_kill_process(oc, task, "bpf oom test");
+			bpf_task_release(t);
+			ret = 1;
+		}
+	}
+
+	bpf_put_mem_cgroup(victim);
+exit:
+	bpf_put_mem_cgroup(root_memcg);
+
+	return ret;
+}
+
+SEC(".struct_ops.link")
+struct bpf_oom_ops test_bpf_oom = {
+	.name = "bpf_test_policy",
+	.handle_out_of_memory = (void *)test_out_of_memory,
+};
-- 
2.50.1

From nobody Sat Oct 4 09:40:55 2025
From: Roman Gushchin
To: linux-mm@kvack.org, bpf@vger.kernel.org
Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi, Alexei Starovoitov,
	Andrew Morton, linux-kernel@vger.kernel.org, Roman Gushchin
Subject: [PATCH v1 11/14] sched: psi: refactor psi_trigger_create()
Date: Mon, 18 Aug 2025 10:01:33 -0700
Message-ID: <20250818170136.209169-12-roman.gushchin@linux.dev>
In-Reply-To: <20250818170136.209169-1-roman.gushchin@linux.dev>
References: <20250818170136.209169-1-roman.gushchin@linux.dev>

Currently psi_trigger_create() does a lot of things: it parses the
user text input, allocates and initializes the psi_trigger structure
and turns on the trigger. It does this slightly differently for the
two existing types of psi_triggers: system-wide and cgroup-wide.

In order to support a new type of psi triggers, which will be owned
by a bpf program and won't have a user's text description, let's
refactor psi_trigger_create().

1. Introduce psi_trigger_type enum:
   currently PSI_SYSTEM and PSI_CGROUP are valid values.
2. Introduce psi_trigger_params structure to avoid passing a large
   number of parameters to psi_trigger_create().
3. Move out the user's input parsing into the new psi_trigger_parse()
   helper.
4. Move out the capabilities check into the new
   psi_file_privileged() helper.
5. Stop relying on t->of for detecting trigger type.

Signed-off-by: Roman Gushchin
---
 include/linux/psi.h       | 15 +++++--
 include/linux/psi_types.h | 33 ++++++++++++++-
 kernel/cgroup/cgroup.c    | 14 ++++++-
 kernel/sched/psi.c        | 87 +++++++++++++++++++++++++--------------
 4 files changed, 112 insertions(+), 37 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index e0745873e3f2..8178e998d94b 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -23,14 +23,23 @@ void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
 int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
-struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
-				       enum psi_res res, struct file *file,
-				       struct kernfs_open_file *of);
+int psi_trigger_parse(struct psi_trigger_params *params, const char *buf);
+struct psi_trigger *psi_trigger_create(struct psi_group *group,
+				const struct psi_trigger_params *param);
 void psi_trigger_destroy(struct psi_trigger *t);
 
 __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
 			poll_table *wait);
 
+static inline bool psi_file_privileged(struct file *file)
+{
+	/*
+	 * Checking the privilege here on file->f_cred implies that a privileged user
+	 * could open the file and delegate the write to an unprivileged one.
+	 */
+	return cap_raised(file->f_cred->cap_effective, CAP_SYS_RESOURCE);
+}
+
 #ifdef CONFIG_CGROUPS
 static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
 {
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index f1fd3a8044e0..cea54121d9b9 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -121,7 +121,38 @@ struct psi_window {
 	u64 prev_growth;
 };
 
+enum psi_trigger_type {
+	PSI_SYSTEM,
+	PSI_CGROUP,
+};
+
+struct psi_trigger_params {
+	/* Trigger type */
+	enum psi_trigger_type type;
+
+	/* Resources that workloads could be stalled on */
+	enum psi_res res;
+
+	/* True if all threads should be stalled to trigger */
+	bool full;
+
+	/* Threshold in us */
+	u32 threshold_us;
+
+	/* Window in us */
+	u32 window_us;
+
+	/* Privileged triggers are treated differently */
+	bool privileged;
+
+	/* Link to kernfs open file, only for PSI_CGROUP */
+	struct kernfs_open_file *of;
+};
+
 struct psi_trigger {
+	/* Trigger type */
+	enum psi_trigger_type type;
+
 	/* PSI state being monitored by the trigger */
 	enum psi_states state;
 
@@ -137,7 +168,7 @@ struct psi_trigger {
 	/* Wait queue for polling */
 	wait_queue_head_t event_wait;
 
-	/* Kernfs file for cgroup triggers */
+	/* Kernfs file for PSI_CGROUP triggers */
 	struct kernfs_open_file *of;
 
 	/* Pending event flag */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a723b7dc6e4e..9cd3c3a52c21 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3872,6 +3872,12 @@ static ssize_t pressure_write(struct kernfs_open_file *of, char *buf,
 	struct psi_trigger *new;
 	struct cgroup *cgrp;
 	struct psi_group *psi;
+	struct psi_trigger_params params;
+	int err;
+
+	err = psi_trigger_parse(&params, buf);
+	if (err)
+		return err;
 
 	cgrp = cgroup_kn_lock_live(of->kn, false);
 	if (!cgrp)
@@ -3887,7 +3893,13 @@ static ssize_t pressure_write(struct kernfs_open_file *of, char *buf,
 	}
 
 	psi = cgroup_psi(cgrp);
-	new = psi_trigger_create(psi, buf, res, of->file, of);
+
+	params.type = PSI_CGROUP;
+	params.res = res;
+	params.privileged = psi_file_privileged(of->file);
+	params.of = of;
+
+	new = psi_trigger_create(psi, &params);
 	if (IS_ERR(new)) {
 		cgroup_put(cgrp);
 		return PTR_ERR(new);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ad04a5c3162a..e1d8eaeeff17 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -489,7 +489,7 @@ static void update_triggers(struct psi_group *group, u64 now,
 
 		/* Generate an event */
 		if (cmpxchg(&t->event, 0, 1) == 0) {
-			if (t->of)
+			if (t->type == PSI_CGROUP)
 				kernfs_notify(t->of->kn);
 			else
 				wake_up_interruptible(&t->event_wait);
@@ -1281,74 +1281,87 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 	return 0;
 }
 
-struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
-				       enum psi_res res, struct file *file,
-				       struct kernfs_open_file *of)
+int psi_trigger_parse(struct psi_trigger_params *params, const char *buf)
 {
-	struct psi_trigger *t;
-	enum psi_states state;
-	u32 threshold_us;
-	bool privileged;
-	u32 window_us;
+	u32 threshold_us, window_us;
 
 	if (static_branch_likely(&psi_disabled))
-		return ERR_PTR(-EOPNOTSUPP);
-
-	/*
-	 * Checking the privilege here on file->f_cred implies that a privileged user
-	 * could open the file and delegate the write to an unprivileged one.
-	 */
-	privileged = cap_raised(file->f_cred->cap_effective, CAP_SYS_RESOURCE);
+		return -EOPNOTSUPP;
 
 	if (sscanf(buf, "some %u %u", &threshold_us, &window_us) == 2)
-		state = PSI_IO_SOME + res * 2;
+		params->full = false;
 	else if (sscanf(buf, "full %u %u", &threshold_us, &window_us) == 2)
-		state = PSI_IO_FULL + res * 2;
+		params->full = true;
 	else
-		return ERR_PTR(-EINVAL);
+		return -EINVAL;
+
+	params->threshold_us = threshold_us;
+	params->window_us = window_us;
+	return 0;
+}
+
+struct psi_trigger *psi_trigger_create(struct psi_group *group,
+				const struct psi_trigger_params *params)
+{
+	struct psi_trigger *t;
+	enum psi_states state;
+
+	if (static_branch_likely(&psi_disabled))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	state = params->full ? PSI_IO_FULL : PSI_IO_SOME;
+	state += params->res * 2;
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-	if (res == PSI_IRQ && --state != PSI_IRQ_FULL)
+	if (params->res == PSI_IRQ && --state != PSI_IRQ_FULL)
 		return ERR_PTR(-EINVAL);
 #endif
 
 	if (state >= PSI_NONIDLE)
 		return ERR_PTR(-EINVAL);
 
-	if (window_us == 0 || window_us > WINDOW_MAX_US)
+	if (params->window_us == 0 || params->window_us > WINDOW_MAX_US)
 		return ERR_PTR(-EINVAL);
 
 	/*
 	 * Unprivileged users can only use 2s windows so that averages aggregation
 	 * work is used, and no RT threads need to be spawned.
 	 */
-	if (!privileged && window_us % 2000000)
+	if (!params->privileged && params->window_us % 2000000)
 		return ERR_PTR(-EINVAL);
 
 	/* Check threshold */
-	if (threshold_us == 0 || threshold_us > window_us)
+	if (params->threshold_us == 0 || params->threshold_us > params->window_us)
 		return ERR_PTR(-EINVAL);
 
 	t = kmalloc(sizeof(*t), GFP_KERNEL);
 	if (!t)
 		return ERR_PTR(-ENOMEM);
 
+	t->type = params->type;
 	t->group = group;
 	t->state = state;
-	t->threshold = threshold_us * NSEC_PER_USEC;
-	t->win.size = window_us * NSEC_PER_USEC;
+	t->threshold = params->threshold_us * NSEC_PER_USEC;
+	t->win.size = params->window_us * NSEC_PER_USEC;
 	window_reset(&t->win, sched_clock(),
 			group->total[PSI_POLL][t->state], 0);
 
 	t->event = 0;
 	t->last_event_time = 0;
-	t->of = of;
-	if (!of)
+
+	switch (params->type) {
+	case PSI_SYSTEM:
 		init_waitqueue_head(&t->event_wait);
+		break;
+	case PSI_CGROUP:
+		t->of = params->of;
+		break;
+	}
+
 	t->pending_event = false;
-	t->aggregator = privileged ? PSI_POLL : PSI_AVGS;
+	t->aggregator = params->privileged ? PSI_POLL : PSI_AVGS;
 
-	if (privileged) {
+	if (params->privileged) {
 		mutex_lock(&group->rtpoll_trigger_lock);
 
 		if (!rcu_access_pointer(group->rtpoll_task)) {
@@ -1401,7 +1414,7 @@ void psi_trigger_destroy(struct psi_trigger *t)
 	 * being accessed later. Can happen if cgroup is deleted from under a
 	 * polling process.
 	 */
-	if (t->of)
+	if (t->type == PSI_CGROUP)
 		kernfs_notify(t->of->kn);
 	else
 		wake_up_interruptible(&t->event_wait);
@@ -1481,7 +1494,7 @@ __poll_t psi_trigger_poll(void **trigger_ptr,
 	if (!t)
 		return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
 
-	if (t->of)
+	if (t->type == PSI_CGROUP)
 		kernfs_generic_poll(t->of, wait);
 	else
 		poll_wait(file, &t->event_wait, wait);
@@ -1530,6 +1543,8 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
 	size_t buf_size;
 	struct seq_file *seq;
 	struct psi_trigger *new;
+	struct psi_trigger_params params;
+	int err;
 
 	if (static_branch_likely(&psi_disabled))
 		return -EOPNOTSUPP;
@@ -1543,6 +1558,10 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
 
 	buf[buf_size - 1] = '\0';
 
+	err = psi_trigger_parse(&params, buf);
+	if (err)
+		return err;
+
 	seq = file->private_data;
 
 	/* Take seq->lock to protect seq->private from concurrent writes */
@@ -1554,7 +1573,11 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
 		return -EBUSY;
 	}
 
-	new = psi_trigger_create(&psi_system, buf, res, file, NULL);
+	params.type = PSI_SYSTEM;
+	params.res = res;
+	params.privileged = psi_file_privileged(file);
+
+	new = psi_trigger_create(&psi_system, &params);
 	if (IS_ERR(new)) {
 		mutex_unlock(&seq->lock);
 		return PTR_ERR(new);
-- 
2.50.1

From nobody Sat Oct 4 09:40:55 2025
From: Roman Gushchin
To: linux-mm@kvack.org, bpf@vger.kernel.org
Cc: Suren Baghdasaryan, Johannes Weiner, Michal Hocko, David Rientjes,
	Matt Bobrowski, Song Liu, Kumar Kartikeya Dwivedi, Alexei Starovoitov,
	Andrew Morton, linux-kernel@vger.kernel.org, Roman Gushchin
Subject: [PATCH v1 12/14] sched: psi: implement psi trigger handling using bpf
Date: Mon, 18 Aug 2025 10:01:34 -0700
Message-ID: <20250818170136.209169-13-roman.gushchin@linux.dev>
In-Reply-To: <20250818170136.209169-1-roman.gushchin@linux.dev>
References: <20250818170136.209169-1-roman.gushchin@linux.dev>

This patch implements a bpf struct ops-based mechanism to create
psi triggers, attach them to cgroups or system-wide and handle
psi events in bpf.

The struct ops provides 3 callbacks:
  - init() called once at load, handy for creating psi triggers
  - handle_psi_event() called every time a psi trigger fires
  - handle_cgroup_free() called if a cgroup with an attached
    trigger is being freed

A single struct ops can create a number of psi triggers, both
cgroup-scoped and system-wide.

All 3 struct ops callbacks can be sleepable. handle_psi_event()
handlers are executed using a separate workqueue, so they won't
affect the latency of other psi triggers.
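
As a rough illustration of the callback contract described above, here is
a minimal userspace sketch in plain C. It is not kernel or BPF code: every
name in it (model_psi_ops, model_register(), model_fire(),
model_cgroup_free(), model_demo()) is hypothetical and exists only for
this example, and it deliberately leaves out the workqueue, SRCU and
locking that the patch uses. It only shows the ordering of the three
callbacks: init() once at registration, handle_psi_event() per trigger
fire, handle_cgroup_free() when a cgroup goes away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the three-callback ops table. */
struct model_psi_ops {
	int (*init)(void *ctx);                              /* once, at registration */
	void (*handle_psi_event)(void *ctx, int trigger_id); /* on every trigger fire */
	void (*handle_cgroup_free)(void *ctx, uint64_t id);  /* cgroup being freed */
};

/* Bookkeeping used by the example callbacks below. */
struct model_counters {
	int inits;
	int events;
	int frees;
	uint64_t last_freed;
};

static int count_init(void *ctx)
{
	((struct model_counters *)ctx)->inits++;
	return 0; /* a non-zero value would mean initialization failed */
}

static void count_event(void *ctx, int trigger_id)
{
	(void)trigger_id;
	((struct model_counters *)ctx)->events++;
}

static void count_free(void *ctx, uint64_t id)
{
	struct model_counters *c = ctx;

	c->frees++;
	c->last_freed = id;
}

/*
 * Registration runs init() exactly once and propagates its result.
 * Treating a missing init() as success is a choice of this sketch.
 */
int model_register(const struct model_psi_ops *ops, void *ctx)
{
	return ops->init ? ops->init(ctx) : 0;
}

/* A trigger fired: dispatch to the optional event handler. */
void model_fire(const struct model_psi_ops *ops, void *ctx, int trigger_id)
{
	if (ops->handle_psi_event)
		ops->handle_psi_event(ctx, trigger_id);
}

/* A cgroup with an attached trigger is going away. */
void model_cgroup_free(const struct model_psi_ops *ops, void *ctx, uint64_t id)
{
	if (ops->handle_cgroup_free)
		ops->handle_cgroup_free(ctx, id);
}

/* Drive the model: one registration, two events, one cgroup free. */
struct model_counters model_demo(void)
{
	struct model_counters c = {0};
	const struct model_psi_ops ops = {
		.init = count_init,
		.handle_psi_event = count_event,
		.handle_cgroup_free = count_free,
	};
	int err = model_register(&ops, &c);

	assert(err == 0);
	(void)err;
	model_fire(&ops, &c, 1);
	model_fire(&ops, &c, 2);
	model_cgroup_free(&ops, &c, 42);
	return c;
}
```

Calling model_demo() from a main() and inspecting the returned counters
is enough to exercise the sketch. In the real patch the event handler is
queued on a workqueue rather than called directly.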

Signed-off-by: Roman Gushchin
---
 include/linux/bpf_psi.h      |  71 ++++++++++
 include/linux/psi_types.h    |  43 +++++-
 kernel/sched/bpf_psi.c       | 253 +++++++++++++++++++++++++++++++++++
 kernel/sched/build_utility.c |   4 +
 kernel/sched/psi.c           |  49 +++++--
 5 files changed, 408 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/bpf_psi.h
 create mode 100644 kernel/sched/bpf_psi.c

diff --git a/include/linux/bpf_psi.h b/include/linux/bpf_psi.h
new file mode 100644
index 000000000000..826ab89ac11c
--- /dev/null
+++ b/include/linux/bpf_psi.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef __BPF_PSI_H
+#define __BPF_PSI_H
+
+#include
+#include
+#include
+#include
+
+struct cgroup;
+struct bpf_psi;
+struct psi_trigger;
+struct psi_trigger_params;
+
+#define BPF_PSI_FULL 0x80000000
+
+struct bpf_psi_ops {
+	/**
+	 * @init: Initialization callback, suited for creating psi triggers.
+	 * @bpf_psi: bpf_psi pointer, can be passed to bpf_psi_create_trigger().
+	 *
+	 * A non-0 return value means the initialization has failed.
+	 */
+	int (*init)(struct bpf_psi *bpf_psi);
+
+	/**
+	 * @handle_psi_event: PSI event callback
+	 * @t: psi_trigger pointer
+	 */
+	void (*handle_psi_event)(struct psi_trigger *t);
+
+	/**
+	 * @handle_cgroup_free: Cgroup free callback
+	 * @cgroup_id: Id of freed cgroup
+	 *
+	 * Called every time a cgroup with an attached bpf psi trigger is freed.
+	 * No psi events can be raised after handle_cgroup_free().
+	 */
+	void (*handle_cgroup_free)(u64 cgroup_id);
+
+	/* private */
+	struct bpf_psi *bpf_psi;
+};
+
+struct bpf_psi {
+	spinlock_t lock;
+	struct list_head triggers;
+	struct bpf_psi_ops *ops;
+	struct srcu_struct srcu;
+};
+
+#ifdef CONFIG_BPF_SYSCALL
+void bpf_psi_add_trigger(struct psi_trigger *t,
+			 const struct psi_trigger_params *params);
+void bpf_psi_remove_trigger(struct psi_trigger *t);
+void bpf_psi_handle_event(struct psi_trigger *t);
+#ifdef CONFIG_CGROUPS
+void bpf_psi_cgroup_free(struct cgroup *cgroup);
+#endif
+
+#else /* CONFIG_BPF_SYSCALL */
+static inline void bpf_psi_add_trigger(struct psi_trigger *t,
+			const struct psi_trigger_params *params) {}
+static inline void bpf_psi_remove_trigger(struct psi_trigger *t) {}
+static inline void bpf_psi_handle_event(struct psi_trigger *t) {}
+static inline void bpf_psi_cgroup_free(struct cgroup *cgroup) {}
+
+#endif /* CONFIG_BPF_SYSCALL */
+
+#endif /* __BPF_PSI_H */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index cea54121d9b9..f695cc34cfd4 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -124,6 +124,7 @@ struct psi_window {
 enum psi_trigger_type {
 	PSI_SYSTEM,
 	PSI_CGROUP,
+	PSI_BPF,
 };
 
 struct psi_trigger_params {
@@ -145,8 +146,15 @@ struct psi_trigger_params {
 	/* Privileged triggers are treated differently */
 	bool privileged;
 
-	/* Link to kernfs open file, only for PSI_CGROUP */
-	struct kernfs_open_file *of;
+	union {
+		/* Link to kernfs open file, only for PSI_CGROUP */
+		struct kernfs_open_file *of;
+
+#ifdef CONFIG_BPF_SYSCALL
+		/* Link to bpf_psi structure, only for PSI_BPF */
+		struct bpf_psi *bpf_psi;
+#endif
+	};
 };
 
 struct psi_trigger {
@@ -188,6 +196,31 @@ struct psi_trigger {
 
 	/* Trigger type - PSI_AVGS for unprivileged, PSI_POLL for RT */
 	enum psi_aggregators aggregator;
+
+#ifdef CONFIG_BPF_SYSCALL
+	/* Fields specific to PSI_BPF triggers */
+
+	/* Bpf psi structure for events handling */
+	struct bpf_psi *bpf_psi;
+
+	/* List node inside bpf_psi->triggers list */
+	struct list_head bpf_psi_node;
+
+	/* List node inside group->bpf_triggers list */
+	struct list_head bpf_group_node;
+
+	/* Work structure, used to execute event handlers */
+	struct work_struct bpf_work;
+
+	/*
+	 * Whether the trigger is being pinned in memory.
+	 * Protected by group->bpf_triggers_lock.
+	 */
+	bool pinned;
+
+	/* Cgroup Id */
+	u64 cgroup_id;
+#endif
 };
 
 struct psi_group {
@@ -236,6 +269,12 @@ struct psi_group {
 	u64 rtpoll_total[NR_PSI_STATES - 1];
 	u64 rtpoll_next_update;
 	u64 rtpoll_until;
+
+#ifdef CONFIG_BPF_SYSCALL
+	/* List of triggers owned by bpf and corresponding lock */
+	spinlock_t bpf_triggers_lock;
+	struct list_head bpf_triggers;
+#endif
 };
 
 #else /* CONFIG_PSI */
diff --git a/kernel/sched/bpf_psi.c b/kernel/sched/bpf_psi.c
new file mode 100644
index 000000000000..2ea9d7276b21
--- /dev/null
+++ b/kernel/sched/bpf_psi.c
@@ -0,0 +1,253 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * BPF PSI event handlers
+ *
+ * Author: Roman Gushchin
+ */
+
+#include
+#include
+
+static struct workqueue_struct *bpf_psi_wq;
+
+static struct bpf_psi *bpf_psi_create(struct bpf_psi_ops *ops)
+{
+	struct bpf_psi *bpf_psi;
+
+	bpf_psi = kzalloc(sizeof(*bpf_psi), GFP_KERNEL);
+	if (!bpf_psi)
+		return NULL;
+
+	if (init_srcu_struct(&bpf_psi->srcu)) {
+		kfree(bpf_psi);
+		return NULL;
+	}
+
+	spin_lock_init(&bpf_psi->lock);
+	bpf_psi->ops = ops;
+	INIT_LIST_HEAD(&bpf_psi->triggers);
+	ops->bpf_psi = bpf_psi;
+
+	return bpf_psi;
+}
+
+static void bpf_psi_free(struct bpf_psi *bpf_psi)
+{
+	cleanup_srcu_struct(&bpf_psi->srcu);
+	kfree(bpf_psi);
+}
+
+static void bpf_psi_handle_event_fn(struct work_struct *work)
+{
+	struct psi_trigger *t;
+	struct bpf_psi *bpf_psi;
+	int idx;
+
+	t = container_of(work, struct psi_trigger, bpf_work);
+	bpf_psi = READ_ONCE(t->bpf_psi);
+
+	if (likely(bpf_psi)) {
+		idx = srcu_read_lock(&bpf_psi->srcu);
+		if (bpf_psi->ops->handle_psi_event)
+			bpf_psi->ops->handle_psi_event(t);
+		srcu_read_unlock(&bpf_psi->srcu, idx);
+	}
+}
+
+void bpf_psi_add_trigger(struct psi_trigger *t,
+			 const struct psi_trigger_params *params)
+{
+	t->bpf_psi = params->bpf_psi;
+	t->pinned = false;
+	INIT_WORK(&t->bpf_work, bpf_psi_handle_event_fn);
+
+	spin_lock(&t->bpf_psi->lock);
+	list_add(&t->bpf_psi_node, &t->bpf_psi->triggers);
+	spin_unlock(&t->bpf_psi->lock);
+
+	spin_lock(&t->group->bpf_triggers_lock);
+	list_add(&t->bpf_group_node, &t->group->bpf_triggers);
+	spin_unlock(&t->group->bpf_triggers_lock);
+}
+
+void bpf_psi_remove_trigger(struct psi_trigger *t)
+{
+	spin_lock(&t->group->bpf_triggers_lock);
+	list_del(&t->bpf_group_node);
+	spin_unlock(&t->group->bpf_triggers_lock);
+
+	spin_lock(&t->bpf_psi->lock);
+	list_del(&t->bpf_psi_node);
+	spin_unlock(&t->bpf_psi->lock);
+}
+
+#ifdef CONFIG_CGROUPS
+void bpf_psi_cgroup_free(struct cgroup *cgroup)
+{
+	struct psi_group *group = cgroup->psi;
+	u64 cgrp_id = cgroup_id(cgroup);
+	struct psi_trigger *t, *p;
+	struct bpf_psi *bpf_psi;
+	LIST_HEAD(to_destroy);
+	int idx;
+
+	spin_lock(&group->bpf_triggers_lock);
+	list_for_each_entry_safe(t, p, &group->bpf_triggers, bpf_group_node) {
+		if (!t->pinned) {
+			t->pinned = true;
+			list_move(&t->bpf_group_node, &to_destroy);
+		}
+	}
+	spin_unlock(&group->bpf_triggers_lock);
+
+	list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node) {
+		bpf_psi = READ_ONCE(t->bpf_psi);
+
+		idx = srcu_read_lock(&bpf_psi->srcu);
+		if (bpf_psi->ops->handle_cgroup_free)
+			bpf_psi->ops->handle_cgroup_free(cgrp_id);
+		srcu_read_unlock(&bpf_psi->srcu, idx);
+
+		spin_lock(&bpf_psi->lock);
+		list_del(&t->bpf_psi_node);
+		spin_unlock(&bpf_psi->lock);
+
+		WRITE_ONCE(t->bpf_psi, NULL);
+		flush_workqueue(bpf_psi_wq);
+		synchronize_srcu(&bpf_psi->srcu);
+		psi_trigger_destroy(t);
+	}
+}
+#endif
+
+void bpf_psi_handle_event(struct psi_trigger *t)
+{
+	queue_work(bpf_psi_wq, &t->bpf_work);
+}
+
+// bpf struct ops
+
+static int __bpf_psi_init(struct bpf_psi *bpf_psi) { return 0; }
+static void __bpf_psi_handle_psi_event(struct psi_trigger *t) {}
+static void __bpf_psi_handle_cgroup_free(u64 cgroup_id) {}
+
+static struct bpf_psi_ops __bpf_psi_ops = {
+	.init = __bpf_psi_init,
+	.handle_psi_event = __bpf_psi_handle_psi_event,
+	.handle_cgroup_free = __bpf_psi_handle_cgroup_free,
+};
+
+static const struct bpf_func_proto *
+bpf_psi_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return tracing_prog_func_proto(func_id, prog);
+}
+
+static bool bpf_psi_ops_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_verifier_ops bpf_psi_verifier_ops = {
+	.get_func_proto = bpf_psi_func_proto,
+	.is_valid_access = bpf_psi_ops_is_valid_access,
+};
+
+static int bpf_psi_ops_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_psi_ops *ops = kdata;
+	struct bpf_psi *bpf_psi;
+
+	bpf_psi = bpf_psi_create(ops);
+	if (!bpf_psi)
+		return -ENOMEM;
+
+	return ops->init(bpf_psi);
+}
+
+static void bpf_psi_ops_unreg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_psi_ops *ops = kdata;
+	struct bpf_psi *bpf_psi = ops->bpf_psi;
+	struct psi_trigger *t, *p;
+	LIST_HEAD(to_destroy);
+
+	spin_lock(&bpf_psi->lock);
+	list_for_each_entry_safe(t, p, &bpf_psi->triggers, bpf_psi_node) {
+		spin_lock(&t->group->bpf_triggers_lock);
+		if (!t->pinned) {
+			t->pinned = true;
+			list_move(&t->bpf_group_node, &to_destroy);
+			list_del(&t->bpf_psi_node);
+
+			WRITE_ONCE(t->bpf_psi, NULL);
+		}
+		spin_unlock(&t->group->bpf_triggers_lock);
+	}
+	spin_unlock(&bpf_psi->lock);
+
+	flush_workqueue(bpf_psi_wq);
+	synchronize_srcu(&bpf_psi->srcu);
+
+	list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node)
+		psi_trigger_destroy(t);
+
+	bpf_psi_free(bpf_psi);
+}
+
+static int bpf_psi_ops_check_member(const struct
btf_type *t, + const struct btf_member *member, + const struct bpf_prog *prog) +{ + return 0; +} + +static int bpf_psi_ops_init_member(const struct btf_type *t, + const struct btf_member *member, + void *kdata, const void *udata) +{ + return 0; +} + +static int bpf_psi_ops_init(struct btf *btf) +{ + return 0; +} + +static struct bpf_struct_ops bpf_psi_bpf_ops =3D { + .verifier_ops =3D &bpf_psi_verifier_ops, + .reg =3D bpf_psi_ops_reg, + .unreg =3D bpf_psi_ops_unreg, + .check_member =3D bpf_psi_ops_check_member, + .init_member =3D bpf_psi_ops_init_member, + .init =3D bpf_psi_ops_init, + .name =3D "bpf_psi_ops", + .owner =3D THIS_MODULE, + .cfi_stubs =3D &__bpf_psi_ops +}; + +static int __init bpf_psi_struct_ops_init(void) +{ + int wq_flags =3D WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_HIGHPRI; + int err; + + bpf_psi_wq =3D alloc_workqueue("bpf_psi_wq", wq_flags, 0); + if (!bpf_psi_wq) + return -ENOMEM; + + err =3D register_bpf_struct_ops(&bpf_psi_bpf_ops, bpf_psi_ops); + if (err) { + pr_warn("error while registering bpf psi struct ops: %d", err); + goto err; + } + + return 0; + +err: + destroy_workqueue(bpf_psi_wq); + return err; +} +late_initcall(bpf_psi_struct_ops_init); diff --git a/kernel/sched/build_utility.c b/kernel/sched/build_utility.c index bf9d8db94b70..80f3799a2fa6 100644 --- a/kernel/sched/build_utility.c +++ b/kernel/sched/build_utility.c @@ -19,6 +19,7 @@ #include #include =20 +#include #include #include #include @@ -92,6 +93,9 @@ =20 #ifdef CONFIG_PSI # include "psi.c" +# ifdef CONFIG_BPF_SYSCALL +# include "bpf_psi.c" +# endif #endif =20 #ifdef CONFIG_MEMBARRIER diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index e1d8eaeeff17..e10fbbc34099 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -201,6 +201,10 @@ static void group_init(struct psi_group *group) init_waitqueue_head(&group->rtpoll_wait); timer_setup(&group->rtpoll_timer, poll_timer_fn, 0); rcu_assign_pointer(group->rtpoll_task, NULL); +#ifdef CONFIG_BPF_SYSCALL + 
spin_lock_init(&group->bpf_triggers_lock); + INIT_LIST_HEAD(&group->bpf_triggers); +#endif } =20 void __init psi_init(void) @@ -489,10 +493,17 @@ static void update_triggers(struct psi_group *group, = u64 now, =20 /* Generate an event */ if (cmpxchg(&t->event, 0, 1) =3D=3D 0) { - if (t->type =3D=3D PSI_CGROUP) - kernfs_notify(t->of->kn); - else + switch (t->type) { + case PSI_SYSTEM: wake_up_interruptible(&t->event_wait); + break; + case PSI_CGROUP: + kernfs_notify(t->of->kn); + break; + case PSI_BPF: + bpf_psi_handle_event(t); + break; + } } t->last_event_time =3D now; /* Reset threshold breach flag once event got generated */ @@ -1125,6 +1136,7 @@ void psi_cgroup_free(struct cgroup *cgroup) return; =20 cancel_delayed_work_sync(&cgroup->psi->avgs_work); + bpf_psi_cgroup_free(cgroup); free_percpu(cgroup->psi->pcpu); /* All triggers must be removed by now */ WARN_ONCE(cgroup->psi->rtpoll_states, "psi: trigger leak\n"); @@ -1356,6 +1368,9 @@ struct psi_trigger *psi_trigger_create(struct psi_gro= up *group, case PSI_CGROUP: t->of =3D params->of; break; + case PSI_BPF: + bpf_psi_add_trigger(t, params); + break; } =20 t->pending_event =3D false; @@ -1369,8 +1384,10 @@ struct psi_trigger *psi_trigger_create(struct psi_gr= oup *group, =20 task =3D kthread_create(psi_rtpoll_worker, group, "psimon"); if (IS_ERR(task)) { - kfree(t); mutex_unlock(&group->rtpoll_trigger_lock); + if (t->type =3D=3D PSI_BPF) + bpf_psi_remove_trigger(t); + kfree(t); return ERR_CAST(task); } atomic_set(&group->rtpoll_wakeup, 0); @@ -1414,10 +1431,16 @@ void psi_trigger_destroy(struct psi_trigger *t) * being accessed later. Can happen if cgroup is deleted from under a * polling process. 
*/ - if (t->type =3D=3D PSI_CGROUP) - kernfs_notify(t->of->kn); - else + switch (t->type) { + case PSI_SYSTEM: wake_up_interruptible(&t->event_wait); + break; + case PSI_CGROUP: + kernfs_notify(t->of->kn); + break; + case PSI_BPF: + break; + } =20 if (t->aggregator =3D=3D PSI_AVGS) { mutex_lock(&group->avgs_lock); @@ -1494,10 +1517,16 @@ __poll_t psi_trigger_poll(void **trigger_ptr, if (!t) return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI; =20 - if (t->type =3D=3D PSI_CGROUP) - kernfs_generic_poll(t->of, wait); - else + switch (t->type) { + case PSI_SYSTEM: poll_wait(file, &t->event_wait, wait); + break; + case PSI_CGROUP: + kernfs_generic_poll(t->of, wait); + break; + case PSI_BPF: + break; + } =20 if (cmpxchg(&t->event, 1, 0) =3D=3D 1) ret |=3D EPOLLPRI; --=20 2.50.1 From nobody Sat Oct 4 09:40:55 2025 Received: from out-183.mta1.migadu.com (out-183.mta1.migadu.com [95.215.58.183]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B3F9528488F for ; Mon, 18 Aug 2025 17:02:38 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.183 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536560; cv=none; b=k6SC7uiFTEPkDOm+QLouyRrdlebzKyLPd8QwXAWowf67Y+9oygZ3wj7iEIsKzOwnP59rgGQCAUoG+irNhNqUVRg8YV8ZwGNqvSR7t24UXA9fDEqNunOF1TlSOHtVJKk/7VMERADq8hQejKv9SutenAdBEo8B7tFRxyY1hBTuTN0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536560; c=relaxed/simple; bh=vf2FgXUFW1hJGq8BrB8SWUGv/LfqjjuO0IlAyzrPQ0M=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=iCbJ1BvnM2FmZ9+t5BIHkp6wnQZbiQXoIvCo7ugkR9vBJfQ18XVoujI38voFEgWP+n7Dt34TKluqkeH/Km8a6b5qA3Gt6wU+Wk28GBbr8s5Meh2WJ52+W+4V/XssbcMve8eUcGUQY142RxIzAc8cz16AjFmy8mAMiHdZfnTW8Aw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass 
smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=bRVkX03N; arc=none smtp.client-ip=95.215.58.183 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="bRVkX03N" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1755536556; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=fR8USxErG1Ul1WpSShNEk/0fldDsW4tkEGeWyFTA9dE=; b=bRVkX03N8WwJPSb+ZZILgpbe4ubu++Z3SsDoJNbAOFhyp8e3kblkz8zFcSlnhQn1rguMK6 Eq1DWH0U3jgeeCrmWHUvCjqQ4AqNEj0rXOn9z2e1USrwOrOMNKVohUmpz34K3IVVLwVlm5 Qx0QCG6q1nKf6PJJUl8PNSFmr9hYeBw= From: Roman Gushchin To: linux-mm@kvack.org, bpf@vger.kernel.org Cc: Suren Baghdasaryan , Johannes Weiner , Michal Hocko , David Rientjes , Matt Bobrowski , Song Liu , Kumar Kartikeya Dwivedi , Alexei Starovoitov , Andrew Morton , linux-kernel@vger.kernel.org, Roman Gushchin Subject: [PATCH v1 13/14] sched: psi: implement bpf_psi_create_trigger() kfunc Date: Mon, 18 Aug 2025 10:01:35 -0700 Message-ID: <20250818170136.209169-14-roman.gushchin@linux.dev> In-Reply-To: <20250818170136.209169-1-roman.gushchin@linux.dev> References: <20250818170136.209169-1-roman.gushchin@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Implement a new bpf_psi_create_trigger() bpf kfunc, which makes it possible to create new psi triggers and
attach them to cgroups or make them system-wide. Created triggers exist for as long as the struct ops is loaded and, if they are attached to a cgroup, for as long as that cgroup exists. Due to the limit of 5 arguments, the resource type and the "full" bit are squeezed into a single u32. Signed-off-by: Roman Gushchin --- kernel/sched/bpf_psi.c | 84 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 84 insertions(+) diff --git a/kernel/sched/bpf_psi.c b/kernel/sched/bpf_psi.c index 2ea9d7276b21..94b684221708 100644 --- a/kernel/sched/bpf_psi.c +++ b/kernel/sched/bpf_psi.c @@ -156,6 +156,83 @@ static const struct bpf_verifier_ops bpf_psi_verifier_= ops =3D { .is_valid_access =3D bpf_psi_ops_is_valid_access, }; =20 +__bpf_kfunc_start_defs(); + +/** + * bpf_psi_create_trigger - Create a PSI trigger + * @bpf_psi: bpf_psi struct to attach the trigger to + * @cgroup_id: cgroup Id to attach the trigger to; 0 for system-wide scope + * @resource: resource to monitor (PSI_MEM, PSI_IO, etc) and the full bit. + * @threshold_us: threshold in us + * @window_us: window in us + * + * Creates a PSI trigger and attaches it to bpf_psi. The trigger stays + * active until the bpf struct ops is unloaded or the corresponding cgroup + * is deleted. + * + * The most significant bit of @resource encodes whether the "some" or + * "full" PSI state should be tracked. + * + * Returns 0 on success and a negative error code on failure.
+ */ +__bpf_kfunc int bpf_psi_create_trigger(struct bpf_psi *bpf_psi, + u64 cgroup_id, u32 resource, + u32 threshold_us, u32 window_us) +{ + enum psi_res res =3D resource & ~BPF_PSI_FULL; + bool full =3D resource & BPF_PSI_FULL; + struct psi_trigger_params params; + struct cgroup *cgroup __maybe_unused =3D NULL; + struct psi_group *group; + struct psi_trigger *t; + int ret =3D 0; + + if (res >=3D NR_PSI_RESOURCES) + return -EINVAL; + +#ifdef CONFIG_CGROUPS + if (cgroup_id) { + cgroup =3D cgroup_get_from_id(cgroup_id); + if (IS_ERR(cgroup)) + return PTR_ERR(cgroup); + + group =3D cgroup_psi(cgroup); + } else +#endif + group =3D &psi_system; + + params.type =3D PSI_BPF; + params.bpf_psi =3D bpf_psi; + params.privileged =3D capable(CAP_SYS_RESOURCE); + params.res =3D res; + params.full =3D full; + params.threshold_us =3D threshold_us; + params.window_us =3D window_us; + + t =3D psi_trigger_create(group, &params); + if (IS_ERR(t)) + ret =3D PTR_ERR(t); + else + t->cgroup_id =3D cgroup_id; + +#ifdef CONFIG_CGROUPS + if (cgroup) + cgroup_put(cgroup); +#endif + + return ret; +} +__bpf_kfunc_end_defs(); + +BTF_KFUNCS_START(bpf_psi_kfuncs) +BTF_ID_FLAGS(func, bpf_psi_create_trigger, KF_TRUSTED_ARGS) +BTF_KFUNCS_END(bpf_psi_kfuncs) + +static const struct btf_kfunc_id_set bpf_psi_kfunc_set =3D { + .owner =3D THIS_MODULE, + .set =3D &bpf_psi_kfuncs, +}; + static int bpf_psi_ops_reg(void *kdata, struct bpf_link *link) { struct bpf_psi_ops *ops =3D kdata; @@ -238,6 +315,13 @@ static int __init bpf_psi_struct_ops_init(void) if (!bpf_psi_wq) return -ENOMEM; =20 + err =3D register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS, + &bpf_psi_kfunc_set); + if (err) { + pr_warn("error while registering bpf psi kfuncs: %d", err); + goto err; + } + err =3D register_bpf_struct_ops(&bpf_psi_bpf_ops, bpf_psi_ops); if (err) { pr_warn("error while registering bpf psi struct ops: %d", err); --=20 2.50.1 From nobody Sat Oct 4 09:40:55 2025 Received: from out-173.mta1.migadu.com
(out-173.mta1.migadu.com [95.215.58.173]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3F9E2341AC3; Mon, 18 Aug 2025 17:02:42 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=95.215.58.173 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536564; cv=none; b=GMc1d3lPGJ5ldRO/+O0gysTfQv5YXhuhi5m2TaXbPJWkbM7wItmo5ZZjkdtU9I9AGRlegOOEGulR5SGKXrjrGMOCFdoDrDU0TzHnZxcFx7OEefs12Y1vzwbKpOFFoEOM7LpCmpA4Yn3CQEywQAgn7DpS65638SHSUoLNcl95cQI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1755536564; c=relaxed/simple; bh=RToAh4b/e30wrE6mNGIluyf8QImkEZphYBn0FkizgYs=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=Of8jQsTXHVaFRV7hUbUmIUCej9SbKVTTXo56rqlvo206zuDkjfSLySI98gQ/7WMMhv5Wa3pOZA6iaSMrc1Wo4I2CbrgyByJDc78lujOOo+dYubGG3v2jTBFUfGoxOZACY9kbYB4DAdPCtaBUpKrawnofSmmXUg9WwTKgP6S91bg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev; spf=pass smtp.mailfrom=linux.dev; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b=G6Jd/6cP; arc=none smtp.client-ip=95.215.58.173 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.dev Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.dev Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.dev header.i=@linux.dev header.b="G6Jd/6cP" X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. 
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1755536560; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=zWT242NgNZP3jc1GQRw8o49/i2gTFgutb9b4mEZjMpA=; b=G6Jd/6cPEKn36h3z3JKcou8qMKVTIQVTjk1XJJAqMSHviLXKnseiiF4s+ycNlEVF4gmEs8 vrd0Hjv0FBdhFuXJqncBC5vwr4nz7aXsRXeFQlIuDqRnz3Mj1mmg9zCQ3L7TRAc1Uh9ikm xEoVctnSxZw0eWXBb+JWHvbr37zDzs8= From: Roman Gushchin To: linux-mm@kvack.org, bpf@vger.kernel.org Cc: Suren Baghdasaryan , Johannes Weiner , Michal Hocko , David Rientjes , Matt Bobrowski , Song Liu , Kumar Kartikeya Dwivedi , Alexei Starovoitov , Andrew Morton , linux-kernel@vger.kernel.org, Roman Gushchin Subject: [PATCH v1 14/14] bpf: selftests: psi struct ops test Date: Mon, 18 Aug 2025 10:01:36 -0700 Message-ID: <20250818170136.209169-15-roman.gushchin@linux.dev> In-Reply-To: <20250818170136.209169-1-roman.gushchin@linux.dev> References: <20250818170136.209169-1-roman.gushchin@linux.dev> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable X-Migadu-Flow: FLOW_OUT Content-Type: text/plain; charset="utf-8" Add a psi struct ops test. The test creates a cgroup with two child sub-cgroups, sets memory.high for one of them and places a memory-hungry process (initially frozen) in it. Then it creates two psi triggers from within the init() bpf callback and attaches them to these cgroups. Next it deletes the first cgroup and unfreezes the memory-hungry task. The task creates high memory pressure, which triggers the psi event. The psi bpf handler declares a memcg oom in the corresponding cgroup. Finally, the test checks that both the handle_cgroup_free() and handle_psi_event() handlers were executed, that the correct process was killed and that the oom counters were updated.
Signed-off-by: Roman Gushchin --- .../selftests/bpf/prog_tests/test_psi.c | 224 ++++++++++++++++++ tools/testing/selftests/bpf/progs/test_psi.c | 76 ++++++ 2 files changed, 300 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/test_psi.c create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c diff --git a/tools/testing/selftests/bpf/prog_tests/test_psi.c b/tools/test= ing/selftests/bpf/prog_tests/test_psi.c new file mode 100644 index 000000000000..4f3c91bd6606 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/test_psi.c @@ -0,0 +1,224 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include +#include +#include + +#include "cgroup_helpers.h" +#include "test_psi.skel.h" + +enum psi_res { + PSI_IO, + PSI_MEM, + PSI_CPU, + PSI_IRQ, + NR_PSI_RESOURCES, +}; + +struct cgroup_desc { + const char *path; + unsigned long long id; + int pid; + int fd; + size_t target; + size_t high; + bool victim; +}; + +#define MB (1024 * 1024) + +static struct cgroup_desc cgroups[] =3D { + { .path =3D "/oom_test" }, + { .path =3D "/oom_test/cg1" }, + { .path =3D "/oom_test/cg2", .target =3D 500 * MB, + .high =3D 40 * MB, .victim =3D true }, +}; + +static int spawn_task(struct cgroup_desc *desc) +{ + char *ptr; + int pid; + + pid =3D fork(); + if (pid < 0) + return pid; + + if (pid > 0) { + /* parent */ + desc->pid =3D pid; + return 0; + } + + /* child */ + ptr =3D (char *)malloc(desc->target); + if (!ptr) + return -ENOMEM; + + memset(ptr, 'a', desc->target); + + while (1) + sleep(1000); + + return 0; +} + +static void setup_environment(void) +{ + int i, err; + + err =3D setup_cgroup_environment(); + if (!ASSERT_OK(err, "setup_cgroup_environment")) + goto cleanup; + + for (i =3D 0; i < ARRAY_SIZE(cgroups); i++) { + cgroups[i].fd =3D create_and_get_cgroup(cgroups[i].path); + if (!ASSERT_GE(cgroups[i].fd, 0, "create_and_get_cgroup")) + goto cleanup; + + cgroups[i].id =3D get_cgroup_id(cgroups[i].path); + if (!ASSERT_GT(cgroups[i].id, 0, "get_cgroup_id")) + 
goto cleanup; + + /* Freeze the top-level cgroup and enable the memory controller */ + if (i =3D=3D 0) { + err =3D write_cgroup_file(cgroups[i].path, "cgroup.freeze", "1"); + if (!ASSERT_OK(err, "freeze cgroup")) + goto cleanup; + + err =3D write_cgroup_file(cgroups[i].path, "cgroup.subtree_control", + "+memory"); + if (!ASSERT_OK(err, "enable memory controller")) + goto cleanup; + } + + /* Set memory.high */ + if (cgroups[i].high) { + char buf[256]; + + snprintf(buf, sizeof(buf), "%zu", cgroups[i].high); + err =3D write_cgroup_file(cgroups[i].path, "memory.high", buf); + if (!ASSERT_OK(err, "set memory.high")) + goto cleanup; + + snprintf(buf, sizeof(buf), "0"); + write_cgroup_file(cgroups[i].path, "memory.swap.max", buf); + } + + /* Spawn tasks creating memory pressure */ + if (cgroups[i].target) { + char buf[256]; + + err =3D spawn_task(&cgroups[i]); + if (!ASSERT_OK(err, "spawn task")) + goto cleanup; + + snprintf(buf, sizeof(buf), "%d", cgroups[i].pid); + err =3D write_cgroup_file(cgroups[i].path, "cgroup.procs", buf); + if (!ASSERT_OK(err, "put child into a cgroup")) + goto cleanup; + } + } + + return; + +cleanup: + cleanup_cgroup_environment(); +} + +static int run_and_wait_for_oom(void) +{ + int ret =3D -1; + bool first =3D true; + char buf[4096] =3D {}; + ssize_t size; + + /* Unfreeze the top-level cgroup */ + ret =3D write_cgroup_file(cgroups[0].path, "cgroup.freeze", "0"); + if (!ASSERT_OK(ret, "unfreeze cgroup")) + return -1; + + for (;;) { + int i, status; + pid_t pid =3D wait(&status); + + if (pid =3D=3D -1) { + if (errno =3D=3D EINTR) + continue; + /* ECHILD */ + break; + } + + if (!first) + continue; + first =3D false; + + /* Check which process was terminated first */ + for (i =3D 0; i < ARRAY_SIZE(cgroups); i++) { + if (!ASSERT_OK(cgroups[i].victim !=3D + (pid =3D=3D cgroups[i].pid), + "correct process was killed")) { + ret =3D -1; + break; + } + + if (!cgroups[i].victim) + continue; + + /* Check the memcg oom counter */ + size =3D
read_cgroup_file(cgroups[i].path, "memory.events", + buf, sizeof(buf)); + if (!ASSERT_OK(size <=3D 0, "read memory.events")) { + ret =3D -1; + break; + } + + if (!ASSERT_OK(strstr(buf, "oom_kill 1") =3D=3D NULL, + "oom_kill count check")) { + ret =3D -1; + break; + } + } + + /* Kill all remaining tasks */ + for (i =3D 0; i < ARRAY_SIZE(cgroups); i++) + if (cgroups[i].pid && cgroups[i].pid !=3D pid) + kill(cgroups[i].pid, SIGKILL); + } + + return ret; +} + +void test_psi(void) +{ + struct test_psi *skel; + u64 freed_cgroup_id; + int err; + + setup_environment(); + + skel =3D test_psi__open_and_load(); + err =3D libbpf_get_error(skel); + if (CHECK_FAIL(err)) + goto cleanup; + + skel->bss->deleted_cgroup_id =3D cgroups[1].id; + skel->bss->high_pressure_cgroup_id =3D cgroups[2].id; + + err =3D test_psi__attach(skel); + if (CHECK_FAIL(err)) + goto cleanup; + + /* Delete the first cgroup, it should trigger handle_cgroup_free() */ + remove_cgroup(cgroups[1].path); + + /* Unfreeze all child tasks and create the memory pressure */ + err =3D run_and_wait_for_oom(); + CHECK_FAIL(err); + + /* Check the result of the handle_cgroup_free() handler */ + freed_cgroup_id =3D skel->bss->freed_cgroup_id; + ASSERT_EQ(freed_cgroup_id, cgroups[1].id, "freed cgroup id"); + +cleanup: + cleanup_cgroup_environment(); + test_psi__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/progs/test_psi.c b/tools/testing/s= elftests/bpf/progs/test_psi.c new file mode 100644 index 000000000000..2c36c05a3065 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_psi.c @@ -0,0 +1,76 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include "vmlinux.h" +#include +#include + +char _license[] SEC("license") =3D "GPL"; + +struct mem_cgroup *bpf_get_mem_cgroup(struct cgroup_subsys_state *css) __k= sym; +void bpf_put_mem_cgroup(struct mem_cgroup *memcg) __ksym; +int bpf_out_of_memory(struct mem_cgroup *memcg, int order, bool wait_on_oo= m_lock, + const char *constraint_text__nullable) __ksym; +int
bpf_psi_create_trigger(struct bpf_psi *bpf_psi, u64 cgroup_id, + u32 res, u32 threshold_us, u32 window_us) __ksym; + +#define PSI_FULL 0x80000000 + +/* cgroup which will experience the high memory pressure */ +u64 high_pressure_cgroup_id; + +/* cgroup which will be deleted */ +u64 deleted_cgroup_id; + +/* cgroup which was actually freed */ +u64 freed_cgroup_id; + +char constraint_name[] =3D "CONSTRAINT_BPF_PSI_MEM"; + +SEC("struct_ops.s/init") +int BPF_PROG(psi_init, struct bpf_psi *bpf_psi) +{ + int ret; + + ret =3D bpf_psi_create_trigger(bpf_psi, high_pressure_cgroup_id, + PSI_MEM | PSI_FULL, 100000, 1000000); + if (ret) + return ret; + + return bpf_psi_create_trigger(bpf_psi, deleted_cgroup_id, + PSI_IO, 100000, 1000000); +} + +SEC("struct_ops.s/handle_psi_event") +void BPF_PROG(handle_psi_event, struct psi_trigger *t) +{ + u64 cgroup_id =3D t->cgroup_id; + struct mem_cgroup *memcg; + struct cgroup *cgroup; + + cgroup =3D bpf_cgroup_from_id(cgroup_id); + if (!cgroup) + return; + + memcg =3D bpf_get_mem_cgroup(&cgroup->self); + if (!memcg) { + bpf_cgroup_release(cgroup); + return; + } + + bpf_out_of_memory(memcg, 0, true, constraint_name); + + bpf_put_mem_cgroup(memcg); + bpf_cgroup_release(cgroup); +} + +SEC("struct_ops.s/handle_cgroup_free") +void BPF_PROG(handle_cgroup_free, u64 cgroup_id) +{ + freed_cgroup_id =3D cgroup_id; +} + +SEC(".struct_ops.link") +struct bpf_psi_ops test_bpf_psi =3D { + .init =3D (void *)psi_init, + .handle_psi_event =3D (void *)handle_psi_event, + .handle_cgroup_free =3D (void *)handle_cgroup_free, +}; --=20 2.50.1
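For readers following the resource encoding used by bpf_psi_create_trigger() in patch 13, the sketch below shows how a resource and the "full" bit could be packed into and recovered from a single u32. It is a plain userspace illustration, not code from the series: psi_pack(), psi_unpack() and psi_roundtrip_check() are hypothetical helper names, the enum mirrors the copy in the selftest, and the BPF_PSI_FULL value is assumed to match the selftest's PSI_FULL define (0x80000000).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the enum psi_res copy in prog_tests/test_psi.c. */
enum psi_res {
	PSI_IO,
	PSI_MEM,
	PSI_CPU,
	PSI_IRQ,
	NR_PSI_RESOURCES,
};

/* Assumption: same value as the selftest's PSI_FULL define. */
#define BPF_PSI_FULL 0x80000000u

/* Hypothetical helper: pack a resource and the "full" flag into one u32. */
static uint32_t psi_pack(enum psi_res res, bool full)
{
	return (uint32_t)res | (full ? BPF_PSI_FULL : 0u);
}

/*
 * Hypothetical helper: split a packed value by masking off the top bit,
 * in the same way the kfunc derives res and full from its argument.
 * Returns 0 on success and -EINVAL for an unknown resource.
 */
static int psi_unpack(uint32_t resource, enum psi_res *res, bool *full)
{
	uint32_t r = resource & ~BPF_PSI_FULL;

	if (r >= NR_PSI_RESOURCES)
		return -EINVAL;

	*res = (enum psi_res)r;
	*full = (resource & BPF_PSI_FULL) != 0;
	return 0;
}

/* Asserts that packing and unpacking agree for every resource. */
void psi_roundtrip_check(void)
{
	for (int r = PSI_IO; r < NR_PSI_RESOURCES; r++) {
		for (int f = 0; f <= 1; f++) {
			enum psi_res res;
			bool full;

			assert(psi_unpack(psi_pack((enum psi_res)r, f != 0),
					  &res, &full) == 0);
			assert(res == (enum psi_res)r);
			assert(full == (f != 0));
		}
	}
}
```

A bpf program would pass the packed value as the third kfunc argument, as the selftest does with PSI_MEM | PSI_FULL.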