From nobody Sun Jun 14 04:08:52 2026
From: Vernon Yang
To: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org,
	roman.gushchin@linux.dev, inwardvessel@gmail.com, shakeel.butt@linux.dev,
	ast@kernel.org, daniel@iogearbox.net, surenb@google.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org,
	baohua@kernel.org, lance.yang@linux.dev, dev.jain@arm.com, Vernon Yang
Subject: [PATCH 1/4] psi: add psi_group_flush_stats() function
Date: Mon, 4 May 2026 00:50:21 +0800
Message-ID: <20260503165024.1526680-2-vernon2gm@gmail.com>
X-Mailer: git-send-email 2.53.0
In-Reply-To: <20260503165024.1526680-1-vernon2gm@gmail.com>
References: <20260503165024.1526680-1-vernon2gm@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

From: Vernon Yang

Add psi_group_flush_stats() to prepare for the subsequent mthp_ext eBPF
program. No functional changes.

Signed-off-by: Vernon Yang
---
 include/linux/psi.h |  1 +
 kernel/sched/psi.c  | 34 ++++++++++++++++++++++++++--------
 2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index e0745873e3f2..7b4fd8190810 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -22,6 +22,7 @@ void psi_init(void);
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
+void psi_group_flush_stats(struct psi_group *group);
 int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
 struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
 				       enum psi_res res, struct file *file,
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index d9c9d9480a45..76ffad90b0b5 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1242,11 +1242,35 @@ void psi_cgroup_restart(struct psi_group *group)
 }
 #endif /* CONFIG_CGROUPS */
 
+/*
+ * __psi_group_flush_stats - flush the total stall time of a psi group
+ * @group: psi group to flush
+ */
+static void __psi_group_flush_stats(struct psi_group *group)
+{
+	u64 now;
+
+	/* Update averages before reporting them */
+	mutex_lock(&group->avgs_lock);
+	now = sched_clock();
+	collect_percpu_times(group, PSI_AVGS, NULL);
+	if (now >= group->avg_next_update)
+		group->avg_next_update = update_averages(group, now);
+	mutex_unlock(&group->avgs_lock);
+}
+
+void psi_group_flush_stats(struct psi_group *group)
+{
+	if (static_branch_likely(&psi_disabled))
+		return;
+
+	__psi_group_flush_stats(group);
+}
+
 int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 {
 	bool only_full = false;
 	int full;
-	u64 now;
 
 	if (static_branch_likely(&psi_disabled))
 		return -EOPNOTSUPP;
@@ -1256,13 +1280,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 		return -EOPNOTSUPP;
 #endif
 
-	/* Update averages before reporting them */
-	mutex_lock(&group->avgs_lock);
-	now = sched_clock();
-	collect_percpu_times(group, PSI_AVGS, NULL);
-	if (now >= group->avg_next_update)
-		group->avg_next_update = update_averages(group, now);
-	mutex_unlock(&group->avgs_lock);
+	__psi_group_flush_stats(group);
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	only_full = res == PSI_IRQ;
-- 
2.53.0

From nobody Sun Jun 14 04:08:52 2026
From: Vernon Yang
Subject: [PATCH 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function
Date: Mon, 4 May 2026 00:50:22 +0800
Message-ID: <20260503165024.1526680-3-vernon2gm@gmail.com>
In-Reply-To: <20260503165024.1526680-1-vernon2gm@gmail.com>
References: <20260503165024.1526680-1-vernon2gm@gmail.com>

From: Vernon Yang

Add the bpf_cgroup_{flush_stats,stall}() kfuncs to prepare for the
subsequent mthp_ext eBPF program. No functional changes.

Signed-off-by: Vernon Yang
---
 kernel/bpf/helpers.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c
index 2bb60200c266..87f3072adce3 100644
--- a/kernel/bpf/helpers.c
+++ b/kernel/bpf/helpers.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 
 #include "../../lib/kstrtox.h"
 
@@ -2819,6 +2820,32 @@ __bpf_kfunc struct cgroup *bpf_cgroup_from_id(u64 cgid)
 	return cgrp;
 }
 
+/**
+ * bpf_cgroup_stall - get the total stall time of a cgroup
+ * @cgrp: cgroup struct
+ * @states: psi states
+ *
+ * Return the total stall time in milliseconds.
+ */
+__bpf_kfunc unsigned long bpf_cgroup_stall(struct cgroup *cgrp,
+					   enum psi_states states)
+{
+	struct psi_group *group = cgroup_psi(cgrp);
+
+	return div_u64(group->total[PSI_AVGS][states], NSEC_PER_MSEC);
+}
+
+/**
+ * bpf_cgroup_flush_stats - Flush cgroup's statistics
+ * @cgrp: cgroup struct
+ */
+__bpf_kfunc void bpf_cgroup_flush_stats(struct cgroup *cgrp)
+{
+	struct psi_group *group = cgroup_psi(cgrp);
+
+	psi_group_flush_stats(group);
+}
+
 /**
  * bpf_task_under_cgroup - wrap task_under_cgroup_hierarchy() as a kfunc, test
  * task's membership of cgroup ancestry.
@@ -4732,6 +4759,8 @@ BTF_ID_FLAGS(func, bpf_cgroup_acquire, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_cgroup_release, KF_RELEASE)
 BTF_ID_FLAGS(func, bpf_cgroup_ancestor, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
 BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL)
+BTF_ID_FLAGS(func, bpf_cgroup_stall)
+BTF_ID_FLAGS(func, bpf_cgroup_flush_stats, KF_SLEEPABLE)
 BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU)
 BTF_ID_FLAGS(func, bpf_task_get_cgroup1, KF_ACQUIRE | KF_RCU | KF_RET_NULL)
 #endif
-- 
2.53.0

From nobody Sun Jun 14 04:08:52 2026
From: Vernon Yang
Subject: [PATCH 3/4] mm: introduce bpf_mthp_ops struct ops
Date: Mon, 4 May 2026 00:50:23 +0800
Message-ID: <20260503165024.1526680-4-vernon2gm@gmail.com>
In-Reply-To: <20260503165024.1526680-1-vernon2gm@gmail.com>
References: <20260503165024.1526680-1-vernon2gm@gmail.com>

From: Vernon Yang

Introduce bpf_mthp_ops, which enables eBPF programs to register the
mthp_choose callback via cgroup-bpf.

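The effect of a registered mthp_choose callback can be pictured with a
small userspace sketch. It is only an illustration, assuming (as the
thp_vma_allowable_orders() hunk of this patch does) that the callback's
return value is ANDed into the candidate order bitmask. The names
mthp_choose_fn, apply_mthp_choice(), cap_at_order_4() and deny_all() are
hypothetical and are not kernel API.

```c
#include <stddef.h>

/*
 * Hypothetical callback type: receives the candidate order bitmask
 * (bit N set means order N is a candidate) and returns the orders
 * it permits.
 */
typedef unsigned long (*mthp_choose_fn)(unsigned long orders);

/* Hypothetical policy: permit only orders 0..4. */
static unsigned long cap_at_order_4(unsigned long orders)
{
	return orders & ((1UL << 5) - 1);
}

/* Hypothetical policy: permit nothing, i.e. fall back to base pages. */
static unsigned long deny_all(unsigned long orders)
{
	(void)orders;
	return 0;
}

/*
 * Model of the filtering step: because the result is ANDed into the
 * incoming mask, a callback can only narrow the candidate set.
 */
static unsigned long apply_mthp_choice(unsigned long orders,
				       mthp_choose_fn choose)
{
	if (choose)
		orders &= choose(orders);
	return orders;
}
```

With candidate orders {2, 4, 9}, cap_at_order_4() leaves {2, 4}, while
deny_all() leaves an empty mask, which corresponds to the early
"if (!orders) return 0;" exit in the patch.
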
Use cgroup-bpf to customize the mTHP size for different scenarios and
automatically select different mTHP sizes for different cgroups, with
the goal of making them truly transparent.

Signed-off-by: Vernon Yang
---
 MAINTAINERS                     |   3 +
 include/linux/bpf_huge_memory.h |  35 +++++++
 include/linux/cgroup-defs.h     |   1 +
 include/linux/huge_mm.h         |   6 ++
 mm/Kconfig                      |  14 +++
 mm/Makefile                     |   1 +
 mm/bpf_huge_memory.c            | 169 ++++++++++++++++++++++++++++++++
 7 files changed, 229 insertions(+)
 create mode 100644 include/linux/bpf_huge_memory.h
 create mode 100644 mm/bpf_huge_memory.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 27a073f53cea..39f00676eeb7 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -4887,7 +4887,10 @@ M:	Shakeel Butt
 L:	bpf@vger.kernel.org
 L:	linux-mm@kvack.org
 S:	Maintained
+F:	include/linux/bpf_huge_memory.h
+F:	mm/bpf_huge_memory.c
 F:	mm/bpf_memcontrol.c
+F:	samples/bpf/mthp_ext.*
 
 BPF [MISC]
 L:	bpf@vger.kernel.org
diff --git a/include/linux/bpf_huge_memory.h b/include/linux/bpf_huge_memory.h
new file mode 100644
index 000000000000..1c8a6f7ad8f1
--- /dev/null
+++ b/include/linux/bpf_huge_memory.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef __BPF_HUGE_MEMORY_H
+#define __BPF_HUGE_MEMORY_H
+
+/**
+ * struct bpf_mthp_ops - BPF callbacks for mTHP operations
+ * @mthp_choose: Choose the custom mTHP orders
+ *
+ * This structure defines the interface for BPF programs to customize
+ * mTHP behavior through struct_ops programs.
+ */
+struct bpf_mthp_ops {
+	unsigned long (*mthp_choose)(struct cgroup *cgrp, unsigned long orders);
+};
+
+#if defined(CONFIG_BPF_TRANSPARENT_HUGEPAGE) && defined(CONFIG_BPF_SYSCALL)
+/**
+ * bpf_mthp_choose - Choose the custom mTHP orders using BPF
+ * @mm: task mm_struct
+ * @orders: original orders
+ *
+ * Return the suitable mTHP orders.
+ */
+unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders);
+#else
+static inline unsigned long bpf_mthp_choose(struct mm_struct *mm,
+					    unsigned long orders)
+{
+	return orders;
+}
+#endif /* CONFIG_BPF_TRANSPARENT_HUGEPAGE && CONFIG_BPF_SYSCALL */
+
+#endif /* __BPF_HUGE_MEMORY_H */
+
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index f42563739d2e..78854d0e06ab 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -628,6 +628,7 @@ struct cgroup {
 
 #ifdef CONFIG_BPF_SYSCALL
 	struct bpf_local_storage __rcu *bpf_cgrp_storage;
+	struct bpf_mthp_ops *mthp_ops;
 #endif
 #ifdef CONFIG_EXT_SUB_SCHED
 	struct scx_sched __rcu *scx_sched;
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2949e5acff35..80ec622213df 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -3,6 +3,7 @@
 #define _LINUX_HUGE_MM_H
 
 #include
+#include
 
 #include /* only for vma_is_dax() */
 #include
@@ -291,6 +292,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 				       enum tva_type type,
 				       unsigned long orders)
 {
+	/* The eBPF-specified orders restrict which orders may be selected. */
+	orders &= bpf_mthp_choose(vma->vm_mm, orders);
+	if (!orders)
+		return 0;
+
 	/*
 	 * Optimization to check if required orders are enabled early. Only
 	 * forced collapse ignores sysfs configs.
diff --git a/mm/Kconfig b/mm/Kconfig
index e8bf1e9e6ad9..12382431ddc7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -963,6 +963,20 @@ config NO_PAGE_MAPCOUNT
 
 	  EXPERIMENTAL because the impact of some changes is still unclear.
 
+config BPF_TRANSPARENT_HUGEPAGE
+	bool "BPF-based transparent hugepage (EXPERIMENTAL)"
+	depends on TRANSPARENT_HUGEPAGE
+	help
+	  Use cgroup-bpf to customize the mTHP size for different scenarios
+	  and automatically select different mTHP sizes for different
+	  cgroups, with the goal of making them truly transparent.
+
+	  This is an experimental feature that might go away at any time.
+	  Please do not rely on it in any production environment.
+
+	  EXPERIMENTAL because the BPF interface is unstable and may be removed
+	  at any time.
+
 endif # TRANSPARENT_HUGEPAGE
 
 # simple helper to make the code a bit easier to read
diff --git a/mm/Makefile b/mm/Makefile
index 8ad2ab08244e..b474c21c3253 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -108,6 +108,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
 ifdef CONFIG_BPF_SYSCALL
 obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
+obj-$(CONFIG_BPF_TRANSPARENT_HUGEPAGE) += bpf_huge_memory.o
 endif
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_GUP_TEST) += gup_test.o
diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c
new file mode 100644
index 000000000000..e34e0a35edac
--- /dev/null
+++ b/mm/bpf_huge_memory.c
@@ -0,0 +1,169 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Huge memory related BPF code
+ *
+ * Author: Vernon Yang
+ */
+
+#include
+#include
+
+/* Protects cgrp->mthp_ops pointer for read and write. */
+DEFINE_SRCU(mthp_bpf_srcu);
+
+unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders)
+{
+	struct cgroup *cgrp;
+	struct mem_cgroup *memcg;
+	struct bpf_mthp_ops *ops;
+	int idx;
+
+	memcg = get_mem_cgroup_from_mm(mm);
+	if (!memcg)
+		return orders;
+
+	cgrp = memcg->css.cgroup;
+	ops = READ_ONCE(cgrp->mthp_ops);
+	if (unlikely(ops)) {
+		idx = srcu_read_lock(&mthp_bpf_srcu);
+		if (ops->mthp_choose)
+			orders = ops->mthp_choose(cgrp, orders);
+		srcu_read_unlock(&mthp_bpf_srcu, idx);
+	}
+
+	mem_cgroup_put(memcg);
+
+	return orders;
+}
+
+static int bpf_mthp_ops_btf_struct_access(struct bpf_verifier_log *log,
+	const struct bpf_reg_state *reg, int off, int size)
+{
+	return -EACCES;
+}
+
+static bool bpf_mthp_ops_is_valid_access(int off, int size, enum bpf_access_type type,
+	const struct bpf_prog *prog, struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+const struct bpf_verifier_ops bpf_mthp_verifier_ops = {
+	.get_func_proto = bpf_base_func_proto,
+	.btf_struct_access = bpf_mthp_ops_btf_struct_access,
+	.is_valid_access = bpf_mthp_ops_is_valid_access,
+};
+
+static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
+	struct bpf_mthp_ops *ops = kdata;
+	struct cgroup *cgrp = st_link->cgroup;
+	struct cgroup_subsys_state *pos;
+
+	/* The link is not yet fully initialized, but cgroup should be set */
+	if (!link)
+		return -EOPNOTSUPP;
+
+	cgroup_lock();
+	css_for_each_descendant_pre(pos, &cgrp->self) {
+		struct cgroup *child = pos->cgroup;
+
+		if (READ_ONCE(child->mthp_ops)) {
+			/* TODO
+			 * Do not destroy the cgroup hierarchy property.
+			 * If an eBPF program already exists in the sub-cgroup,
+			 * trigger an error and clear the already set
+			 * bpf_mthp_ops data.
+			 */
+			continue;
+		}
+		WRITE_ONCE(child->mthp_ops, ops);
+	}
+	cgroup_unlock();
+
+	return 0;
+}
+
+static void bpf_mthp_ops_unreg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_struct_ops_link *st_link = (struct bpf_struct_ops_link *)link;
+	struct bpf_mthp_ops *ops = kdata;
+	struct cgroup *cgrp = st_link->cgroup;
+	struct cgroup_subsys_state *pos;
+
+	cgroup_lock();
+	css_for_each_descendant_pre(pos, &cgrp->self) {
+		struct cgroup *child = pos->cgroup;
+
+		if (READ_ONCE(child->mthp_ops) == ops)
+			WRITE_ONCE(child->mthp_ops, NULL);
+	}
+	cgroup_unlock();
+
+	synchronize_srcu(&mthp_bpf_srcu);
+}
+
+static int bpf_mthp_ops_check_member(const struct btf_type *t,
+	const struct btf_member *member,
+	const struct bpf_prog *prog)
+{
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+	switch (moff) {
+	case offsetof(struct bpf_mthp_ops, mthp_choose):
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (prog->sleepable)
+		return -EINVAL;
+
+	return 0;
+}
+
+static int bpf_mthp_ops_init_member(const struct btf_type *t,
+	const struct btf_member *member,
+	void *kdata, const void *udata)
+{
+	return 0;
+}
+
+static int bpf_mthp_ops_init(struct btf *btf)
+{
+	return 0;
+}
+
+static unsigned long cfi_mthp_choose(struct cgroup *cgrp, unsigned long orders)
+{
+	return 0;
+}
+
+static struct bpf_mthp_ops cfi_bpf_mthp_ops = {
+	.mthp_choose = cfi_mthp_choose,
+};
+
+static struct bpf_struct_ops bso_bpf_mthp_ops = {
+	.verifier_ops = &bpf_mthp_verifier_ops,
+	.reg = bpf_mthp_ops_reg,
+	.unreg = bpf_mthp_ops_unreg,
+	.check_member = bpf_mthp_ops_check_member,
+	.init_member = bpf_mthp_ops_init_member,
+	.init = bpf_mthp_ops_init,
+	.name = "bpf_mthp_ops",
+	.owner = THIS_MODULE,
+	.cfi_stubs = &cfi_bpf_mthp_ops,
+};
+
+static int __init bpf_huge_memory_init(void)
+{
+	int err;
+
+	err = register_bpf_struct_ops(&bso_bpf_mthp_ops, bpf_mthp_ops);
+	if (err)
+		pr_warn("Registration of bpf_mthp_ops failed, err %d\n", err);
+
+	return err;
+}
+late_initcall(bpf_huge_memory_init);
-- 
2.53.0

From nobody Sun Jun 14 04:08:52 2026
darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=Kdu5+fSnVWCGaHSDNy5Sqn0Y2MXs1Uyb3JHpXbPoVvc=; b=TgGpS4F86K1VxJ/MHrPzY0fKK8kOj44E3XC5Gi/HGDU9pgaQJ6Rwdf8qPjvRJashvh QNABaF0Do8HtNKDk4P/hbVcwmtUEyvpjmgLRXWe8qM4NwlI9XSOgpVRjz8ySIr/LHJo9 jlKAzLn2l12+2pWENNlqcWXHxfOMLGzcKo4iRf7GFRRNyyo1MoyoslW9V1QtIpSbj2Q7 hKmSLgRGKRHr9WZcQKXqSDPaNaMXgOAGuCbF6dEtZ27fR+ziRn86B9bHORLGuayPI4w0 WR/hHRRtAH8qsxXDef2RXt2eRiSdql0L6EAGdQ7KVzN6LV6IvjlHni7sGurEYqfl9MRK yOEQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777827089; x=1778431889; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=Kdu5+fSnVWCGaHSDNy5Sqn0Y2MXs1Uyb3JHpXbPoVvc=; b=mDCnRITRHB8AiHIuiP7t8SEKRpteNbQTBSegCPf7JOTHZVLgqJ4XSryQDyzQjbe+ol zkr3PWWe6XGnhg7TfezxBhhx2DLzJ0zkpdWe5TVyrPl17q9SkrU1jD+Hg5rs1ekKstv/ vtsVslzuLM+8Nz50FDIz8f5+SheOiNHhJddrPv4vwBf2u8sUC/9n3Dqk4xkAlqcU2TkN G2xFNaWRzQ5C6mMKIhd6eF1qYtVgphHBzNCUX49RtIP8mxOIw1/7cKGANX+o/+T7pser eWcehk9HS10NHuKpDYzpHb5wsGgzv2D6nr0KS+68lBRvD4CZzOGSwXVnFxdX4bFUw4nZ M03g== X-Gm-Message-State: AOJu0Yw6qCUx8t8jm/FGiK2hW/T1oEvUzGPTLtILk1R6iOEw8DL8LsCh doqE1UF+2wU1WwZagtHFDk5GFoQrUAqI3BlVga6z1AKC+ZD/+QjS5WhS X-Gm-Gg: AeBDiet9plZO6jw3lnPteDvWySUqXQ8gdisB5QEGiIMSx3jaNvUS2FmUyHF876vCL7U 4PuQ++WP+tx4DcIc6z9agwAlBOCy+B0Hwbw2a4033ptabjWQk6U7GF1S122sRYQxtbYP90CqltT qy7q8q11IQqxsTmbtaL6/rJeCGUQQQw6giyh6mkeTJq9hrKC5HTFVScn5aBHBe70QGYkY+pLvkS 2ikXtPnW6rbckKixv/0+6VbJadEt1HVMtHlljiQVRM+nphGbeG31RFK3vUBCjOFNK3gjgFZdWif 9R5WQTg8xA+u/Su6+RdAQCH07Z/bo29s3MLYyBc7c5IXkqHc3mnEOiS+foSvlRJhxchanSHJHu4 ptTzpK6FwkPA8FOw/nHfQp0KOIGBwsHGI9G679QSyVYBVbD337Pt3KRiLgxPw3XBTlVKivziruP Qu4zfmEMLlsOrEMA1VQN8sc78RznlCfg4Twy0dEVW3qtNCisI= X-Received: by 2002:a05:6a00:8188:b0:835:351c:f236 with SMTP id 
d2e1a72fcca58-835351cf525mr3911226b3a.29.1777827089059; Sun, 03 May 2026 09:51:29 -0700 (PDT) Received: from localhost.localdomain ([114.231.84.174]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-83707fab756sm1494277b3a.44.2026.05.03.09.51.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 03 May 2026 09:51:28 -0700 (PDT) From: Vernon Yang To: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org, roman.gushchin@linux.dev, inwardvessel@gmail.com, shakeel.butt@linux.dev, ast@kernel.org, daniel@iogearbox.net, surenb@google.com Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org, baohua@kernel.org, lance.yang@linux.dev, dev.jain@arm.com, Vernon Yang Subject: [PATCH 4/4] samples: bpf: add mthp_ext Date: Mon, 4 May 2026 00:50:24 +0800 Message-ID: <20260503165024.1526680-5-vernon2gm@gmail.com> X-Mailer: git-send-email 2.53.0 In-Reply-To: <20260503165024.1526680-1-vernon2gm@gmail.com> References: <20260503165024.1526680-1-vernon2gm@gmail.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Vernon Yang Design mthp_ext case to address real workload issues. The main functions of the mthp_ext are as follows: - When sub-cgroup is under high memory pressure (default, full 100ms 1s), it will automatically fallback to using 4KB. - When the anon+shmem memory usage of sub-cgroup falls below the minimum memory (default 16MB), small-memory processes will automatically fallback to using 4KB. - Under normal conditions, when there is no memory pressure and the anon+shmem memory usage exceeds the minimum memory, all mTHP sizes shall be utilized by kernel. - Monitor the root-cgroup (/sys/fs/cgroup) directory by default, with support for specifying any cgroup directory. 
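
The fallback rules in the list above can be sketched in plain userspace C.
This is only an illustration under stated assumptions, not the code from
this patch: pick_order() and pick_order_selftest() are hypothetical names,
memory usage is assumed to be in bytes, and the stall delta and threshold
are assumed to share one unit.

```c
#include <assert.h>

#define PMD_ORDER 9	/* same value as in mthp_ext.h */

/*
 * Hypothetical helper (not part of the patch): return the highest mTHP
 * order a cgroup may use. 0 means "fall back to 4KB pages".
 */
unsigned int pick_order(unsigned long anon_shmem_bytes,
			unsigned long stall_delta,
			unsigned long min_mem_mb,
			unsigned long stall_threshold)
{
	unsigned long min_bytes = min_mem_mb * 1024UL * 1024UL;

	/* Small cgroup, or stall time at/above the threshold: use 4KB. */
	if (anon_shmem_bytes < min_bytes || stall_delta >= stall_threshold)
		return 0;

	/* Otherwise every order up to PMD_ORDER stays available. */
	return PMD_ORDER;
}

/* Small self-check of the three cases from the commit message. */
void pick_order_selftest(void)
{
	assert(pick_order(8UL << 20, 0, 16, 100) == 0);		/* too small */
	assert(pick_order(64UL << 20, 100, 16, 100) == 0);	/* pressure  */
	assert(pick_order(64UL << 20, 0, 16, 100) == PMD_ORDER);
}
```

The two conditions are OR-ed, so either one alone is enough to force the
4KB fallback; only a cgroup that is both large enough and free of
pressure keeps the full range of orders.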
Signed-off-by: Vernon Yang
---
 samples/bpf/.gitignore     |   1 +
 samples/bpf/Makefile       |   7 +-
 samples/bpf/mthp_ext.bpf.c | 142 ++++++++++++++++
 samples/bpf/mthp_ext.c     | 340 +++++++++++++++++++++++++++++++++++++
 samples/bpf/mthp_ext.h     |  30 ++++
 5 files changed, 519 insertions(+), 1 deletion(-)
 create mode 100644 samples/bpf/mthp_ext.bpf.c
 create mode 100644 samples/bpf/mthp_ext.c
 create mode 100644 samples/bpf/mthp_ext.h

diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
index 0002cd359fb1..2a73581876b4 100644
--- a/samples/bpf/.gitignore
+++ b/samples/bpf/.gitignore
@@ -49,3 +49,4 @@ iperf.*
 /vmlinux.h
 /bpftool/
 /libbpf/
+mthp_ext
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 95a4fa1f1e44..357c7d1c45ef 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -37,6 +37,7 @@ tprogs-y += xdp_fwd
 tprogs-y += task_fd_query
 tprogs-y += ibumad
 tprogs-y += hbm
+tprogs-y += mthp_ext
 
 # Libbpf dependencies
 LIBBPF_SRC = $(TOOLS_PATH)/lib/bpf
@@ -122,6 +123,7 @@ always-y += task_fd_query_kern.o
 always-y += ibumad_kern.o
 always-y += hbm_out_kern.o
 always-y += hbm_edt_kern.o
+always-y += mthp_ext.bpf.o
 
 COMMON_CFLAGS = $(TPROGS_USER_CFLAGS)
 TPROGS_LDFLAGS = $(TPROGS_USER_LDFLAGS)
@@ -289,6 +291,8 @@ $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
 $(obj)/hbm.o: $(src)/hbm.h
 $(obj)/hbm_edt_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
 
+mthp_ext: $(obj)/mthp_ext.skel.h
+
 # Override includes for xdp_sample_user.o because $(srctree)/usr/include in
 # TPROGS_CFLAGS causes conflicts
 XDP_SAMPLE_CFLAGS += -Wall -O2 \
@@ -347,10 +351,11 @@ $(obj)/%.bpf.o: $(src)/%.bpf.c $(obj)/vmlinux.h $(src)/xdp_sample.bpf.h $(src)/x
 	-I$(LIBBPF_INCLUDE) $(CLANG_SYS_INCLUDES) \
 	-c $(filter %.bpf.c,$^) -o $@
 
-LINKED_SKELS := xdp_router_ipv4.skel.h
+LINKED_SKELS := xdp_router_ipv4.skel.h mthp_ext.skel.h
 clean-files += $(LINKED_SKELS)
 
 xdp_router_ipv4.skel.h-deps := xdp_router_ipv4.bpf.o xdp_sample.bpf.o
+mthp_ext.skel.h-deps := mthp_ext.bpf.o
 
 LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.bpf.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
 
diff --git a/samples/bpf/mthp_ext.bpf.c b/samples/bpf/mthp_ext.bpf.c
new file mode 100644
index 000000000000..bbee3e9f679c
--- /dev/null
+++ b/samples/bpf/mthp_ext.bpf.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include "mthp_ext.h"
+#include
+#include
+#include
+#include
+
+struct mem_info {
+	unsigned long stall;
+	unsigned int order;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct mem_info);
+} cgrp_storage SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 256 * 1024);
+} events SEC(".maps");
+
+struct config_local configs;
+
+/*
+ * mthp_choose_impl: Choose the custom mTHP orders. The order is read from
+ * cgrp_storage, which is adjusted by cgroup_scan().
+ * @cgrp: control group
+ * @orders: original orders
+ *
+ * Return the suitable mTHP orders.
+ */
+SEC("struct_ops/mthp_choose")
+unsigned long BPF_PROG(mthp_choose_impl, struct cgroup *cgrp, unsigned long orders)
+{
+	struct mem_info *info;
+	unsigned int order;
+
+	if (configs.fixed) {
+		order = configs.init_order;
+		goto out;
+	}
+
+	info = bpf_cgrp_storage_get(&cgrp_storage, cgrp, 0, 0);
+	if (!info)
+		return orders;
+
+	order = info->order;
+out:
+	if (!order)
+		return 0;
+
+	orders &= BIT(order + 1) - 1;
+	return orders;
+}
+
+SEC(".struct_ops.link")
+struct bpf_mthp_ops mthp_ops = {
+	.mthp_choose = (void *)mthp_choose_impl,
+};
+
+/* backport from kernel/cgroup/cgroup.c */
+static bool cgroup_has_tasks(struct cgroup *cgrp)
+{
+	return cgrp->nr_populated_csets;
+}
+
+/*
+ * cgroup_scan: scan all descendant cgroups under root cgroup.
+ *
+ * 1. When the memory usage of the sub-cgroup falls below the threshold,
+ *    it will automatically fall back to using 4KB size; otherwise, it will
+ *    use all mTHP sizes.
+ * 2. When memory.pressure stall time of the sub-cgroup exceeds ,
+ *    it will automatically fall back to using 4KB size; otherwise, it will
+ *    use all mTHP sizes.
+ *
+ * Returning 1 terminates the iteration loop, and returning 0 continues with
+ * the next sub-cgroup.
+ */
+SEC("iter.s/cgroup")
+int cgroup_scan(struct bpf_iter__cgroup *ctx)
+{
+	struct cgroup *cgrp = ctx->cgroup;
+	struct mem_cgroup *memcg;
+	struct mem_info *info;
+	struct alert_event *e;
+	unsigned long curr_stall;
+	unsigned long curr_mem;
+	unsigned long delta;
+
+	if (!cgrp)
+		return 1;
+
+	if (!cgroup_has_tasks(cgrp))
+		return 0;
+
+	info = bpf_cgrp_storage_get(&cgrp_storage, cgrp, 0,
+				    BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!info)
+		return 0;
+
+	memcg = bpf_get_mem_cgroup(&cgrp->self);
+	if (!memcg)
+		return 0;
+
+	bpf_cgroup_flush_stats(cgrp);
+	curr_stall = bpf_cgroup_stall(cgrp, PSI_MEM_FULL);
+	delta = curr_stall - info->stall;
+	bpf_mem_cgroup_flush_stats(memcg);
+	curr_mem = bpf_mem_cgroup_page_state(memcg, NR_ANON_MAPPED) +
+		   bpf_mem_cgroup_page_state(memcg, NR_SHMEM);
+	if (curr_mem < FROM_MB(configs.min_mem) || delta >= configs.threshold)
+		info->order = 0;
+	else
+		info->order = PMD_ORDER;
+
+	if (configs.debug) {
+		e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
+		if (e) {
+			e->prev_stall = info->stall;
+			e->curr_stall = curr_stall;
+			e->delta = delta;
+			e->mem = curr_mem;
+			e->order = info->order;
+			bpf_probe_read_kernel_str(e->name, sizeof(e->name),
+						  cgrp->kn->name);
+			bpf_ringbuf_submit(e, 0);
+		}
+	}
+
+	info->stall = curr_stall;
+	bpf_put_mem_cgroup(memcg);
+
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "GPL";
diff --git a/samples/bpf/mthp_ext.c b/samples/bpf/mthp_ext.c
new file mode 100644
index 000000000000..0e064bad136f
--- /dev/null
+++ b/samples/bpf/mthp_ext.c
@@ -0,0 +1,340 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "mthp_ext.h"
+#include "mthp_ext.skel.h"
+
+#define DEFAULT_ROOT		"/sys/fs/cgroup"
+#define DEFAULT_THRESHOLD_MS	100UL
+#define DEFAULT_INTERVAL_MS	1000UL
+#define DEFAULT_ORDER		PMD_ORDER
+#define DEFAULT_MIN_MEM		16
+
+static bool exiting;
+
+static void usage(const char *name)
+{
+	fprintf(stderr,
+		"Usage: %s [OPTIONS]\n\n"
+		"Monitor specified cgroup, adjust mTHP size via cgroup_bpf.\n\n"
+		"Currently supports fixed mTHP size and automatic mTHP size adjustment.\n"
+		"By default, it monitors the entire cgroup and automatically\n"
+		"adjusts mTHP size within the specified time window .\n"
+		"1. When the memory size of the sub-cgroup falls below\n"
+		" the threshold, it will automatically fall back to\n"
+		" using 4KB size; otherwise, it will use all mTHP sizes.\n"
+		"2. When memory.pressure stall time of the sub-cgroup exceeds\n"
+		" , it will automatically fall back to using 4KB\n"
+		" size; otherwise, it will use all mTHP sizes.\n\n"
+		"Options:\n"
+		" -r, --root=PATH Root cgroup path (default: /sys/fs/cgroup)\n"
+		" -t, --threshold=MS threshold in ms (default: %lu)\n"
+		" -i, --interval=MS interval in ms (default: %lu)\n"
+		" -o, --order=NR Initial mthp order (default: %d)\n"
+		" -m, --min=MB Minimum memory size for mTHP (default: %d)\n"
+		" -f, --fixed Use fixed order, disable auto-adjustment\n"
+		" -d, --debug Enable debug output\n"
+		" -h, --help Show this help\n",
+		name, DEFAULT_THRESHOLD_MS, DEFAULT_INTERVAL_MS, DEFAULT_ORDER,
+		DEFAULT_MIN_MEM);
+}
+
+static void sig_handler(int sig)
+{
+	exiting = true;
+}
+
+static int setup_psi_trigger(const char *cgroup_path, const char *type,
+			     unsigned long stall_us, unsigned long window_us)
+{
+	char path[PATH_MAX];
+	char trigger[128];
+	int fd, nr;
+
+	snprintf(path, sizeof(path), "%s/memory.pressure", cgroup_path);
+	fd = open(path, O_RDWR | O_NONBLOCK);
+	if (fd < 0) {
+		fprintf(stderr, "ERROR: open PSI file failed\n");
+		return -errno;
+	}
+
+	nr = snprintf(trigger, sizeof(trigger), "%s %lu %lu",
+		      type, stall_us, window_us);
+	if (write(fd, trigger, nr) < 0) {
+		fprintf(stderr, "ERROR: write PSI trigger failed\n");
+		close(fd);
+		return -errno;
+	}
+
+	return fd;
+}
+
+static int trigger_scan(struct bpf_link *iter_link)
+{
+	char buf[256];
+	int fd;
+
+	fd = bpf_iter_create(bpf_link__fd(iter_link));
+	if (fd < 0) {
+		fprintf(stderr, "ERROR: bpf_iter_create failed: %s\n",
+			strerror(errno));
+		return -1;
+	}
+
+	/* Read to trigger the iter program execution */
+	while (read(fd, buf, sizeof(buf)))
+		;
+
+	close(fd);
+	return 0;
+}
+
+static void *monitor_thread(int psi_fd, struct config_local *configs,
+			    struct bpf_link *iter_link, struct ring_buffer *rb)
+{
+	struct epoll_event e;
+	int epoll_fd;
+	int nfds;
+
+	epoll_fd = epoll_create1(0);
+	if (epoll_fd < 0) {
+		fprintf(stderr, "ERROR: epoll_create1 failed\n");
+		return NULL;
+	}
+
+	e.events = EPOLLPRI;
+	e.data.fd = psi_fd;
+	if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, psi_fd, &e)) {
+		fprintf(stderr, "ERROR: epoll_ctl failed\n");
+		goto CLOSE;
+	}
+
+	/* First initialization */
+	trigger_scan(iter_link);
+	if (configs->debug)
+		ring_buffer__poll(rb, 0);
+
+	/* Auto adjustment */
+	while (!exiting) {
+		nfds = epoll_wait(epoll_fd, &e, 1, configs->interval);
+		trigger_scan(iter_link);
+
+		if (configs->debug) {
+			printf("PSI: memory pressure %s\n", nfds ? "high" : "low");
+			ring_buffer__poll(rb, 0);
+		}
+	}
+
+CLOSE:
+	close(epoll_fd);
+	return NULL;
+}
+
+static int handle_event(void *ctx, void *data, size_t len)
+{
+	struct alert_event *e = data;
+
+	printf("cgroup %s: stall %lu -> %lu (+%lu), mem %luMB, mthp order=%d\n",
+	       e->name[0] ? e->name : "/",
+	       e->prev_stall, e->curr_stall, e->delta, TO_MB(e->mem), e->order);
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	const char *root_path = DEFAULT_ROOT;
+	unsigned long threshold = DEFAULT_THRESHOLD_MS;
+	unsigned long interval = DEFAULT_INTERVAL_MS;
+	unsigned int init_order = DEFAULT_ORDER;
+	unsigned int min_mem = DEFAULT_MIN_MEM;
+	bool fixed = false;
+	bool debug = false;
+	struct mthp_ext *skel;
+	struct bpf_link *iter_link;
+	struct bpf_link *ops_link;
+	struct ring_buffer *rb;
+	int root_fd;
+	int psi_fd;
+	int err = 0;
+	int opt;
+
+	static struct option long_options[] = {
+		{"root", required_argument, 0, 'r'},
+		{"threshold", required_argument, 0, 't'},
+		{"interval", required_argument, 0, 'i'},
+		{"order", required_argument, 0, 'o'},
+		{"min", required_argument, 0, 'm'},
+		{"fixed", no_argument, 0, 'f'},
+		{"debug", no_argument, 0, 'd'},
+		{"help", no_argument, 0, 'h'},
+		{0, 0, 0, 0}
+	};
+
+	while ((opt = getopt_long(argc, argv, "r:t:i:o:m:fdh",
+				  long_options, NULL)) != -1) {
+		switch (opt) {
+		case 'r':
+			root_path = optarg;
+			break;
+		case 't':
+			threshold = strtoul(optarg, NULL, 10);
+			break;
+		case 'i':
+			interval = strtoul(optarg, NULL, 10);
+			break;
+		case 'o':
+			init_order = min(strtoul(optarg, NULL, 10), PMD_ORDER);
+			break;
+		case 'm':
+			min_mem = strtoul(optarg, NULL, 10);
+			break;
+		case 'f':
+			fixed = true;
+			break;
+		case 'd':
+			debug = true;
+			break;
+		case 'h':
+			usage(argv[0]);
+			return 0;
+		default:
+			usage(argv[0]);
+			return -EINVAL;
+		}
+	}
+
+	if (!threshold || !interval) {
+		fprintf(stderr, "ERROR: threshold and interval must be > 0\n");
+		usage(argv[0]);
+		return -EINVAL;
+	}
+
+	signal(SIGINT, sig_handler);
+	signal(SIGTERM, sig_handler);
+
+	root_fd = open(root_path, O_RDONLY);
+	if (root_fd < 0) {
+		fprintf(stderr, "ERROR: open '%s' failed: %s\n",
+			root_path, strerror(errno));
+		return -errno;
+	}
+
+	skel = mthp_ext__open();
+	if (!skel) {
+		fprintf(stderr, "ERROR: failed to open BPF skeleton\n");
+		err = -ENOMEM;
+		goto open_skel_fail;
+	}
+
+	skel->bss->configs.threshold = threshold;
+	skel->bss->configs.interval = interval;
+	skel->bss->configs.init_order = init_order;
+	skel->bss->configs.min_mem = min_mem;
+	skel->bss->configs.fixed = fixed;
+	skel->bss->configs.debug = debug;
+
+	err = mthp_ext__load(skel);
+	if (err) {
+		fprintf(stderr, "ERROR: failed to load BPF program: %d\n", err);
+		goto load_skel_fail;
+	}
+
+	/* Attach struct_ops to root cgroup for mthp_choose */
+	DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+	opts.flags = BPF_F_CGROUP_FD;
+	opts.target_fd = root_fd;
+	ops_link = bpf_map__attach_struct_ops_opts(skel->maps.mthp_ops, &opts);
+	err = libbpf_get_error(ops_link);
+	if (err) {
+		fprintf(stderr, "ERROR: attach struct_ops failed: %d\n", err);
+		ops_link = NULL;
+		goto attach_opts_fail;
+	}
+
+	printf("Monitoring : %s\n"
+	       "threshold : %lums\n"
+	       "Interval : %lums\n"
+	       "Initial order : %d%s\n"
+	       "min memory : %dMB\n"
+	       "Debug : %s\n"
+	       "Press Ctrl+C to exit.\n\n",
+	       root_path, threshold, interval, init_order,
+	       fixed ? " (fixed)" : " (auto)", min_mem,
+	       debug ? "on" : "off");
+
+	if (fixed) {
+		while (!exiting)
+			usleep(interval * 1000);
+		goto exit_fixed;
+	}
+
+	/* Auto adjustment, attach cgroup iter for scanning root + descendants */
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, iter_opts);
+	union bpf_iter_link_info linfo = {
+		.cgroup.cgroup_fd = root_fd,
+		.cgroup.order = BPF_CGROUP_ITER_DESCENDANTS_PRE,
+	};
+	iter_opts.link_info = &linfo;
+	iter_opts.link_info_len = sizeof(linfo);
+	iter_link = bpf_program__attach_iter(skel->progs.cgroup_scan, &iter_opts);
+	err = libbpf_get_error(iter_link);
+	if (err) {
+		fprintf(stderr, "ERROR: attach cgroup iter failed: %d\n", err);
+		iter_link = NULL;
+		goto attach_iter_fail;
+	}
+
+	/* Set up ring buffer for receiving alerts */
+	rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
+			      handle_event, NULL, NULL);
+	if (!rb) {
+		fprintf(stderr, "ERROR: failed to create ring buffer\n");
+		err = -ENOMEM;
+		goto rb_fail;
+	}
+
+
+	psi_fd = setup_psi_trigger(root_path, "some", threshold * 1000,
+				   interval * 1000);
+	if (psi_fd < 0) {
+		fprintf(stderr, "ERROR: PSI trigger setup failed\n");
+		goto psi_setup_fail;
+	}
+
+	monitor_thread(psi_fd, &skel->bss->configs, iter_link, rb);
+
+	close(psi_fd);
+psi_setup_fail:
+	ring_buffer__free(rb);
+rb_fail:
+	bpf_link__destroy(iter_link);
+exit_fixed:
+attach_iter_fail:
+	bpf_link__destroy(ops_link);
+attach_opts_fail:
+load_skel_fail:
+	mthp_ext__destroy(skel);
+open_skel_fail:
+	close(root_fd);
+
+	printf("\nExiting...\n");
+
+	return err;
+}
diff --git a/samples/bpf/mthp_ext.h b/samples/bpf/mthp_ext.h
new file mode 100644
index 000000000000..33dc01bcebd3
--- /dev/null
+++ b/samples/bpf/mthp_ext.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __MTHP_EXT_H__
+#define __MTHP_EXT_H__
+
+#define CGROUP_NAME_LEN	128
+#define PMD_ORDER	9
+#define min(a, b)	((a) < (b) ? (a) : (b))
+#define FROM_MB(s)	((s) * 1024UL * 1024UL)
+#define TO_MB(s)	((s) / 1024 / 1024)
+
+struct config_local {
+	unsigned long threshold;
+	unsigned long interval;
+	unsigned int init_order;
+	unsigned int min_mem;
+	bool fixed;
+	bool debug;
+};
+
+struct alert_event {
+	unsigned long prev_stall;
+	unsigned long curr_stall;
+	unsigned long delta;
+	unsigned long mem;
+	unsigned int order;
+	char name[CGROUP_NAME_LEN];
+};
+
+#endif /* __MTHP_EXT_H__ */
-- 
2.53.0
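
For readers checking the order clamp in mthp_choose_impl(), the mask
arithmetic can be tried in userspace. The sketch below is an illustration
only: BIT() is redefined locally and clamp_orders() is a hypothetical name
that does not appear in the patch.

```c
#include <assert.h>

#define BIT(n) (1UL << (n))

/*
 * Hypothetical userspace copy of the clamp: keep only the candidate
 * orders that are <= order. An order of 0 yields an empty mask, which
 * the patch describes as falling back to 4KB pages.
 */
unsigned long clamp_orders(unsigned long orders, unsigned int order)
{
	if (!order)
		return 0;

	/* BIT(order + 1) - 1 has bits 0..order set. */
	return orders & (BIT(order + 1) - 1);
}

/* Self-check with orders 2..9 enabled (0x3fc) as the starting mask. */
void clamp_orders_selftest(void)
{
	assert(clamp_orders(0x3fcUL, 9) == 0x3fcUL);	/* nothing removed */
	assert(clamp_orders(0x3fcUL, 4) == 0x1cUL);	/* orders 2..4 kept */
	assert(clamp_orders(0x3fcUL, 0) == 0);		/* mTHP disabled */
}
```

Because the mask only removes bits, the result never enables an order
that the caller did not already offer in orders.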