From nobody Sat Jun 13 07:47:24 2026
From: Vernon Yang
To: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org, roman.gushchin@linux.dev, inwardvessel@gmail.com, shakeel.butt@linux.dev, ast@kernel.org, daniel@iogearbox.net, surenb@google.com
Cc: tz2294@columbia.edu, baohua@kernel.org, lance.yang@linux.dev, dev.jain@arm.com, laoar.shao@gmail.com, gutierrez.asier@huawei-partners.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org, Vernon Yang
Subject: [PATCH v2 1/4] psi: add psi_group_flush_stats() function
Date: Fri, 8 May 2026 23:00:52 +0800
Message-ID: <20260508150055.680136-2-vernon2gm@gmail.com>
X-Mailer: git-send-email 2.53.0
In-Reply-To: <20260508150055.680136-1-vernon2gm@gmail.com>
References: <20260508150055.680136-1-vernon2gm@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

From: Vernon Yang

Add psi_group_flush_stats() in preparation for the subsequent mthp_ext
eBPF program. The "update averages" step of psi_show() is factored out
into __psi_group_flush_stats() so that it can be reused. No functional
change.

Signed-off-by: Vernon Yang
---
 include/linux/psi.h |  1 +
 kernel/sched/psi.c  | 34 ++++++++++++++++++++++++++--------
 2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index e0745873e3f2..7b4fd8190810 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -22,6 +22,7 @@ void psi_init(void);
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
+void psi_group_flush_stats(struct psi_group *group);
 int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
 struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
 				       enum psi_res res, struct file *file,
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index d9c9d9480a45..76ffad90b0b5 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -1242,11 +1242,35 @@ void psi_cgroup_restart(struct psi_group *group)
 }
 #endif /* CONFIG_CGROUPS */
 
+/*
+ * __psi_group_flush_stats - flush the total stall time of a psi group
+ * @group: psi group to flush
+ */
+static void __psi_group_flush_stats(struct psi_group *group)
+{
+	u64 now;
+
+	/* Update averages before reporting them */
+	mutex_lock(&group->avgs_lock);
+	now = sched_clock();
+	collect_percpu_times(group, PSI_AVGS, NULL);
+	if (now >= group->avg_next_update)
+		group->avg_next_update = update_averages(group, now);
+	mutex_unlock(&group->avgs_lock);
+}
+
+void psi_group_flush_stats(struct psi_group *group)
+{
+	if (static_branch_likely(&psi_disabled))
+		return;
+
+	__psi_group_flush_stats(group);
+}
+
 int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 {
 	bool only_full = false;
 	int full;
-	u64 now;
 
 	if (static_branch_likely(&psi_disabled))
 		return -EOPNOTSUPP;
@@ -1256,13 +1280,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 		return -EOPNOTSUPP;
 #endif
 
-	/* Update averages before reporting them */
-	mutex_lock(&group->avgs_lock);
-	now = sched_clock();
-	collect_percpu_times(group, PSI_AVGS, NULL);
-	if (now >= group->avg_next_update)
-		group->avg_next_update = update_averages(group, now);
-	mutex_unlock(&group->avgs_lock);
+	__psi_group_flush_stats(group);
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	only_full = res == PSI_IRQ;
-- 
2.53.0

From nobody Sat Jun 13 07:47:24 2026
From: Vernon Yang
To: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org, roman.gushchin@linux.dev, inwardvessel@gmail.com, shakeel.butt@linux.dev, ast@kernel.org, daniel@iogearbox.net, surenb@google.com
Cc: tz2294@columbia.edu, baohua@kernel.org, lance.yang@linux.dev, dev.jain@arm.com, laoar.shao@gmail.com, gutierrez.asier@huawei-partners.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org, Vernon Yang
Subject: [PATCH v2 2/4] bpf: add bpf_cgroup_{flush_stats,stall} function
Date: Fri, 8 May 2026 23:00:53 +0800
Message-ID: <20260508150055.680136-3-vernon2gm@gmail.com>
X-Mailer: git-send-email 2.53.0
In-Reply-To: <20260508150055.680136-1-vernon2gm@gmail.com>
References: <20260508150055.680136-1-vernon2gm@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

From: Vernon Yang

Add the bpf_cgroup_flush_stats() and bpf_cgroup_stall() kfuncs in
preparation for the subsequent mthp_ext eBPF program. Existing behavior
is unchanged.

Signed-off-by: Vernon Yang --- include/linux/psi.h | 4 ++++ kernel/bpf/helpers.c | 34 ++++++++++++++++++++++++++++++++++ 2 files changed, 38 insertions(+) diff --git a/include/linux/psi.h b/include/linux/psi.h index 7b4fd8190810..243dcf97bea4 100644 --- a/include/linux/psi.h +++ b/include/linux/psi.h @@ -52,6 +52,10 @@ static inline void psi_memstall_enter(unsigned long *fla= gs) {} static inline void psi_memstall_leave(unsigned long *flags) {} =20 #ifdef CONFIG_CGROUPS +static inline struct psi_group *cgroup_psi(struct cgroup *cgrp) +{ + return NULL; +} static inline int psi_cgroup_alloc(struct cgroup *cgrp) { return 0; diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 2bb60200c266..1c353e0ff14f 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -29,6 +29,7 @@ #include #include #include +#include =20 #include "../../lib/kstrtox.h" =20 @@ -2881,6 +2882,37 @@ bpf_task_get_cgroup1(struct task_struct *task, int h= ierarchy_id) return NULL; return cgrp; } + +/** + * bpf_cgroup_stall - acquire the total stall time of cgroup + * @cgrp: cgroup struct + * @states: psi states + * + * Return the total stall time. 
+ */ +__bpf_kfunc u64 bpf_cgroup_stall(struct cgroup *cgrp, enum psi_states stat= es) +{ + struct psi_group *group =3D cgroup_psi(cgrp); + + if (unlikely(!group || (u32)states >=3D NR_PSI_STATES - 1)) + return (u64)-1; + + return div_u64(group->total[PSI_AVGS][states], NSEC_PER_MSEC); +} + +/** + * bpf_cgroup_flush_stats - Flush cgroup's statistics + * @cgrp: cgroup struct + */ +__bpf_kfunc void bpf_cgroup_flush_stats(struct cgroup *cgrp) +{ + struct psi_group *group =3D cgroup_psi(cgrp); + + if (unlikely(!group)) + return; + + psi_group_flush_stats(group); +} #endif /* CONFIG_CGROUPS */ =20 /** @@ -4734,6 +4766,8 @@ BTF_ID_FLAGS(func, bpf_cgroup_ancestor, KF_ACQUIRE | = KF_RCU | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_cgroup_from_id, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_task_under_cgroup, KF_RCU) BTF_ID_FLAGS(func, bpf_task_get_cgroup1, KF_ACQUIRE | KF_RCU | KF_RET_NULL) +BTF_ID_FLAGS(func, bpf_cgroup_stall) +BTF_ID_FLAGS(func, bpf_cgroup_flush_stats, KF_SLEEPABLE) #endif BTF_ID_FLAGS(func, bpf_task_from_pid, KF_ACQUIRE | KF_RET_NULL) BTF_ID_FLAGS(func, bpf_task_from_vpid, KF_ACQUIRE | KF_RET_NULL) --=20 2.53.0 From nobody Sat Jun 13 07:47:24 2026 Received: from mail-pf1-f178.google.com (mail-pf1-f178.google.com [209.85.210.178]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9127C3FA5D9 for ; Fri, 8 May 2026 15:01:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.178 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778252493; cv=none; b=Oi0QkPq9dCAf4D3mcv4WQaLobAwnPRR66+FenRRz2FQVmC6uXrgn8Wl4LzSRuSO4RTdztse7q1myo6xg28ARkTa5zIx22c4EaeFkDAux/yqu01+01fsp2auKmWMdZReD4Fms5ttZHrCjV2AvNpjpvNSnaO+412i5jDGZX+IaAEY= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778252493; c=relaxed/simple; bh=c1WOAjKtLto+9++pozNr1RGscclDkwpzhv6f1ahBKYM=; 
h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=DFbbB9kLGguVriYf+8G1uM/5vmIbtW6+vwcgcTCWo4ruDRLIZnu/PI1MtDtplMCS3/C5D+ZhzjZqCo8IWvHeuPLSDL4CRwlUBnlvW9VjRX7C9NpFuicpp4Sb+2iw26noSZ7c3MzzezoDqZpxS+QzcgrM5fEmJAcIl32+8F1nSW8= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=bAKlzhZ+; arc=none smtp.client-ip=209.85.210.178 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="bAKlzhZ+" Received: by mail-pf1-f178.google.com with SMTP id d2e1a72fcca58-8367df48711so971526b3a.1 for ; Fri, 08 May 2026 08:01:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1778252491; x=1778857291; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=dsM3MZHPJG4jSqIS5oTV4ZYkVaqZoomzaAcAboB87ak=; b=bAKlzhZ+BggOlVGHr+4e0NzY763oBkFE7BWjiNQPulP2EMZ0CSs8YJDBNfIsVh/Wec oQ/d0BH3USr/FWSIizuJwegs9PhE75lOexT3NH37EzUun4PTCQ2yC/lwQXf51m41wG4J tfLuK/DVyCcxZ+jpeltPk/NHPlBURGPCUVLBaYac6De6OoW+oZqHh/eAI/4eykPMETMo 7AQDdv/RKbuIbyQkIa6KbwZamqOda36w9oN15vXygSE8pG20jKftt1lFZu42NlUNGql+ FMcJClfr5XK1/gKsybCcmQNa/2iB+NI6EJLsBCtryCP7UgItPsvTOBlysvZDvhSudX8w JGLQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1778252491; x=1778857291; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=dsM3MZHPJG4jSqIS5oTV4ZYkVaqZoomzaAcAboB87ak=; 
b=K5yFjarQ05BWrsRi6wrMR0yzw/IfkdLuOjgB/cpHIZYemzLXjSmthEyrlNIF4dV8F/ Oz5BdNyTnrsMOoyOdGODoTT3VzUBsGfnK60NbxIM8hDnVzVZs3ZqZrIKQwGZwCgpnt8K K+jj+FoE+EdeX8UqZzSMdqYCEFBrPPHHRE96IKuJfeFkUulbBUFOEQ0+ADMNUMcdK1J1 cSJtJGRtNMFrqwr57148i7+AVC8zPZOTVf4AhJmQBt/sXSWyTZcPawtuv7JDzxk8/xT6 jJnINksrnHbXLSrxSDxZAlWg4fs5c7N/ZA/fL5+/++saheqxUi8fJqjJo+JxAcuRJqVF OIjQ== X-Forwarded-Encrypted: i=1; AFNElJ87MhLsDasyMXlWr6oXAsBc/glbE7EsdjTfUbiopEjhwu5D0efEIUUgV6wzVupIruIGjUfPCUzI3dqLzw4=@vger.kernel.org X-Gm-Message-State: AOJu0YwBQSOnCiKP5sGpnkmO/+oIllFeaBFv/YMtx4fOWzq0pdgDleRi lmuzx3hk2CPSsfGydCYpNWgF9qWKSpNaY0YHEA2ySMANhiwx1y+x0b5y X-Gm-Gg: Acq92OHYlhIy9NxnFHtqBB/0NW+u1ZKhcJkEP46wCaRl9vue/w/HW+jqZ3SyF1GbM73 qVSb9hfJSVtDAa4w1XV8HwVGrK2X6TK7O0IdITkjOhMZEBU/DURy1Q//TevKWel/tO9i9dzn42A HlxYff/wwiH82Dh0jISSEv5De2GZLleZc5ARjB6XlR+OzJZE7+ab/zTjO9gfa1DiZOjHiAW6+VR 9grsBlBtx7FSNZR0t+Zm9spN8IyJumIMy1e3OXfju2YkqN5/2kBbMCfAzYNwncnRhL9sEAbvBuf Bt1TktyHyPdrece99ix3l2wBAQ7quV8cznXIyThZs3Ts1YdCVTwi61dsookJO+5eBnFSk1MCNrf TqU3cavMalyi4bp9lqp+MQt1M7nFZzbi7BYhgI3W+RH7l2I0OylrBpQj5T+kSRiKgn2YKitf4TH O9/Ls+J0SbVrgqiVIu1Jp6t0jvyMulYybdMBl25gaWBRMNgTo= X-Received: by 2002:a05:6a00:39a5:b0:82c:2205:507d with SMTP id d2e1a72fcca58-83a5dd5bc34mr12224188b3a.36.1778252490591; Fri, 08 May 2026 08:01:30 -0700 (PDT) Received: from localhost.localdomain ([114.231.84.174]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-83965945c1bsm13110064b3a.15.2026.05.08.08.01.23 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 08 May 2026 08:01:29 -0700 (PDT) From: Vernon Yang To: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org, roman.gushchin@linux.dev, inwardvessel@gmail.com, shakeel.butt@linux.dev, ast@kernel.org, daniel@iogearbox.net, surenb@google.com Cc: tz2294@columbia.edu, baohua@kernel.org, lance.yang@linux.dev, dev.jain@arm.com, laoar.shao@gmail.com, gutierrez.asier@huawei-partners.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, bpf@vger.kernel.org, Vernon Yang Subject: [PATCH v2 3/4] 
mm: introduce bpf_mthp_ops struct ops Date: Fri, 8 May 2026 23:00:54 +0800 Message-ID: <20260508150055.680136-4-vernon2gm@gmail.com> X-Mailer: git-send-email 2.53.0 In-Reply-To: <20260508150055.680136-1-vernon2gm@gmail.com> References: <20260508150055.680136-1-vernon2gm@gmail.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable From: Vernon Yang Introducing bpf_mthp_ops enables eBPF programs to register the mthp_choose callback function via cgroup-ebpf. Using cgroup-bpf to customize mTHP size for different scenarios=EF=BC=8C automatically select different mTHP sizes for different cgroups, let's focus on making them truly transparent. Signed-off-by: Vernon Yang --- MAINTAINERS | 3 + include/linux/bpf_huge_memory.h | 52 ++++++++++ include/linux/cgroup-defs.h | 1 + include/linux/huge_mm.h | 6 ++ kernel/cgroup/cgroup.c | 2 + mm/Kconfig | 14 +++ mm/Makefile | 1 + mm/bpf_huge_memory.c | 168 ++++++++++++++++++++++++++++++++ 8 files changed, 247 insertions(+) create mode 100644 include/linux/bpf_huge_memory.h create mode 100644 mm/bpf_huge_memory.c diff --git a/MAINTAINERS b/MAINTAINERS index caaa0d6e6056..f1113eaa1193 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -4887,7 +4887,10 @@ M: Shakeel Butt L: bpf@vger.kernel.org L: linux-mm@kvack.org S: Maintained +F: include/linux/bpf_huge_memory.h +F: mm/bpf_huge_memory.c F: mm/bpf_memcontrol.c +F: samples/bpf/mthp_ext.* =20 BPF [MISC] L: bpf@vger.kernel.org diff --git a/include/linux/bpf_huge_memory.h b/include/linux/bpf_huge_memor= y.h new file mode 100644 index 000000000000..ffda445c9572 --- /dev/null +++ b/include/linux/bpf_huge_memory.h @@ -0,0 +1,52 @@ +/* SPDX-License-Identifier: GPL-2.0+ */ + +#ifndef __BPF_HUGE_MEMORY_H +#define __BPF_HUGE_MEMORY_H + +#include + +/** + * struct bpf_mthp_ops - BPF callbacks for mTHP operations + * @mthp_choose: Choose the custom 
mTHP orders + * + * This structure defines the interface for BPF programs to customize + * mTHP behavior through struct_ops programs. + */ +struct bpf_mthp_ops { + unsigned long (*mthp_choose)(struct cgroup *cgrp, unsigned long orders); +}; + +#ifdef CONFIG_BPF_TRANSPARENT_HUGEPAGE +/** + * bpf_mthp_choose - Choose the custom mTHP orders using bpf + * @mm: task mm_struct + * @orders: original orders + * + * Return suited mTHP orders. + */ +unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders); + +/** + * cgroup_bpf_set_mthp_ops - Set sub-cgroup mthp_ops to parent cgroup + * @cgrp: want to set mthp_ops of sub-cgroup + * @parent: parent cgroup + */ +static inline void cgroup_bpf_set_mthp_ops(struct cgroup *cgrp, + struct cgroup *parent) +{ + WRITE_ONCE(cgrp->mthp_ops, parent->mthp_ops); +} +#else +static inline unsigned long bpf_mthp_choose(struct mm_struct *mm, + unsigned long orders) +{ + return orders; +} +static inline void cgroup_bpf_set_mthp_ops(struct cgroup *cgrp, + struct cgroup *parent) +{ +} +#endif /* CONFIG_BPF_TRANSPARENT_HUGEPAGE */ + +#endif /* __BPF_HUGE_MEMORY_H */ + diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index f42563739d2e..78854d0e06ab 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -628,6 +628,7 @@ struct cgroup { =20 #ifdef CONFIG_BPF_SYSCALL struct bpf_local_storage __rcu *bpf_cgrp_storage; + struct bpf_mthp_ops *mthp_ops; #endif #ifdef CONFIG_EXT_SUB_SCHED struct scx_sched __rcu *scx_sched; diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 127f9e1e7604..65da35fb0980 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -3,6 +3,7 @@ #define _LINUX_HUGE_MM_H =20 #include +#include =20 #include /* only for vma_is_dax() */ #include @@ -296,6 +297,11 @@ unsigned long thp_vma_allowable_orders(struct vm_area_= struct *vma, enum tva_type type, unsigned long orders) { + /* The eBPF-specified orders overrides which order is selected. 
*/ + orders &=3D bpf_mthp_choose(vma->vm_mm, orders); + if (!orders) + return 0; + /* * Optimization to check if required orders are enabled early. Only * forced collapse ignores sysfs configs. diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 43adc96c7f1a..1dbef3e8b179 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -5836,6 +5836,8 @@ static struct cgroup *cgroup_create(struct cgroup *pa= rent, const char *name, if (ret) goto out_stat_exit; =20 + cgroup_bpf_set_mthp_ops(cgrp, parent); + for (tcgrp =3D cgrp; tcgrp; tcgrp =3D cgroup_parent(tcgrp)) cgrp->ancestors[tcgrp->level] =3D tcgrp; =20 diff --git a/mm/Kconfig b/mm/Kconfig index 27dc5b0139ba..be49bde783a7 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -949,6 +949,20 @@ config NO_PAGE_MAPCOUNT =20 EXPERIMENTAL because the impact of some changes is still unclear. =20 +config BPF_TRANSPARENT_HUGEPAGE + bool "BPF-based transparent hugepage (EXPERIMENTAL)" + depends on TRANSPARENT_HUGEPAGE && CGROUP_BPF + help + Using cgroup-bpf to customize mTHP size for different scenarios, + automatically select different mTHP sizes for different cgroups, + let's focus on making them truly transparent. + + This is an experimental feature, that might go away at any time, + Please do not rely any production environment. + + EXPERIMENTAL because the BPF interface is unstable and may be removed + at any time. 
+ endif # TRANSPARENT_HUGEPAGE =20 # simple helper to make the code a bit easier to read diff --git a/mm/Makefile b/mm/Makefile index 8ad2ab08244e..b474c21c3253 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -108,6 +108,7 @@ obj-$(CONFIG_MEMCG) +=3D swap_cgroup.o endif ifdef CONFIG_BPF_SYSCALL obj-$(CONFIG_MEMCG) +=3D bpf_memcontrol.o +obj-$(CONFIG_BPF_TRANSPARENT_HUGEPAGE) +=3D bpf_huge_memory.o endif obj-$(CONFIG_CGROUP_HUGETLB) +=3D hugetlb_cgroup.o obj-$(CONFIG_GUP_TEST) +=3D gup_test.o diff --git a/mm/bpf_huge_memory.c b/mm/bpf_huge_memory.c new file mode 100644 index 000000000000..851c6ebe2933 --- /dev/null +++ b/mm/bpf_huge_memory.c @@ -0,0 +1,168 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Huge memory related BPF code + * + * Author: Vernon Yang + */ + +#include +#include + +/* Protects cgrp->mthp_ops pointer for read and write. */ +DEFINE_SRCU(mthp_bpf_srcu); + +unsigned long bpf_mthp_choose(struct mm_struct *mm, unsigned long orders) +{ + struct cgroup *cgrp; + struct mem_cgroup *memcg; + struct bpf_mthp_ops *ops; + int idx; + + memcg =3D get_mem_cgroup_from_mm(mm); + if (!memcg) + return orders; + + cgrp =3D memcg->css.cgroup; + + idx =3D srcu_read_lock(&mthp_bpf_srcu); + ops =3D READ_ONCE(cgrp->mthp_ops); + if (unlikely(ops && ops->mthp_choose)) + orders =3D ops->mthp_choose(cgrp, orders); + srcu_read_unlock(&mthp_bpf_srcu, idx); + + mem_cgroup_put(memcg); + + return orders; +} + +static int bpf_mthp_ops_btf_struct_access(struct bpf_verifier_log *log, + const struct bpf_reg_state *reg, int off, int size) +{ + return -EACCES; +} + +static bool bpf_mthp_ops_is_valid_access(int off, int size, enum bpf_acces= s_type type, + const struct bpf_prog *prog, struct bpf_insn_access_aux *info) +{ + return bpf_tracing_btf_ctx_access(off, size, type, prog, info); +} + +const struct bpf_verifier_ops bpf_mthp_verifier_ops =3D { + .get_func_proto =3D bpf_base_func_proto, + .btf_struct_access =3D bpf_mthp_ops_btf_struct_access, + .is_valid_access =3D 
bpf_mthp_ops_is_valid_access, +}; + +static int bpf_mthp_ops_reg(void *kdata, struct bpf_link *link) +{ + struct bpf_struct_ops_link *st_link =3D (struct bpf_struct_ops_link *)lin= k; + struct bpf_mthp_ops *ops =3D kdata; + struct cgroup_subsys_state *child; + struct cgroup *cgrp; + + if (!link) + return -EOPNOTSUPP; + + cgrp =3D st_link->cgroup; + if (!cgrp) + return -EINVAL; + + cgroup_lock(); + css_for_each_descendant_pre(child, &cgrp->self) { + if (READ_ONCE(child->cgroup->mthp_ops)) { + pr_warn("sub-cgroup has already registered.\n"); + cgroup_unlock(); + return -EBUSY; + } + } + css_for_each_descendant_pre(child, &cgrp->self) + WRITE_ONCE(child->cgroup->mthp_ops, ops); + cgroup_unlock(); + + return 0; +} + +static void bpf_mthp_ops_unreg(void *kdata, struct bpf_link *link) +{ + struct bpf_struct_ops_link *st_link =3D (struct bpf_struct_ops_link *)lin= k; + struct cgroup_subsys_state *child; + struct cgroup *cgrp; + + if (!link) + return; + + cgrp =3D st_link->cgroup; + if (!cgrp) + return; + + cgroup_lock(); + css_for_each_descendant_pre(child, &cgrp->self) + WRITE_ONCE(child->cgroup->mthp_ops, NULL); + cgroup_unlock(); + + synchronize_srcu(&mthp_bpf_srcu); +} + +static int bpf_mthp_ops_check_member(const struct btf_type *t, + const struct btf_member *member, + const struct bpf_prog *prog) +{ + u32 moff =3D __btf_member_bit_offset(t, member) / 8; + + switch (moff) { + case offsetof(struct bpf_mthp_ops, mthp_choose): + break; + default: + return -EINVAL; + } + + if (prog->sleepable) + return -EINVAL; + + return 0; +} + +static int bpf_mthp_ops_init_member(const struct btf_type *t, + const struct btf_member *member, + void *kdata, const void *udata) +{ + return 0; +} + +static int bpf_mthp_ops_init(struct btf *btf) +{ + return 0; +} + +static unsigned long cfi_mthp_choose(struct cgroup *cgrp, unsigned long or= ders) +{ + return 0; +} + +static struct bpf_mthp_ops cfi_bpf_mthp_ops =3D { + .mthp_choose =3D cfi_mthp_choose, +}; + +static struct bpf_struct_ops 
bso_bpf_mthp_ops =3D { + .verifier_ops =3D &bpf_mthp_verifier_ops, + .reg =3D bpf_mthp_ops_reg, + .unreg =3D bpf_mthp_ops_unreg, + .check_member =3D bpf_mthp_ops_check_member, + .init_member =3D bpf_mthp_ops_init_member, + .init =3D bpf_mthp_ops_init, + .name =3D "bpf_mthp_ops", + .owner =3D THIS_MODULE, + .cfi_stubs =3D &cfi_bpf_mthp_ops, +}; + +static int __init bpf_huge_memory_init(void) +{ + int err; + + err =3D register_bpf_struct_ops(&bso_bpf_mthp_ops, bpf_mthp_ops); + if (err) + pr_warn("Registration of bpf_mthp_ops failed, err %d\n", err); + + return err; +} +late_initcall(bpf_huge_memory_init); --=20 2.53.0 From nobody Sat Jun 13 07:47:24 2026 Received: from mail-pf1-f169.google.com (mail-pf1-f169.google.com [209.85.210.169]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D358C3F7AA5 for ; Fri, 8 May 2026 15:01:40 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.169 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778252503; cv=none; b=ECL5WlBPy6UWrrRBrpmYvlf0N7OitRdSwlKqaNRdRjH9+QVjOkbBTH2b7zXMm4qUZr35Pkyjh1rQJEnAuvPl2J++DG+9XHuZ6ItONXP150op+SefU15q3sj43uq90gMwRUmVVxanoG2nhJvQPDh4GvelCC2N87aW9lCo3hRL+t8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778252503; c=relaxed/simple; bh=3d/Fyjd46ZVO2gRY09JC7UlwzifghFjy4ul4H0Kss9s=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=jKTx6WEoYyECPHyXvxUqrBkToSSOFEZkt3YSwjU6LAZd95qDMvnN6bDggHO1vaxmn0wHOvr2QUm8o6XyDXADuK0hLroLWQorrfI9a5MflQbd799E3/gKPCSGwq0vUMrqLf45r/tsL9d/KpvERoZXquq5M+uKhT2meNhI2npUT4g= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=SJZHPn+3; arc=none smtp.client-ip=209.85.210.169 
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="SJZHPn+3" Received: by mail-pf1-f169.google.com with SMTP id d2e1a72fcca58-8353fd1cb5fso1056204b3a.0 for ; Fri, 08 May 2026 08:01:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1778252500; x=1778857300; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=hsLW/tTbgLTFufuJHQiQcJEA4R3DBVkdsJ0iG5BPYaY=; b=SJZHPn+3WU33MM+Ov+qrPF2bY1aFl/gaUrBvJIa0BBNX6s01pgWJropr+kDA2+fMXj VbORGibVMnXqZai7a3M5GLTVqnNDSF6BFIu6Q1akuDb5G434mQJzWqigL3nBMtQHttNT yPgm5yFFcI59cFCnCXpBXdL8X/qeBtmXOuMVdW12asPcQZ+O0niB6o1o2VeFs96Er2JR l4NMiA8cHJGDHaZ3l7DzteLXX0R5fc1wVikiPDhQxaKgtRzqQfI2G3LffAvxPtCIxRRx SS3W8bjUjfSdkSpDs1Ha4J/aFNlbnxceNjKjiFutJ+T9A80KTkgy1ArCxeNGC7NHHywC nrYw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1778252500; x=1778857300; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=hsLW/tTbgLTFufuJHQiQcJEA4R3DBVkdsJ0iG5BPYaY=; b=pvEeMghtTCWSHiyW75Mbd7hNZCqVSEhX3ztbqU5O4vXFDvEYu/o8glNQmpa/Zg9voV UM9xqMVSPtidWiUu9KPYSBgvYZp5PwHGJvbYXyvP5VKoShoTXhrDqVj1VmMaD0gQhtyr FfqdXx56ouUHGCJWLgO/X843ZRZMoMMA7/L3zC+lEAXPCQjYtq8JZkJbFghEoRZQvLlb 2Bxb0EM7amXdHaUZa7oM5jdP7V1jlJbUirVM5KyqY5Vx6g0gzRXYG9M59oAoyZMHMHGm 2tcTAVw6SITK+qgbFip3CNH1aSAIK2G3/5JznxND4OBOk5hNTLHFzI7mpHOoHUM/al/4 tTdw== X-Forwarded-Encrypted: i=1; AFNElJ981q+VtGqghhibcsTc5l4M8gozxFmZdwEKofBryVe5qAR9Foz6iMNsPT3AWRZ5rOY9cMp5IYYt9N/TiQI=@vger.kernel.org X-Gm-Message-State: 
From: Vernon Yang
To: akpm@linux-foundation.org, david@kernel.org, ljs@kernel.org,
	roman.gushchin@linux.dev, inwardvessel@gmail.com,
	shakeel.butt@linux.dev, ast@kernel.org, daniel@iogearbox.net,
	surenb@google.com
Cc: tz2294@columbia.edu, baohua@kernel.org, lance.yang@linux.dev,
	dev.jain@arm.com, laoar.shao@gmail.com,
	gutierrez.asier@huawei-partners.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, bpf@vger.kernel.org, Vernon Yang
Subject: [PATCH v2 4/4] samples: bpf: add mthp_ext
Date: Fri, 8 May 2026 23:00:55 +0800
Message-ID: <20260508150055.680136-5-vernon2gm@gmail.com>
X-Mailer: git-send-email 2.53.0
In-Reply-To: <20260508150055.680136-1-vernon2gm@gmail.com>
References: <20260508150055.680136-1-vernon2gm@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

From: Vernon Yang

Design mthp_ext case
to address real workload issues. The main functions of mthp_ext are as
follows:

- When a sub-cgroup is under high memory pressure (default: full, 100ms of
  stall within a 1s window), it automatically falls back to using 4KB pages.
- When the anon+shmem memory usage of a sub-cgroup falls below the minimum
  memory (default: 16MB), its small-memory processes automatically fall back
  to using 4KB pages.
- Under normal conditions, when there is no memory pressure and the
  anon+shmem memory usage exceeds the minimum memory, the kernel may use all
  mTHP sizes.
- The root cgroup directory (/sys/fs/cgroup) is monitored by default; any
  other cgroup directory can be specified instead.

Signed-off-by: Vernon Yang
---
 samples/bpf/.gitignore     |   1 +
 samples/bpf/Makefile       |   7 +-
 samples/bpf/mthp_ext.bpf.c | 148 ++++++++++++++++
 samples/bpf/mthp_ext.c     | 339 +++++++++++++++++++++++++++++++++++++
 samples/bpf/mthp_ext.h     |  30 ++++
 5 files changed, 524 insertions(+), 1 deletion(-)
 create mode 100644 samples/bpf/mthp_ext.bpf.c
 create mode 100644 samples/bpf/mthp_ext.c
 create mode 100644 samples/bpf/mthp_ext.h

diff --git a/samples/bpf/.gitignore b/samples/bpf/.gitignore
index 0002cd359fb1..2a73581876b4 100644
--- a/samples/bpf/.gitignore
+++ b/samples/bpf/.gitignore
@@ -49,3 +49,4 @@ iperf.*
 /vmlinux.h
 /bpftool/
 /libbpf/
+mthp_ext
diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 95a4fa1f1e44..357c7d1c45ef 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -37,6 +37,7 @@ tprogs-y += xdp_fwd
 tprogs-y += task_fd_query
 tprogs-y += ibumad
 tprogs-y += hbm
+tprogs-y += mthp_ext
 
 # Libbpf dependencies
 LIBBPF_SRC = $(TOOLS_PATH)/lib/bpf
@@ -122,6 +123,7 @@ always-y += task_fd_query_kern.o
 always-y += ibumad_kern.o
 always-y += hbm_out_kern.o
 always-y += hbm_edt_kern.o
+always-y += mthp_ext.bpf.o
 
 COMMON_CFLAGS = $(TPROGS_USER_CFLAGS)
 TPROGS_LDFLAGS = $(TPROGS_USER_LDFLAGS)
@@ -289,6 +291,8 @@ $(obj)/hbm_out_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
 $(obj)/hbm.o: $(src)/hbm.h
 $(obj)/hbm_edt_kern.o: $(src)/hbm.h $(src)/hbm_kern.h
 
+mthp_ext: $(obj)/mthp_ext.skel.h
+
 # Override includes for xdp_sample_user.o because $(srctree)/usr/include in
 # TPROGS_CFLAGS causes conflicts
 XDP_SAMPLE_CFLAGS += -Wall -O2 \
@@ -347,10 +351,11 @@ $(obj)/%.bpf.o: $(src)/%.bpf.c $(obj)/vmlinux.h $(src)/xdp_sample.bpf.h $(src)/x
 		-I$(LIBBPF_INCLUDE) $(CLANG_SYS_INCLUDES) \
 		-c $(filter %.bpf.c,$^) -o $@
 
-LINKED_SKELS := xdp_router_ipv4.skel.h
+LINKED_SKELS := xdp_router_ipv4.skel.h mthp_ext.skel.h
 clean-files += $(LINKED_SKELS)
 
 xdp_router_ipv4.skel.h-deps := xdp_router_ipv4.bpf.o xdp_sample.bpf.o
+mthp_ext.skel.h-deps := mthp_ext.bpf.o
 
 LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.bpf.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
 
diff --git a/samples/bpf/mthp_ext.bpf.c b/samples/bpf/mthp_ext.bpf.c
new file mode 100644
index 000000000000..3524dc45fda4
--- /dev/null
+++ b/samples/bpf/mthp_ext.bpf.c
@@ -0,0 +1,148 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+#include "mthp_ext.h"
+#include
+#include
+#include
+#include
+
+struct mem_info {
+	unsigned long long stall;
+	unsigned int order;
+};
+
+struct {
+	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
+	__uint(map_flags, BPF_F_NO_PREALLOC);
+	__type(key, int);
+	__type(value, struct mem_info);
+} cgrp_storage SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 256 * 1024);
+} events SEC(".maps");
+
+struct config_local configs;
+
+/*
+ * mthp_choose_impl - Choose the custom mTHP orders, read order from cgrp_storage,
+ * which is adjusted by cgroup_scan().
+ * @cgrp: control group
+ * @orders: original orders
+ *
+ * Return the suitable mTHP orders.
+ */
+SEC("struct_ops/mthp_choose")
+unsigned long BPF_PROG(mthp_choose_impl, struct cgroup *cgrp, unsigned long orders)
+{
+	struct mem_info *info;
+	unsigned int order;
+
+	if (configs.fixed) {
+		order = configs.init_order;
+		goto out;
+	}
+
+	info = bpf_cgrp_storage_get(&cgrp_storage, cgrp, 0, 0);
+	if (!info)
+		return orders;
+
+	order = info->order;
+out:
+	if (!order)
+		return 0;
+
+	orders &= BIT(order + 1) - 1;
+	return orders;
+}
+
+SEC(".struct_ops.link")
+struct bpf_mthp_ops mthp_ops = {
+	.mthp_choose = (void *)mthp_choose_impl,
+};
+
+/* Copied from kernel/cgroup/cgroup.c */
+static bool cgroup_has_tasks(struct cgroup *cgrp)
+{
+	return cgrp->nr_populated_csets;
+}
+
+/*
+ * cgroup_scan - scan all descendant cgroups under root cgroup.
+ *
+ * 1. When the memory usage of the sub-cgroup falls below the threshold,
+ *    it will automatically fall back to using 4KB size; otherwise, it will
+ *    use all mTHP sizes.
+ * 2. When memory.pressure stall time of the sub-cgroup exceeds the threshold,
+ *    it will automatically fall back to using 4KB size; otherwise, it will
+ *    use all mTHP sizes.
+ *
+ * Returning 1 terminates the iteration loop; returning 0 continues the
+ * iteration with the next sub-cgroup.
+ */
+SEC("iter.s/cgroup")
+int cgroup_scan(struct bpf_iter__cgroup *ctx)
+{
+	struct cgroup *cgrp = ctx->cgroup;
+	struct mem_cgroup *memcg;
+	struct mem_info *info;
+	struct alert_event *e;
+	unsigned long curr_mem;
+	unsigned long long curr_stall;
+	unsigned long long delta;
+
+	if (!cgrp)
+		return 1;
+
+	if (!cgroup_has_tasks(cgrp))
+		return 0;
+
+	info = bpf_cgrp_storage_get(&cgrp_storage, cgrp, 0,
+				    BPF_LOCAL_STORAGE_GET_F_CREATE);
+	if (!info)
+		return 0;
+
+	memcg = bpf_get_mem_cgroup(&cgrp->self);
+	if (!memcg)
+		return 0;
+
+	bpf_cgroup_flush_stats(cgrp);
+	curr_stall = bpf_cgroup_stall(cgrp, PSI_MEM_FULL);
+	if (!info->stall) {
+		info->order = configs.init_order;
+		goto UPDATE;
+	}
+	delta = curr_stall - info->stall;
+	bpf_mem_cgroup_flush_stats(memcg);
+	curr_mem = bpf_mem_cgroup_page_state(memcg, NR_ANON_MAPPED) +
+		   bpf_mem_cgroup_page_state(memcg, NR_SHMEM);
+	if ((curr_mem && curr_mem < FROM_MB(configs.min_mem)) ||
+	    delta >= configs.threshold)
+		info->order = 0;
+	else
+		info->order = PMD_ORDER;
+
+	if (configs.debug) {
+		e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
+		if (e) {
+			e->prev_stall = info->stall;
+			e->curr_stall = curr_stall;
+			e->delta = delta;
+			e->mem = curr_mem;
+			e->order = info->order;
+			bpf_probe_read_kernel_str(e->name, sizeof(e->name),
+						  cgrp->kn->name);
+			bpf_ringbuf_submit(e, 0);
+		}
+	}
+
+UPDATE:
+	info->stall = curr_stall;
+	bpf_put_mem_cgroup(memcg);
+
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "GPL";
diff --git a/samples/bpf/mthp_ext.c b/samples/bpf/mthp_ext.c
new file mode 100644
index 000000000000..120c331ff26a
--- /dev/null
+++ b/samples/bpf/mthp_ext.c
@@ -0,0 +1,339 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "mthp_ext.h"
+#include "mthp_ext.skel.h"
+
+#define DEFAULT_ROOT "/sys/fs/cgroup"
+#define DEFAULT_THRESHOLD_MS 100UL
+#define DEFAULT_INTERVAL_MS 1000UL
+#define DEFAULT_ORDER PMD_ORDER
+#define DEFAULT_MIN_MEM 16
+
+static bool exiting;
+
+static void usage(const char *name)
+{
+	fprintf(stderr,
+		"Usage: %s [OPTIONS]\n\n"
+		"Monitor specified cgroup, adjust mTHP size via cgroup_bpf.\n\n"
+		"Currently supports fixed mTHP size and automatic mTHP size adjustment.\n"
+		"By default, it monitors the entire cgroup and automatically\n"
+		"adjusts mTHP size within the specified time window.\n"
+		"1. When the memory size of the sub-cgroup falls below\n"
+		" the threshold, it will automatically fall back to\n"
+		" using 4KB size; otherwise, it will use all mTHP sizes.\n"
+		"2. When memory.pressure stall time of the sub-cgroup exceeds\n"
+		" the threshold, it will automatically fall back to using 4KB\n"
+		" size; otherwise, it will use all mTHP sizes.\n\n"
+		"Options:\n"
+		" -r, --root=PATH Root cgroup path (default: /sys/fs/cgroup)\n"
+		" -t, --threshold=MS threshold in ms (default: %lu)\n"
+		" -i, --interval=MS interval in ms (default: %lu)\n"
+		" -o, --order=NR Initial mthp order (default: %d)\n"
+		" -m, --min=MB Minimum memory size for mTHP (default: %d)\n"
+		" -f, --fixed Use fixed order, disable auto-adjustment\n"
+		" -d, --debug Enable debug output\n"
+		" -h, --help Show this help\n",
+		name, DEFAULT_THRESHOLD_MS, DEFAULT_INTERVAL_MS, DEFAULT_ORDER,
+		DEFAULT_MIN_MEM);
+}
+
+static void sig_handler(int sig)
+{
+	exiting = true;
+}
+
+static int setup_psi_trigger(const char *cgroup_path, const char *type,
+			     unsigned long stall_us, unsigned long window_us)
+{
+	char path[PATH_MAX];
+	char trigger[128];
+	int fd, nr;
+
+	snprintf(path, sizeof(path), "%s/memory.pressure", cgroup_path);
+	fd = open(path, O_RDWR | O_NONBLOCK);
+	if (fd < 0) {
+		fprintf(stderr, "ERROR: open PSI file failed\n");
+		return -errno;
+	}
+
+	nr = snprintf(trigger, sizeof(trigger), "%s %lu %lu",
+		      type, stall_us, window_us);
+	if (write(fd, trigger, nr) < 0) {
+		fprintf(stderr, "ERROR: write PSI trigger failed\n");
+		close(fd);
+		return -errno;
+	}
+
+	return fd;
+}
+
+static int trigger_scan(struct bpf_link *iter_link)
+{
+	char buf[256];
+	int fd;
+
+	fd = bpf_iter_create(bpf_link__fd(iter_link));
+	if (fd < 0) {
+		fprintf(stderr, "ERROR: bpf_iter_create failed: %s\n",
+			strerror(errno));
+		return -1;
+	}
+
+	/* Read to trigger the iter program execution */
+	while (read(fd, buf, sizeof(buf)) > 0)
+		;
+
+	close(fd);
+	return 0;
+}
+
+static void *monitor_thread(int psi_fd, struct config_local *configs,
+			    struct bpf_link *iter_link, struct ring_buffer *rb)
+{
+	struct epoll_event e;
+	int epoll_fd;
+	int nfds;
+
+	epoll_fd = epoll_create1(0);
+	if (epoll_fd < 0) {
+		fprintf(stderr, "ERROR: epoll_create1 failed\n");
+		return NULL;
+	}
+
+	e.events = EPOLLPRI;
+	e.data.fd = psi_fd;
+	if (epoll_ctl(epoll_fd, EPOLL_CTL_ADD, psi_fd, &e)) {
+		fprintf(stderr, "ERROR: epoll_ctl failed\n");
+		goto CLOSE;
+	}
+
+	/* First initialization */
+	trigger_scan(iter_link);
+
+	/* Auto adjustment */
+	while (!exiting) {
+		nfds = epoll_wait(epoll_fd, &e, 1, configs->interval * 2);
+		trigger_scan(iter_link);
+
+		if (configs->debug) {
+			printf("PSI: memory pressure %s\n", nfds ? "high" : "low");
+			ring_buffer__poll(rb, 0);
+		}
+	}
+
+CLOSE:
+	close(epoll_fd);
+	return NULL;
+}
+
+static int handle_event(void *ctx, void *data, size_t len)
+{
+	struct alert_event *e = data;
+
+	printf("cgroup %s: stall %llu -> %llu (+%llu), mem %luMB, mthp order=%d\n",
+	       e->name[0] ? e->name : "/",
+	       e->prev_stall, e->curr_stall, e->delta, TO_MB(e->mem), e->order);
+
+	return 0;
+}
+
+int main(int argc, char **argv)
+{
+	const char *root_path = DEFAULT_ROOT;
+	unsigned long threshold = DEFAULT_THRESHOLD_MS;
+	unsigned long interval = DEFAULT_INTERVAL_MS;
+	unsigned int init_order = DEFAULT_ORDER;
+	unsigned int min_mem = DEFAULT_MIN_MEM;
+	bool fixed = false;
+	bool debug = false;
+	struct mthp_ext *skel;
+	struct bpf_link *iter_link;
+	struct bpf_link *ops_link;
+	struct ring_buffer *rb;
+	int root_fd;
+	int psi_fd;
+	int err = 0;
+	int opt;
+
+	static struct option long_options[] = {
+		{"root", required_argument, 0, 'r'},
+		{"threshold", required_argument, 0, 't'},
+		{"interval", required_argument, 0, 'i'},
+		{"order", required_argument, 0, 'o'},
+		{"min", required_argument, 0, 'm'},
+		{"fixed", no_argument, 0, 'f'},
+		{"debug", no_argument, 0, 'd'},
+		{"help", no_argument, 0, 'h'},
+		{0, 0, 0, 0}
+	};
+
+	while ((opt = getopt_long(argc, argv, "r:t:i:o:m:fdh",
+				  long_options, NULL)) != -1) {
+		switch (opt) {
+		case 'r':
+			root_path = optarg;
+			break;
+		case 't':
+			threshold = strtoul(optarg, NULL, 10);
+			break;
+		case 'i':
+			interval = strtoul(optarg, NULL, 10);
+			break;
+		case 'o':
+			init_order = min(strtoul(optarg, NULL, 10), PMD_ORDER);
+			break;
+		case 'm':
+			min_mem = strtoul(optarg, NULL, 10);
+			break;
+		case 'f':
+			fixed = true;
+			break;
+		case 'd':
+			debug = true;
+			break;
+		case 'h':
+			usage(argv[0]);
+			return 0;
+		default:
+			usage(argv[0]);
+			return -EINVAL;
+		}
+	}
+
+	if (!threshold || !interval) {
+		fprintf(stderr, "ERROR: threshold and interval must be > 0\n");
+		usage(argv[0]);
+		return -EINVAL;
+	}
+
+	signal(SIGINT, sig_handler);
+	signal(SIGTERM, sig_handler);
+
+	root_fd = open(root_path, O_RDONLY);
+	if (root_fd < 0) {
+		fprintf(stderr, "ERROR: open '%s' failed: %s\n",
+			root_path, strerror(errno));
+		return -errno;
+	}
+
+	skel = mthp_ext__open();
+	if (!skel) {
+		fprintf(stderr, "ERROR: failed to open BPF skeleton\n");
+		err = -ENOMEM;
+		goto open_skel_fail;
+	}
+
+	skel->bss->configs.threshold = threshold;
+	skel->bss->configs.interval = interval;
+	skel->bss->configs.init_order = init_order;
+	skel->bss->configs.min_mem = min_mem;
+	skel->bss->configs.fixed = fixed;
+	skel->bss->configs.debug = debug;
+
+	err = mthp_ext__load(skel);
+	if (err) {
+		fprintf(stderr, "ERROR: failed to load BPF program: %d\n", err);
+		goto load_skel_fail;
+	}
+
+	/* Attach struct_ops to root cgroup for mthp_choose */
+	DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+	opts.flags = BPF_F_CGROUP_FD;
+	opts.target_fd = root_fd;
+	ops_link = bpf_map__attach_struct_ops_opts(skel->maps.mthp_ops, &opts);
+	err = libbpf_get_error(ops_link);
+	if (err) {
+		fprintf(stderr, "ERROR: attach struct_ops failed: %d\n", err);
+		ops_link = NULL;
+		goto attach_opts_fail;
+	}
+
+	printf("Monitoring : %s\n"
+	       "threshold : %lums\n"
+	       "Interval : %lums\n"
+	       "Initial order : %d%s\n"
+	       "min memory : %dMB\n"
+	       "Debug : %s\n"
+	       "Press Ctrl+C to exit.\n\n",
+	       root_path, threshold, interval, init_order,
+	       fixed ? " (fixed)" : " (auto)", min_mem,
+	       debug ? "on" : "off");
+
+	if (fixed) {
+		while (!exiting)
+			usleep(interval * 1000);
+		goto exit_fixed;
+	}
+
+	/* Auto adjustment, attach cgroup iter for scanning root + descendants */
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, iter_opts);
+	union bpf_iter_link_info linfo = {
+		.cgroup.cgroup_fd = root_fd,
+		.cgroup.order = BPF_CGROUP_ITER_DESCENDANTS_PRE,
+	};
+	iter_opts.link_info = &linfo;
+	iter_opts.link_info_len = sizeof(linfo);
+	iter_link = bpf_program__attach_iter(skel->progs.cgroup_scan, &iter_opts);
+	err = libbpf_get_error(iter_link);
+	if (err) {
+		fprintf(stderr, "ERROR: attach cgroup iter failed: %d\n", err);
+		iter_link = NULL;
+		goto attach_iter_fail;
+	}
+
+	/* Set up ring buffer for receiving alerts */
+	rb = ring_buffer__new(bpf_map__fd(skel->maps.events),
+			      handle_event, NULL, NULL);
+	if (!rb) {
+		fprintf(stderr, "ERROR: failed to create ring buffer\n");
+		err = -ENOMEM;
+		goto rb_fail;
+	}
+
+
+	psi_fd = setup_psi_trigger(root_path, "some", threshold * 1000,
+				   interval * 1000);
+	if (psi_fd < 0) {
+		fprintf(stderr, "ERROR: PSI trigger setup failed\n");
+		err = -EINVAL;
+		goto psi_setup_fail;
+	}
+
+	monitor_thread(psi_fd, &skel->bss->configs, iter_link, rb);
+
+	close(psi_fd);
+psi_setup_fail:
+	ring_buffer__free(rb);
+rb_fail:
+	bpf_link__destroy(iter_link);
+exit_fixed:
+attach_iter_fail:
+	bpf_link__destroy(ops_link);
+attach_opts_fail:
+load_skel_fail:
+	mthp_ext__destroy(skel);
+open_skel_fail:
+	close(root_fd);
+
+	printf("\nExiting...\n");
+
+	return err;
+}
diff --git a/samples/bpf/mthp_ext.h b/samples/bpf/mthp_ext.h
new file mode 100644
index 000000000000..e29d80aa15bf
--- /dev/null
+++ b/samples/bpf/mthp_ext.h
@@ -0,0 +1,30 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __MTHP_EXT_H__
+#define __MTHP_EXT_H__
+
+#define CGROUP_NAME_LEN 128
+#define PMD_ORDER 9
+#define min(a, b) ((a) < (b) ? (a) : (b))
+#define FROM_MB(s) ((s) * 1024UL * 1024UL)
+#define TO_MB(s) ((s) / 1024UL / 1024UL)
+
+struct config_local {
+	unsigned long threshold;
+	unsigned long interval;
+	unsigned int init_order;
+	unsigned int min_mem;
+	bool fixed;
+	bool debug;
+};
+
+struct alert_event {
+	unsigned long long prev_stall;
+	unsigned long long curr_stall;
+	unsigned long long delta;
+	unsigned long mem;
+	unsigned int order;
+	char name[CGROUP_NAME_LEN];
+};
+
+#endif /* __MTHP_EXT_H__ */
-- 
2.53.0
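
As a reading aid, the order selection done by mthp_choose_impl() and
cgroup_scan() in this patch can be sketched in plain userspace C. This is
only an illustration under stated assumptions, not code from the series:
clamp_orders() and pick_order() are hypothetical names, memory usage is
assumed to be expressed in bytes, and the stall delta and the threshold are
assumed to share one unit.

```c
#include <assert.h>
#include <stdbool.h>

#define PMD_ORDER 9	/* same value as in mthp_ext.h */

/*
 * Hypothetical helper: keep only orders 0..max_order, mirroring
 * "orders &= BIT(order + 1) - 1" in mthp_choose_impl().
 */
unsigned long clamp_orders(unsigned long orders, unsigned int max_order)
{
	assert(max_order <= PMD_ORDER);
	if (!max_order)
		return 0;	/* order 0: fall back to 4KB pages */
	return orders & ((1UL << (max_order + 1)) - 1);
}

/*
 * Hypothetical helper: the per-cgroup decision made in cgroup_scan().
 * Assumes mem_bytes is in bytes and that stall_delta and threshold
 * use the same unit.
 */
unsigned int pick_order(unsigned long mem_bytes, unsigned long min_mem_mb,
			unsigned long long stall_delta,
			unsigned long long threshold)
{
	/* A usage of zero is not treated as "small", as in the patch. */
	bool small = mem_bytes && mem_bytes < min_mem_mb * 1024UL * 1024UL;
	bool pressured = stall_delta >= threshold;

	return (small || pressured) ? 0 : PMD_ORDER;
}
```

For example, clamp_orders(0x3ff, 4) keeps orders 0 through 4 (0x1f), and
pick_order() returns 0 for a cgroup using 8MB when the minimum is 16MB.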