From nobody Sat Apr 11 18:38:01 2026
From: Chengming Zhou
To: hannes@cmpxchg.org, tj@kernel.org, corbet@lwn.net, surenb@google.com,
    mingo@redhat.com, peterz@infradead.org, vincent.guittot@linaro.org,
    dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, songmuchun@bytedance.com, Chengming Zhou
Subject: [PATCH v2 01/10] sched/psi: fix periodic aggregation shut off
Date: Mon, 8 Aug 2022 19:03:32 +0800
Message-Id: <20220808110341.15799-2-zhouchengming@bytedance.com>
In-Reply-To: <20220808110341.15799-1-zhouchengming@bytedance.com>
References: <20220808110341.15799-1-zhouchengming@bytedance.com>

We don't want to wake periodic aggregation work back up if the task
change is the aggregation worker itself going to sleep, or we'll
ping-pong forever.
Previously, we used psi_task_change() in psi_dequeue() when a task was
going to sleep, so this check was put in psi_task_change(). But commit
4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred task sleep handling to psi_task_switch(), which no longer goes
through psi_task_change(). So move this check to psi_task_switch().

Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 kernel/sched/psi.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index a337f3e35997..115a7e52fa23 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -800,7 +800,6 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
-	bool wake_clock = true;
 	void *iter = NULL;
 	u64 now;
 
@@ -810,19 +809,9 @@ void psi_task_change(struct task_struct *task, int clear, int set)
 	psi_flags_change(task, clear, set);
 
 	now = cpu_clock(cpu);
-	/*
-	 * Periodic aggregation shuts off if there is a period of no
-	 * task changes, so we wake it back up if necessary. However,
-	 * don't do this if the task change is the aggregation worker
-	 * itself going to sleep, or we'll ping-pong forever.
-	 */
-	if (unlikely((clear & TSK_RUNNING) &&
-		     (task->flags & PF_WQ_WORKER) &&
-		     wq_worker_last_func(task) == psi_avgs_work))
-		wake_clock = false;
 
 	while ((group = iterate_groups(task, &iter)))
-		psi_group_change(group, cpu, clear, set, now, wake_clock);
+		psi_group_change(group, cpu, clear, set, now, true);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -858,6 +847,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
 	if (prev->pid) {
 		int clear = TSK_ONCPU, set = 0;
+		bool wake_clock = true;
 
 		/*
 		 * When we're going to sleep, psi_dequeue() lets us
@@ -871,13 +861,23 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 			clear |= TSK_MEMSTALL_RUNNING;
 		if (prev->in_iowait)
 			set |= TSK_IOWAIT;
+
+		/*
+		 * Periodic aggregation shuts off if there is a period of no
+		 * task changes, so we wake it back up if necessary. However,
+		 * don't do this if the task change is the aggregation worker
+		 * itself going to sleep, or we'll ping-pong forever.
+		 */
+		if (unlikely((prev->flags & PF_WQ_WORKER) &&
+			     wq_worker_last_func(prev) == psi_avgs_work))
+			wake_clock = false;
 	}
 
 	psi_flags_change(prev, clear, set);
 
 	iter = NULL;
 	while ((group = iterate_groups(prev, &iter)) && group != common)
-		psi_group_change(group, cpu, clear, set, now, true);
+		psi_group_change(group, cpu, clear, set, now, wake_clock);
 
 	/*
 	 * TSK_ONCPU is handled up to the common ancestor.
 	 * If we're tasked
@@ -886,7 +886,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	if (sleep) {
 		clear &= ~TSK_ONCPU;
 		for (; group; group = iterate_groups(prev, &iter))
-			psi_group_change(group, cpu, clear, set, now, true);
+			psi_group_change(group, cpu, clear, set, now, wake_clock);
 	}
 }
-- 
2.36.1
From nobody Sat Apr 11 18:38:01 2026
From: Chengming Zhou
To: hannes@cmpxchg.org, tj@kernel.org, corbet@lwn.net, surenb@google.com,
    mingo@redhat.com, peterz@infradead.org, vincent.guittot@linaro.org,
    dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, songmuchun@bytedance.com, Chengming Zhou
Subject: [PATCH v2 02/10] sched/psi: optimize task switch inside shared cgroups again
Date: Mon, 8 Aug 2022 19:03:33 +0800
Message-Id: <20220808110341.15799-3-zhouchengming@bytedance.com>
In-Reply-To: <20220808110341.15799-1-zhouchengming@bytedance.com>
References: <20220808110341.15799-1-zhouchengming@bytedance.com>
Commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
deferred prev task sleep handling to psi_task_switch(), so we don't
need to clear and set TSK_ONCPU state for common cgroups.

         A
         |
         B
        / \
       C   D
      /     \
    prev   next

After that commit, psi_task_switch() does:
1. psi_group_change(next, .set=TSK_ONCPU) for D
2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C
3. psi_group_change(prev, .clear=TSK_RUNNING) for B, A

But there is a limitation: when "prev->psi_flags == next->psi_flags" is
not satisfied, the shared-cgroup optimization is disabled for both the
sleep switch and the running switch case. For example:

prev->in_memstall != next->in_memstall when sleep switch:
1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C, B, A

prev->in_memstall != next->in_memstall when running switch:
1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A
2. psi_group_change(prev, .clear=TSK_ONCPU) for C, B, A

The reason this limitation exists is that we consider a group to be in
PSI_MEM_FULL if the CPU is actively reclaiming and nothing productive
could run even if it were runnable. So when the CPU's curr changes from
prev to next and their in_memstall status differs, we have to change
the PSI_MEM_FULL status of their common cgroups.

This patch removes the limitation by making psi_group_change() derive
the PSI_MEM_FULL status from the CPU curr->in_memstall status.
Signed-off-by: Chengming Zhou
---
 kernel/sched/psi.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 115a7e52fa23..9e8c5d9e585c 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -823,8 +823,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	u64 now = cpu_clock(cpu);
 
 	if (next->pid) {
-		bool identical_state;
-
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
 		 * When switching between tasks that have an identical
@@ -832,11 +830,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		 * we reach the first common ancestor. Iterate @next's
 		 * ancestors only until we encounter @prev's ONCPU.
 		 */
-		identical_state = prev->psi_flags == next->psi_flags;
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (identical_state &&
-			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
 			}
@@ -883,7 +879,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
 	 * with dequeuing too, finish that for the rest of the hierarchy.
 	 */
-	if (sleep) {
+	if (sleep || unlikely(prev->in_memstall != next->in_memstall)) {
 		clear &= ~TSK_ONCPU;
 		for (; group; group = iterate_groups(prev, &iter))
 			psi_group_change(group, cpu, clear, set, now, wake_clock);
-- 
2.36.1
From nobody Sat Apr 11 18:38:01 2026
From: Chengming Zhou
To: hannes@cmpxchg.org, tj@kernel.org, corbet@lwn.net, surenb@google.com,
    mingo@redhat.com, peterz@infradead.org, vincent.guittot@linaro.org,
    dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, songmuchun@bytedance.com, Chengming Zhou
Subject: [PATCH v2 03/10] sched/psi: move private helpers to sched/stats.h
Date: Mon, 8 Aug 2022 19:03:34 +0800
Message-Id: <20220808110341.15799-4-zhouchengming@bytedance.com>
In-Reply-To: <20220808110341.15799-1-zhouchengming@bytedance.com>
References: <20220808110341.15799-1-zhouchengming@bytedance.com>

This patch moves the
psi_task_change()/psi_task_switch() declarations out of the public PSI
header, since they are only needed to implement the PSI stats tracking
in sched/stats.h.

psi_task_switch() is obviously private; psi_task_change() can't be a
public helper either, since it doesn't check the psi_disabled static
key. And it has no other users now, so put it in sched/stats.h too.

Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 include/linux/psi.h  | 4 ----
 kernel/sched/stats.h | 4 ++++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index 89784763d19e..aa168a038242 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -18,10 +18,6 @@ extern struct psi_group psi_system;
 
 void psi_init(void);
 
-void psi_task_change(struct task_struct *task, int clear, int set);
-void psi_task_switch(struct task_struct *prev, struct task_struct *next,
-		     bool sleep);
-
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index baa839c1ba96..c39b467ece43 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -107,6 +107,10 @@ __schedstats_from_se(struct sched_entity *se)
 }
 
 #ifdef CONFIG_PSI
+void psi_task_change(struct task_struct *task, int clear, int set);
+void psi_task_switch(struct task_struct *prev, struct task_struct *next,
+		     bool sleep);
+
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
  * memory stalls.
  * As a result, it has to distinguish between sleeps,
-- 
2.36.1
From nobody Sat Apr 11 18:38:01 2026
From: Chengming Zhou
To: hannes@cmpxchg.org, tj@kernel.org, corbet@lwn.net, surenb@google.com,
    mingo@redhat.com, peterz@infradead.org, vincent.guittot@linaro.org,
    dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, songmuchun@bytedance.com, Chengming Zhou
Subject: [PATCH v2 04/10] sched/psi: don't change task psi_flags when migrate CPU/group
Date: Mon, 8 Aug 2022 19:03:35 +0800
Message-Id: <20220808110341.15799-5-zhouchengming@bytedance.com>
In-Reply-To: <20220808110341.15799-1-zhouchengming@bytedance.com>
References: <20220808110341.15799-1-zhouchengming@bytedance.com>

The current code uses psi_task_change() at every scheduling point,
which changes the task's psi_flags and then updates all of its
psi_groups. So we have to rely heavily on the task's scheduling state
to calculate what to set and what to clear at every scheduling point,
which makes the PSI stats tracking code complex and error prone.

In fact, a task's psi_flags only change at wakeup and sleep (apart
from the ONCPU state at switch); they don't change at all when the
task migrates between CPUs or cgroups. If we keep psi_flags unchanged
across migration, we can simply use task->psi_flags to clear (migrate
out) or set (migrate in), which makes PSI stats tracking much simpler
and more efficient.

Note: ENQUEUE_WAKEUP only means waking a task from sleep; it doesn't
cover waking up a new task, so add psi_enqueue() in wake_up_new_task().

Performance test on Intel Xeon Platinum with 3 levels of cgroup:

1. before the patch:

$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.034 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes

     Total time: 8.210 [sec]

       8.210600 usecs/op
         121793 ops/sec

2. after the patch:

$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.032 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes

     Total time: 8.077 [sec]

       8.077648 usecs/op
         123798 ops/sec

Signed-off-by: Chengming Zhou
---
 include/linux/sched.h |  3 ---
 kernel/sched/core.c   |  1 +
 kernel/sched/psi.c    | 24 ++++++++++---------
 kernel/sched/stats.h  | 54 +++++++++++++++++++++----------------------
 4 files changed, 40 insertions(+), 42 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 88b8817b827d..20a94786cad8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -879,9 +879,6 @@ struct task_struct {
 	unsigned			sched_reset_on_fork:1;
 	unsigned			sched_contributes_to_load:1;
 	unsigned			sched_migrated:1;
-#ifdef CONFIG_PSI
-	unsigned			sched_psi_wake_requeue:1;
-#endif
 
 	/* Force alignment to the next boundary: */
 	unsigned			:0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 64c08993221b..3aa401689f7e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4642,6 +4642,7 @@ void wake_up_new_task(struct task_struct *p)
 	post_init_entity_util_avg(p);
 
 	activate_task(rq, p, ENQUEUE_NOCLOCK);
+	psi_enqueue(p, true);
 	trace_sched_wakeup_new(p);
 	check_preempt_curr(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 9e8c5d9e585c..974471f212a3 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -796,22 +796,24 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
 	task->psi_flags |= set;
 }
 
-void psi_task_change(struct task_struct *task, int clear, int set)
+void psi_change_groups(struct task_struct *task, int clear, int set)
 {
 	int cpu = task_cpu(task);
 	struct psi_group *group;
 	void *iter = NULL;
-	u64 now;
+	u64 now = cpu_clock(cpu);
+
+	while ((group = iterate_groups(task, &iter)))
+		psi_group_change(group, cpu, clear, set, now, true);
+}
 
+void psi_task_change(struct task_struct *task, int clear, int set)
+{
 	if (!task->pid)
 		return;
 
 	psi_flags_change(task, clear, set);
-
-	now = cpu_clock(cpu);
-
-	while ((group = iterate_groups(task, &iter)))
-		psi_group_change(group, cpu, clear, set, now, true);
+	psi_change_groups(task, clear, set);
 }
 
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
@@ -1015,9 +1017,9 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	 *   pick_next_task()
 	 *     rq_unlock()
 	 *                                rq_lock()
-	 *                                psi_task_change() // old cgroup
+	 *                                psi_change_groups() // old cgroup
 	 *                                task->cgroups = to
-	 *                                psi_task_change() // new cgroup
+	 *                                psi_change_groups() // new cgroup
 	 *                                rq_unlock()
 	 *     rq_lock()
 	 *   psi_sched_switch() // does deferred updates in new cgroup
@@ -1027,13 +1029,13 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	task_flags = task->psi_flags;
 
 	if (task_flags)
-		psi_task_change(task, task_flags, 0);
+		psi_change_groups(task, task_flags, 0);
 
 	/* See comment above */
 	rcu_assign_pointer(task->cgroups, to);
 
 	if (task_flags)
-		psi_task_change(task, 0, task_flags);
+		psi_change_groups(task, 0, task_flags);
 
 	task_rq_unlock(rq, task, &rf);
 }
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index c39b467ece43..e930b8fa6253 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -107,6 +107,7 @@ __schedstats_from_se(struct sched_entity *se)
 }
 
 #ifdef CONFIG_PSI
+void psi_change_groups(struct task_struct *task, int clear, int set);
 void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
@@ -124,42 +125,46 @@ static inline void psi_enqueue(struct task_struct *p, bool wakeup)
 	if (static_branch_likely(&psi_disabled))
 		return;
 
-	if (p->in_memstall)
-		set |= TSK_MEMSTALL_RUNNING;
+	if (!wakeup) {
+		if (p->psi_flags)
+			psi_change_groups(p, 0, p->psi_flags);
+		return;
+	}
 
-	if (!wakeup || p->sched_psi_wake_requeue) {
-		if (p->in_memstall)
+	/*
+	 * wakeup (including wakeup migrate) need to change task psi_flags,
+	 * specifically need to set TSK_RUNNING or
+	 * TSK_MEMSTALL_RUNNING.
+	 * Since we clear task->psi_flags for wakeup migrated task, we need
+	 * to check task->psi_flags to see what should be set and clear.
+	 */
+	if (unlikely(p->in_memstall)) {
+		set |= TSK_MEMSTALL_RUNNING;
+		if (!(p->psi_flags & TSK_MEMSTALL))
 			set |= TSK_MEMSTALL;
-		if (p->sched_psi_wake_requeue)
-			p->sched_psi_wake_requeue = 0;
-	} else {
-		if (p->in_iowait)
-			clear |= TSK_IOWAIT;
 	}
+	if (p->psi_flags & TSK_IOWAIT)
+		clear |= TSK_IOWAIT;
 
 	psi_task_change(p, clear, set);
 }
 
 static inline void psi_dequeue(struct task_struct *p, bool sleep)
 {
-	int clear = TSK_RUNNING;
-
 	if (static_branch_likely(&psi_disabled))
 		return;
 
+	if (!sleep) {
+		if (p->psi_flags)
+			psi_change_groups(p, p->psi_flags, 0);
+		return;
+	}
+
 	/*
 	 * A voluntary sleep is a dequeue followed by a task switch. To
 	 * avoid walking all ancestors twice, psi_task_switch() handles
 	 * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
 	 * Do nothing here.
 	 */
-	if (sleep)
-		return;
-
-	if (p->in_memstall)
-		clear |= (TSK_MEMSTALL | TSK_MEMSTALL_RUNNING);
-
-	psi_task_change(p, clear, 0);
 }
 
 static inline void psi_ttwu_dequeue(struct task_struct *p)
@@ -169,21 +174,14 @@ static inline void psi_ttwu_dequeue(struct task_struct *p)
 	/*
 	 * Is the task being migrated during a wakeup? Make sure to
 	 * deregister its sleep-persistent psi states from the old
-	 * queue, and let psi_enqueue() know it has to requeue.
+	 * queue.
 	 */
-	if (unlikely(p->in_iowait || p->in_memstall)) {
+	if (unlikely(p->psi_flags)) {
 		struct rq_flags rf;
 		struct rq *rq;
-		int clear = 0;
-
-		if (p->in_iowait)
-			clear |= TSK_IOWAIT;
-		if (p->in_memstall)
-			clear |= TSK_MEMSTALL;
 
 		rq = __task_rq_lock(p, &rf);
-		psi_task_change(p, clear, 0);
-		p->sched_psi_wake_requeue = 1;
+		psi_task_change(p, p->psi_flags, 0);
 		__task_rq_unlock(rq, &rf);
 	}
 }
-- 
2.36.1
From nobody Sat Apr 11 18:38:01 2026
From: Chengming Zhou
To: hannes@cmpxchg.org, tj@kernel.org, corbet@lwn.net, surenb@google.com,
    mingo@redhat.com, peterz@infradead.org, vincent.guittot@linaro.org,
    dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-kernel@vger.kernel.org, songmuchun@bytedance.com, Chengming Zhou
Subject: [PATCH v2 05/10] sched/psi: don't create cgroup PSI files when psi_disabled
Date: Mon, 8 Aug 2022 19:03:36 +0800
Message-Id: <20220808110341.15799-6-zhouchengming@bytedance.com>
In-Reply-To: <20220808110341.15799-1-zhouchengming@bytedance.com>
References:

commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") allows PSI to be configured to skip per-cgroup stall accounting, in which case it does not expose the PSI files in the cgroup hierarchy.

This patch does the same thing when psi_disabled.

Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 kernel/cgroup/cgroup.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 13c8e91d7862..5f88117fc81e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3709,6 +3709,9 @@ static void cgroup_pressure_release(struct kernfs_open_file *of)
 
 bool cgroup_psi_enabled(void)
 {
+	if (static_branch_likely(&psi_disabled))
+		return false;
+
 	return (cgroup_feature_disable_mask & (1 << OPT_FEATURE_PRESSURE)) == 0;
 }
 
-- 
2.36.1
From: Chengming Zhou
Subject: [PATCH v2 06/10] sched/psi: save percpu memory when !psi_cgroups_enabled
Date: Mon, 8 Aug 2022 19:03:37 +0800
Message-Id: <20220808110341.15799-7-zhouchengming@bytedance.com>

We won't use the cgroup psi_group when !psi_cgroups_enabled, so don't bother allocating and initializing percpu memory for it. We then also don't need to migrate task PSI stats between cgroups in cgroup_move_task().

Signed-off-by: Chengming Zhou
Acked-by: Johannes Weiner
---
 kernel/sched/psi.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 974471f212a3..595a6c8230b7 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -205,6 +205,7 @@ void __init psi_init(void)
 {
 	if (!psi_enable) {
 		static_branch_enable(&psi_disabled);
+		static_branch_disable(&psi_cgroups_enabled);
 		return;
 	}
 
@@ -952,7 +953,7 @@ void psi_memstall_leave(unsigned long *flags)
 #ifdef CONFIG_CGROUPS
 int psi_cgroup_alloc(struct cgroup *cgroup)
 {
-	if (static_branch_likely(&psi_disabled))
+	if (!static_branch_likely(&psi_cgroups_enabled))
 		return 0;
 
 	cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
@@ -964,7 +965,7 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
 
 void psi_cgroup_free(struct cgroup *cgroup)
 {
-	if (static_branch_likely(&psi_disabled))
+	if (!static_branch_likely(&psi_cgroups_enabled))
 		return;
 
 	cancel_delayed_work_sync(&cgroup->psi.avgs_work);
@@ -991,7 +992,7 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 	struct rq_flags rf;
 	struct rq *rq;
 
-	if (static_branch_likely(&psi_disabled)) {
+	if (!static_branch_likely(&psi_cgroups_enabled)) {
 		/*
 		 * Lame to do this here, but the scheduler cannot be locked
 		 * from the outside, so we move cgroups from inside sched/.
-- 
2.36.1
From: Chengming Zhou
Subject: [PATCH v2 07/10] sched/psi: remove NR_ONCPU task accounting
Date: Mon, 8 Aug 2022 19:03:38 +0800
Message-Id: <20220808110341.15799-8-zhouchengming@bytedance.com>

From: Johannes Weiner

We put all fields updated by the scheduler in the first cacheline of struct psi_group_cpu for performance. Since we want to add another state, PSI_IRQ_FULL, to track IRQ/SOFTIRQ pressure, we need to reclaim space first.

This patch removes the NR_ONCPU task accounting in struct psi_group_cpu and uses one bit in state_mask to track it instead.

Signed-off-by: Johannes Weiner
Signed-off-by: Chengming Zhou
Reviewed-by: Chengming Zhou
Tested-by: Chengming Zhou
---
 include/linux/psi_types.h | 16 +++++++---------
 kernel/sched/psi.c        | 36 ++++++++++++++++++++++++++++--------
 2 files changed, 35 insertions(+), 17 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index c7fe7c089718..54cb74946db4 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -15,13 +15,6 @@ enum psi_task_count {
 	NR_IOWAIT,
 	NR_MEMSTALL,
 	NR_RUNNING,
-	/*
-	 * This can't have values other than 0 or 1 and could be
-	 * implemented as a bit flag. But for now we still have room
-	 * in the first cacheline of psi_group_cpu, and this way we
-	 * don't have to special case any state tracking for it.
-	 */
-	NR_ONCPU,
 	/*
 	 * For IO and CPU stalls the presence of running/oncpu tasks
 	 * in the domain means a partial rather than a full stall.
@@ -32,16 +25,18 @@ enum psi_task_count {
 	 * threads and memstall ones.
 	 */
 	NR_MEMSTALL_RUNNING,
-	NR_PSI_TASK_COUNTS = 5,
+	NR_PSI_TASK_COUNTS = 4,
 };
 
 /* Task state bitmasks */
 #define TSK_IOWAIT	(1 << NR_IOWAIT)
 #define TSK_MEMSTALL	(1 << NR_MEMSTALL)
 #define TSK_RUNNING	(1 << NR_RUNNING)
-#define TSK_ONCPU	(1 << NR_ONCPU)
 #define TSK_MEMSTALL_RUNNING	(1 << NR_MEMSTALL_RUNNING)
 
+/* Only one task can be scheduled, no corresponding task count */
+#define TSK_ONCPU	(1 << NR_PSI_TASK_COUNTS)
+
 /* Resources that workloads could be stalled on */
 enum psi_res {
 	PSI_IO,
@@ -68,6 +63,9 @@ enum psi_states {
 	NR_PSI_STATES = 7,
 };
 
+/* Use one bit in the state mask to track TSK_ONCPU */
+#define PSI_ONCPU	(1 << NR_PSI_STATES)
+
 enum psi_aggregators {
 	PSI_AVGS = 0,
 	PSI_POLL,
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 595a6c8230b7..1c675715ed33 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -216,7 +216,7 @@ void __init psi_init(void)
 	group_init(&psi_system);
 }
 
-static bool test_state(unsigned int *tasks, enum psi_states state)
+static bool test_state(unsigned int *tasks, enum psi_states state, bool oncpu)
 {
 	switch (state) {
 	case PSI_IO_SOME:
@@ -229,9 +229,9 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
 		return unlikely(tasks[NR_MEMSTALL] &&
 			tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);
 	case PSI_CPU_SOME:
-		return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
+		return unlikely(tasks[NR_RUNNING] > oncpu);
 	case PSI_CPU_FULL:
-		return unlikely(tasks[NR_RUNNING] && !tasks[NR_ONCPU]);
+		return unlikely(tasks[NR_RUNNING] && !oncpu);
 	case PSI_NONIDLE:
 		return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
 			tasks[NR_RUNNING];
@@ -693,9 +693,9 @@ static void psi_group_change(struct psi_group *group, int cpu,
 			     bool wake_clock)
 {
 	struct psi_group_cpu *groupc;
-	u32 state_mask = 0;
 	unsigned int t, m;
 	enum psi_states s;
+	u32 state_mask;
 
 	groupc = per_cpu_ptr(group->pcpu, cpu);
 
@@ -711,6 +711,26 @@ static void psi_group_change(struct psi_group *group, int cpu,
 
 	record_times(groupc, now);
 
+	/*
+	 * Start with TSK_ONCPU, which doesn't have a corresponding
+	 * task count - it's just a boolean flag directly encoded in
+	 * the state mask. Clear, set, or carry the current state if
+	 * no changes are requested.
+	 */
+	if (clear & TSK_ONCPU) {
+		state_mask = 0;
+		clear &= ~TSK_ONCPU;
+	} else if (set & TSK_ONCPU) {
+		state_mask = PSI_ONCPU;
+		set &= ~TSK_ONCPU;
+	} else {
+		state_mask = groupc->state_mask & PSI_ONCPU;
+	}
+
+	/*
+	 * The rest of the state mask is calculated based on the task
+	 * counts. Update those first, then construct the mask.
+	 */
 	for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
 		if (!(m & (1 << t)))
 			continue;
@@ -730,9 +750,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
 		if (set & (1 << t))
 			groupc->tasks[t]++;
 
-	/* Calculate state mask representing active states */
 	for (s = 0; s < NR_PSI_STATES; s++) {
-		if (test_state(groupc->tasks, s))
+		if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU))
 			state_mask |= (1 << s);
 	}
 
@@ -744,7 +763,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	 * task in a cgroup is in_memstall, the corresponding groupc
 	 * on that cpu is in PSI_MEM_FULL state.
 	 */
-	if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
+	if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
 		state_mask |= (1 << PSI_MEM_FULL);
 
 	groupc->state_mask = state_mask;
@@ -835,7 +854,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	 */
 	iter = NULL;
 	while ((group = iterate_groups(next, &iter))) {
-		if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+		if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
+		    PSI_ONCPU) {
 			common = group;
 			break;
 		}
-- 
2.36.1
From: Chengming Zhou
Subject: [PATCH v2 08/10] sched/psi: add PSI_IRQ to track IRQ/SOFTIRQ pressure
Date: Mon, 8 Aug 2022 19:03:39 +0800
Message-Id: <20220808110341.15799-9-zhouchengming@bytedance.com>

PSI already tracks workload pressure stall information for CPU, memory and IO. Apart from these, IRQ/SOFTIRQ can have an obvious impact on the productivity of some workloads, such as web services.

With CONFIG_IRQ_TIME_ACCOUNTING, we can get the IRQ/SOFTIRQ delta time in update_rq_clock_task() and record it against the CPU's current task's cgroups as PSI_IRQ_FULL status.

Note that we don't use PSI_IRQ_SOME, since IRQ/SOFTIRQ time always interrupts the current task on the CPU: nothing productive can run even if it were runnable, so only PSI_IRQ_FULL applies.

Signed-off-by: Chengming Zhou
---
 Documentation/admin-guide/cgroup-v2.rst |  6 +++
 include/linux/psi_types.h               |  6 ++-
 kernel/cgroup/cgroup.c                  | 27 ++++++++++
 kernel/sched/core.c                     |  1 +
 kernel/sched/psi.c                      | 65 ++++++++++++++++++++++++-
 kernel/sched/stats.h                    |  2 +
 6 files changed, 103 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 176298f2f4de..dd84e34bc051 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -968,6 +968,12 @@ All cgroup core files are prefixed with "cgroup."
 	killing cgroups is a process directed operation, i.e. it affects
 	the whole thread-group.
 
+  irq.pressure
+	A read-write nested-keyed file.
+
+	Shows pressure stall information for IRQ/SOFTIRQ. See
+	:ref:`Documentation/accounting/psi.rst ` for details.
+
 Controllers
 ===========
 
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 54cb74946db4..4677655f6ca1 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -42,7 +42,8 @@ enum psi_res {
 	PSI_IO,
 	PSI_MEM,
 	PSI_CPU,
-	NR_PSI_RESOURCES = 3,
+	PSI_IRQ,
+	NR_PSI_RESOURCES = 4,
 };
 
 /*
@@ -58,9 +59,10 @@ enum psi_states {
 	PSI_MEM_FULL,
 	PSI_CPU_SOME,
 	PSI_CPU_FULL,
+	PSI_IRQ_FULL,
 	/* Only per-CPU, to weigh the CPU in the global average: */
 	PSI_NONIDLE,
-	NR_PSI_STATES = 7,
+	NR_PSI_STATES = 8,
 };
 
 /* Use one bit in the state mask to track TSK_ONCPU */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 5f88117fc81e..91de8ff7fa50 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3692,6 +3692,23 @@ static ssize_t cgroup_cpu_pressure_write(struct kernfs_open_file *of,
 	return cgroup_pressure_write(of, buf, nbytes, PSI_CPU);
 }
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int cgroup_irq_pressure_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	struct psi_group *psi = cgroup_ino(cgrp) == 1 ?
+		&psi_system : &cgrp->psi;
+
+	return psi_show(seq, psi, PSI_IRQ);
+}
+
+static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
+					 char *buf, size_t nbytes,
+					 loff_t off)
+{
+	return cgroup_pressure_write(of, buf, nbytes, PSI_IRQ);
+}
+#endif
+
 static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
 				     poll_table *pt)
 {
@@ -5088,6 +5105,16 @@ static struct cftype cgroup_base_files[] = {
 		.poll = cgroup_pressure_poll,
 		.release = cgroup_pressure_release,
 	},
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+	{
+		.name = "irq.pressure",
+		.flags = CFTYPE_PRESSURE,
+		.seq_show = cgroup_irq_pressure_show,
+		.write = cgroup_irq_pressure_write,
+		.poll = cgroup_pressure_poll,
+		.release = cgroup_pressure_release,
+	},
+#endif
 #endif /* CONFIG_PSI */
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3aa401689f7e..4cfb6ab32142 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -708,6 +708,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 
 	rq->prev_irq_time += irq_delta;
 	delta -= irq_delta;
+	psi_account_irqtime(rq->curr, irq_delta);
 #endif
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 1c675715ed33..58f8092c938f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -910,6 +910,34 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	}
 }
 
+void psi_account_irqtime(struct task_struct *task, u32 delta)
+{
+	int cpu = task_cpu(task);
+	void *iter = NULL;
+	struct psi_group *group;
+	struct psi_group_cpu *groupc;
+	u64 now;
+
+	if (!task->pid)
+		return;
+
+	now = cpu_clock(cpu);
+
+	while ((group = iterate_groups(task, &iter))) {
+		groupc = per_cpu_ptr(group->pcpu, cpu);
+
+		write_seqcount_begin(&groupc->seq);
+
+		record_times(groupc, now);
+		groupc->times[PSI_IRQ_FULL] += delta;
+
+		write_seqcount_end(&groupc->seq);
+
+		if (group->poll_states &
+		    (1 << PSI_IRQ_FULL))
+			psi_schedule_poll_work(group, 1);
+	}
+}
+
 /**
  * psi_memstall_enter - mark the beginning of a memory stall section
  * @flags: flags to handle nested sections
@@ -1078,7 +1106,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 	group->avg_next_update = update_averages(group, now);
 	mutex_unlock(&group->avgs_lock);
 
-	for (full = 0; full < 2; full++) {
+	for (full = 0; full < 2 - (res == PSI_IRQ); full++) {
 		unsigned long avg[3] = { 0, };
 		u64 total = 0;
 		int w;
@@ -1092,7 +1120,7 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 		}
 
 		seq_printf(m, "%s avg10=%lu.%02lu avg60=%lu.%02lu avg300=%lu.%02lu total=%llu\n",
-			   full ? "full" : "some",
+			   full || (res == PSI_IRQ) ? "full" : "some",
			   LOAD_INT(avg[0]), LOAD_FRAC(avg[0]),
 			   LOAD_INT(avg[1]), LOAD_FRAC(avg[1]),
 			   LOAD_INT(avg[2]), LOAD_FRAC(avg[2]),
@@ -1120,6 +1148,9 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
 	else
 		return ERR_PTR(-EINVAL);
 
+	if ((res == PSI_IRQ) && (--state != PSI_IRQ_FULL))
+		return ERR_PTR(-EINVAL);
+
 	if (state >= PSI_NONIDLE)
 		return ERR_PTR(-EINVAL);
 
@@ -1404,6 +1435,33 @@ static const struct proc_ops psi_cpu_proc_ops = {
 	.proc_release	= psi_fop_release,
 };
 
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+static int psi_irq_show(struct seq_file *m, void *v)
+{
+	return psi_show(m, &psi_system, PSI_IRQ);
+}
+
+static int psi_irq_open(struct inode *inode, struct file *file)
+{
+	return psi_open(file, psi_irq_show);
+}
+
+static ssize_t psi_irq_write(struct file *file, const char __user *user_buf,
+			     size_t nbytes, loff_t *ppos)
+{
+	return psi_write(file, user_buf, nbytes, PSI_IRQ);
+}
+
+static const struct proc_ops psi_irq_proc_ops = {
+	.proc_open	= psi_irq_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_write	= psi_irq_write,
+	.proc_poll	= psi_fop_poll,
+	.proc_release	= psi_fop_release,
+};
+#endif
+
 static int __init
 psi_proc_init(void)
 {
 	if (psi_enable) {
@@ -1411,6 +1469,9 @@ static int __init psi_proc_init(void)
 		proc_create("pressure/io", 0666, NULL, &psi_io_proc_ops);
 		proc_create("pressure/memory", 0666, NULL, &psi_memory_proc_ops);
 		proc_create("pressure/cpu", 0666, NULL, &psi_cpu_proc_ops);
+#ifdef CONFIG_IRQ_TIME_ACCOUNTING
+		proc_create("pressure/irq", 0666, NULL, &psi_irq_proc_ops);
+#endif
 	}
 	return 0;
 }
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index e930b8fa6253..8b6cfc7a56f5 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -111,6 +111,7 @@ void psi_change_groups(struct task_struct *task, int clear, int set);
 void psi_task_change(struct task_struct *task, int clear, int set);
 void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 		     bool sleep);
+void psi_account_irqtime(struct task_struct *task, u32 delta);
 
 /*
  * PSI tracks state that persists across sleeps, such as iowaits and
@@ -203,6 +204,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
 static inline void psi_sched_switch(struct task_struct *prev,
 				    struct task_struct *next,
 				    bool sleep) {}
+static inline void psi_account_irqtime(struct task_struct *task, u32 delta) {}
 #endif /* CONFIG_PSI */
 
 #ifdef CONFIG_SCHED_INFO
-- 
2.36.1
From: Chengming Zhou
Subject: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable interface
Date: Mon, 8 Aug 2022 19:03:40 +0800
Message-Id: <20220808110341.15799-10-zhouchengming@bytedance.com>

PSI accounts stalls for each cgroup separately and aggregates them at each level of the hierarchy. This may cause non-negligible overhead for some workloads under a deep hierarchy.

commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable") made it possible for PSI to skip per-cgroup stall accounting and account only system-wide, avoiding this per-level overhead.

But for our use case we also want the PSI stats of leaf cgroups accounted, so that userspace can make adjustments on those cgroups, not just system-wide ones. So this patch introduces a per-cgroup PSI stats disable/re-enable interface, "cgroup.psi": a read-write single value file whose allowed values are "0" and "1". The default is "1", so per-cgroup PSI stats are enabled by default.
Implementation details:

It should be relatively straightforward to disable and re-enable state aggregation, time tracking and averaging on a per-cgroup level, if we can live with losing history from while it was disabled: the avgs will restart from 0 and total= will have gaps.

But it's hard or complex to stop/restart groupc->tasks[] updates, which is not implemented in this patch. So we always update groupc->tasks[] and the PSI_ONCPU bit in psi_group_change() even when the cgroup PSI stats are disabled.

Suggested-by: Johannes Weiner
Signed-off-by: Chengming Zhou
---
 Documentation/admin-guide/cgroup-v2.rst |  7 ++++
 include/linux/psi.h                     |  2 ++
 include/linux/psi_types.h               |  2 ++
 kernel/cgroup/cgroup.c                  | 43 +++++++++++++++++++++++++
 kernel/sched/psi.c                      | 40 +++++++++++++++++++----
 5 files changed, 87 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index dd84e34bc051..ade40506ab80 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -968,6 +968,13 @@ All cgroup core files are prefixed with "cgroup."
 	killing cgroups is a process directed operation, i.e. it affects
 	the whole thread-group.
 
+  cgroup.psi
+	A read-write single value file that allowed values are "0" and "1".
+	The default is "1".
+
+	Writing "0" to the file will disable the cgroup PSI stats accounting.
+	Writing "1" to the file will re-enable the cgroup PSI stats accounting.
+
   irq.pressure
 	A read-write nested-keyed file.
 
diff --git a/include/linux/psi.h b/include/linux/psi.h
index aa168a038242..1138ccffd76b 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -33,6 +33,7 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
 int psi_cgroup_alloc(struct cgroup *cgrp);
 void psi_cgroup_free(struct cgroup *cgrp);
 void cgroup_move_task(struct task_struct *p, struct css_set *to);
+void psi_cgroup_enable(struct psi_group *group, bool enable);
 #endif
 
 #else /* CONFIG_PSI */
@@ -54,6 +55,7 @@ static inline void cgroup_move_task(struct task_struct *p, struct css_set *to)
 {
 	rcu_assign_pointer(p->cgroups, to);
 }
+static inline void psi_cgroup_enable(struct psi_group *group, bool enable) {}
 #endif
 
 #endif /* CONFIG_PSI */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 4677655f6ca1..fced39e255aa 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -147,6 +147,8 @@ struct psi_trigger {
 };
 
 struct psi_group {
+	bool enabled;
+
 	/* Protects data used by the aggregator */
 	struct mutex avgs_lock;
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 91de8ff7fa50..6ba56983b5a5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3709,6 +3709,43 @@ static ssize_t cgroup_irq_pressure_write(struct kernfs_open_file *of,
 }
 #endif
 
+static int cgroup_psi_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp = seq_css(seq)->cgroup;
+	struct psi_group *psi = cgroup_ino(cgrp) == 1 ?
+				&psi_system : &cgrp->psi;
+
+	seq_printf(seq, "%d\n", psi->enabled);
+
+	return 0;
+}
+
+static ssize_t cgroup_psi_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	ssize_t ret;
+	int enable;
+	struct cgroup *cgrp;
+	struct psi_group *psi;
+
+	ret = kstrtoint(strstrip(buf), 0, &enable);
+	if (ret)
+		return ret;
+
+	if (enable < 0 || enable > 1)
+		return -ERANGE;
+
+	cgrp = cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENOENT;
+
+	psi = cgroup_ino(cgrp) == 1 ? &psi_system : &cgrp->psi;
+	psi_cgroup_enable(psi, enable);
+
+	cgroup_kn_unlock(of->kn);
+
+	return nbytes;
+}
+
 static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
				     poll_table *pt)
 {
@@ -5115,6 +5152,12 @@ static struct cftype cgroup_base_files[] = {
		.release = cgroup_pressure_release,
	},
 #endif
+	{
+		.name = "cgroup.psi",
+		.flags = CFTYPE_PRESSURE,
+		.seq_show = cgroup_psi_show,
+		.write = cgroup_psi_write,
+	},
 #endif /* CONFIG_PSI */
	{ }	/* terminate */
 };
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 58f8092c938f..9df1686ee02d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -181,6 +181,7 @@ static void group_init(struct psi_group *group)
 {
	int cpu;
 
+	group->enabled = true;
	for_each_possible_cpu(cpu)
		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
	group->avg_last_update = sched_clock();
@@ -700,17 +701,16 @@ static void psi_group_change(struct psi_group *group, int cpu,
	groupc = per_cpu_ptr(group->pcpu, cpu);
 
	/*
-	 * First we assess the aggregate resource states this CPU's
-	 * tasks have been in since the last change, and account any
-	 * SOME and FULL time these may have resulted in.
-	 *
-	 * Then we update the task counts according to the state
+	 * First we update the task counts according to the state
	 * change requested through the @clear and @set bits.
+	 *
+	 * Then, if cgroup PSI stats accounting is enabled, we
+	 * assess the aggregate resource states this CPU's tasks
+	 * have been in since the last change, and account any
+	 * SOME and FULL time these may have resulted in.
	 */
	write_seqcount_begin(&groupc->seq);
 
-	record_times(groupc, now);
-
	/*
	 * Start with TSK_ONCPU, which doesn't have a corresponding
	 * task count - it's just a boolean flag directly encoded in
@@ -750,6 +750,14 @@ static void psi_group_change(struct psi_group *group, int cpu,
		if (set & (1 << t))
			groupc->tasks[t]++;
 
+	if (!group->enabled) {
+		if (groupc->state_mask & (1 << PSI_NONIDLE))
+			record_times(groupc, now);
+		groupc->state_mask = state_mask;
+		write_seqcount_end(&groupc->seq);
+		return;
+	}
+
	for (s = 0; s < NR_PSI_STATES; s++) {
		if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU))
			state_mask |= (1 << s);
@@ -766,6 +774,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
	if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
		state_mask |= (1 << PSI_MEM_FULL);
 
+	record_times(groupc, now);
	groupc->state_mask = state_mask;
 
	write_seqcount_end(&groupc->seq);
@@ -1088,6 +1097,23 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 
	task_rq_unlock(rq, task, &rf);
 }
+
+void psi_cgroup_enable(struct psi_group *group, bool enable)
+{
+	struct psi_group_cpu *groupc;
+	int cpu;
+	u64 now;
+
+	if (group->enabled == enable)
+		return;
+	group->enabled = enable;
+
+	for_each_possible_cpu(cpu) {
+		groupc = per_cpu_ptr(group->pcpu, cpu);
+		now = cpu_clock(cpu);
+		psi_group_change(group, cpu, 0, 0, now, true);
+	}
+}
 #endif /* CONFIG_CGROUPS */
 
 int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
-- 
2.36.1

From nobody Sat Apr 11 18:38:01 2026
From: Chengming Zhou
To: hannes@cmpxchg.org, tj@kernel.org, corbet@lwn.net, surenb@google.com,
 mingo@redhat.com, peterz@infradead.org, vincent.guittot@linaro.org,
 dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, songmuchun@bytedance.com, Chengming Zhou
Subject: [PATCH v2 10/10] sched/psi: cache parent psi_group to speed up groups iterate
Date: Mon, 8 Aug 2022 19:03:41 +0800
Message-Id: <20220808110341.15799-11-zhouchengming@bytedance.com>
In-Reply-To: <20220808110341.15799-1-zhouchengming@bytedance.com>
References: <20220808110341.15799-1-zhouchengming@bytedance.com>

We use iterate_groups() to walk the psi_group at each level of the
hierarchy when updating PSI stats, which is a very hot path. The current
iterate_groups() has to go through multiple branches and a
cgroup_parent() call to find the parent psi_group at each level, which
is not very efficient.
This patch caches the parent psi_group in struct psi_group, so we only
need to look up the task's own psi_group once and can then follow
group->parent to iterate.

Signed-off-by: Chengming Zhou
---
 include/linux/psi_types.h |  1 +
 kernel/sched/psi.c        | 51 ++++++++++++++++++++-------------------
 2 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index fced39e255aa..7459a47fcb1f 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -148,6 +148,7 @@ struct psi_trigger {
 
 struct psi_group {
 	bool enabled;
+	struct psi_group *parent;
 
 	/* Protects data used by the aggregator */
 	struct mutex avgs_lock;
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 9df1686ee02d..d3c1c49b9bcf 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -786,30 +786,22 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	schedule_delayed_work(&group->avgs_work, PSI_FREQ);
 }
 
-static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
+static inline struct psi_group *task_psi_group(struct task_struct *task)
 {
-	if (*iter == &psi_system)
-		return NULL;
-
 #ifdef CONFIG_CGROUPS
	if (static_branch_likely(&psi_cgroups_enabled)) {
-		struct cgroup *cgroup = NULL;
-
-		if (!*iter)
-			cgroup = task->cgroups->dfl_cgrp;
-		else
-			cgroup = cgroup_parent(*iter);
+		struct cgroup *cgroup = task_dfl_cgroup(task);
 
-		if (cgroup && cgroup_parent(cgroup)) {
-			*iter = cgroup;
+		if (cgroup && cgroup_parent(cgroup))
			return cgroup_psi(cgroup);
-		}
	}
 #endif
-	*iter = &psi_system;
	return &psi_system;
 }
 
+#define for_each_psi_group(group) \
+	for (; group; group = group->parent)
+
 static void psi_flags_change(struct task_struct *task, int clear, int set)
 {
	if (((task->psi_flags & set) ||
@@ -827,12 +819,11 @@ static void psi_flags_change(struct task_struct *task, int clear, int set)
 
 void psi_change_groups(struct task_struct *task, int clear, int set)
 {
+	struct psi_group *group = task_psi_group(task);
	int cpu = task_cpu(task);
-	struct psi_group *group;
-	void *iter = NULL;
	u64 now = cpu_clock(cpu);
 
-	while ((group = iterate_groups(task, &iter)))
+	for_each_psi_group(group)
		psi_group_change(group, cpu, clear, set, now, true);
 }
 
@@ -850,7 +841,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 {
	struct psi_group *group, *common = NULL;
	int cpu = task_cpu(prev);
-	void *iter;
	u64 now = cpu_clock(cpu);
 
	if (next->pid) {
@@ -861,8 +851,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
		 * we reach the first common ancestor. Iterate @next's
		 * ancestors only until we encounter @prev's ONCPU.
		 */
-		iter = NULL;
-		while ((group = iterate_groups(next, &iter))) {
+		group = task_psi_group(next);
+		for_each_psi_group(group) {
			if (per_cpu_ptr(group->pcpu, cpu)->state_mask &
			    PSI_ONCPU) {
				common = group;
@@ -903,9 +893,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 
	psi_flags_change(prev, clear, set);
 
-	iter = NULL;
-	while ((group = iterate_groups(prev, &iter)) && group != common)
+	group = task_psi_group(prev);
+	for_each_psi_group(group) {
+		if (group == common)
+			break;
		psi_group_change(group, cpu, clear, set, now, wake_clock);
+	}
 
	/*
	 * TSK_ONCPU is handled up to the common ancestor. If we're tasked
@@ -913,7 +906,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
	 */
	if (sleep || unlikely(prev->in_memstall != next->in_memstall)) {
		clear &= ~TSK_ONCPU;
-		for (; group; group = iterate_groups(prev, &iter))
+		for_each_psi_group(group)
			psi_group_change(group, cpu, clear, set, now, wake_clock);
	}
 }
@@ -922,7 +915,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 void psi_account_irqtime(struct task_struct *task, u32 delta)
 {
	int cpu = task_cpu(task);
-	void *iter = NULL;
	struct psi_group *group;
	struct psi_group_cpu *groupc;
	u64 now;
@@ -932,7 +924,8 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
 
	now = cpu_clock(cpu);
 
-	while ((group = iterate_groups(task, &iter))) {
+	group = task_psi_group(task);
+	for_each_psi_group(group) {
		groupc = per_cpu_ptr(group->pcpu, cpu);
 
		write_seqcount_begin(&groupc->seq);
@@ -1010,6 +1003,8 @@ void psi_memstall_leave(unsigned long *flags)
 #ifdef CONFIG_CGROUPS
 int psi_cgroup_alloc(struct cgroup *cgroup)
 {
+	struct cgroup *parent;
+
	if (!static_branch_likely(&psi_cgroups_enabled))
		return 0;
 
@@ -1017,6 +1012,12 @@ int psi_cgroup_alloc(struct cgroup *cgroup)
	if (!cgroup->psi.pcpu)
		return -ENOMEM;
	group_init(&cgroup->psi);
+
+	parent = cgroup_parent(cgroup);
+	if (parent && cgroup_parent(parent))
+		cgroup->psi.parent = cgroup_psi(parent);
+	else
+		cgroup->psi.parent = &psi_system;
	return 0;
 }
 
-- 
2.36.1