From nobody Sat Apr 11 19:53:55 2026
Return-Path: <linux-kernel-owner@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 4E89CC25B07
	for <linux-kernel@archiver.kernel.org>; Mon,  8 Aug 2022 11:06:24 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S242938AbiHHLGV (ORCPT <rfc822;linux-kernel@archiver.kernel.org>);
        Mon, 8 Aug 2022 07:06:21 -0400
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49672 "EHLO
        lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S242949AbiHHLFm (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 8 Aug 2022 07:05:42 -0400
Received: from mail-pg1-x52d.google.com (mail-pg1-x52d.google.com
 [IPv6:2607:f8b0:4864:20::52d])
        by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EAA5A654A
        for <linux-kernel@vger.kernel.org>;
 Mon,  8 Aug 2022 04:05:24 -0700 (PDT)
Received: by mail-pg1-x52d.google.com with SMTP id f65so8229904pgc.12
        for <linux-kernel@vger.kernel.org>;
 Mon, 08 Aug 2022 04:05:24 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=bytedance-com.20210112.gappssmtp.com; s=20210112;
        h=from:to:cc:subject:date:message-id:in-reply-to:references
         :mime-version:content-transfer-encoding;
        bh=9POYeDaYW7dtn4XhApQfMlcQg9lt0ElNIM/9f/TV/5E=;
        b=OZ3MLAf56eTu/xwNZPNVcUkfq+qDU59H/XlcOYLDD9rxbjRorHzl5ExkQJf+sKjXyk
         Uywpm2rhNUjZgOJUHr+BDDZFXYGAChF2FVEZWDtfCNjWznbdnKI4RWsW3hwkDEWZ7CAX
         OFGH913FUIcQHQjYFODmCm++dzsTRQey+/Ienv6jdQDiBFT2HUwqBsZz3rSIQZTDe0Gs
         qQVjT0fvUHMyxNdL8UlvO+R69FWMt0IMWltX4I3frkyGghIe7xcsOPa95uexCJaKkL6u
         Aen/ACE4QTM0a6KhGD9MtEvhlMOTNs2uNEfDRWQZNjH5yT9sEFfT4h9vaJbaIVNJmr8T
         xelg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20210112;
        h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
         :references:mime-version:content-transfer-encoding;
        bh=9POYeDaYW7dtn4XhApQfMlcQg9lt0ElNIM/9f/TV/5E=;
        b=2rpE0KInGHSjUklh+FP0MEJ5PvPRC7vgLdHAcB058bSt+GZuQ9qq4dPKYoAI6brwoO
         6K0ghv8XPK3T3k+QAPqXIHyrFk2caXio8iWKCqkdvPxLLPqnaceD8PlVwjrW6pfdoXQb
         WVUFUJ9hGvPtlmtyaC6myTZEurFHS1kOdF43IZnKvCX1HplLytzSheUViSFa3Kd8uEZ3
         g+1pCqmVmlAt8J/uQ3i8++eYUQcMJBxr+96ps6hWo7kKTuj++0kwh6JnVjzUe+whoeZ5
         7EVzXfMV1LU32ViPYiVCMfLaqpw8YhoLRwQ4MxfFXr9LVSKcH6EYWdC8EUBeb9F08tnU
         FVmQ==
X-Gm-Message-State: ACgBeo0mKETL0RU0xJ04C3jpthVVOhRjNupeURJHw/8tmv29RaJWPmnl
        xAlMEJi1VOova2xnQwHqRNB9WQ==
X-Google-Smtp-Source: 
 AA6agR5CHqkMZvGrY3BOz9NP31v2sEpmcPeURzj70Kpu6UGaeJ4mw7ASU/0Ow1vSJLmsKKcfjSHvTQ==
X-Received: by 2002:a05:6a00:240f:b0:52e:f99d:1157 with SMTP id
 z15-20020a056a00240f00b0052ef99d1157mr8930124pfh.70.1659956724220;
        Mon, 08 Aug 2022 04:05:24 -0700 (PDT)
Received: from C02CV1DAMD6P.bytedance.net ([139.177.225.240])
        by smtp.gmail.com with ESMTPSA id
 o12-20020aa7978c000000b0052dbad1ea2esm8393180pfp.6.2022.08.08.04.05.16
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Mon, 08 Aug 2022 04:05:23 -0700 (PDT)
From: Chengming Zhou <zhouchengming@bytedance.com>
To: hannes@cmpxchg.org, tj@kernel.org, corbet@lwn.net,
        surenb@google.com, mingo@redhat.com, peterz@infradead.org,
        vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
        rostedt@goodmis.org, bsegall@google.com
Cc: cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
        linux-kernel@vger.kernel.org, songmuchun@bytedance.com,
        Chengming Zhou <zhouchengming@bytedance.com>
Subject: [PATCH v2 09/10] sched/psi: per-cgroup PSI stats disable/re-enable
 interface
Date: Mon,  8 Aug 2022 19:03:40 +0800
Message-Id: <20220808110341.15799-10-zhouchengming@bytedance.com>
X-Mailer: git-send-email 2.35.1
In-Reply-To: <20220808110341.15799-1-zhouchengming@bytedance.com>
References: <20220808110341.15799-1-zhouchengming@bytedance.com>
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Precedence: bulk
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="utf-8"

PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.

commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking confi=
gurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.

But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.

So this patch introduce a per-cgroup PSI stats disable/re-enable
interface "cgroup.psi", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.

Implementation details:

It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total=3D will have gaps.

But it's hard or complex to stop/restart groupc->tasks[] updates,
which is not implemented in this patch. So we always update
groupc->tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  7 ++++
 include/linux/psi.h                     |  2 ++
 include/linux/psi_types.h               |  2 ++
 kernel/cgroup/cgroup.c                  | 43 +++++++++++++++++++++++++
 kernel/sched/psi.c                      | 40 +++++++++++++++++++----
 5 files changed, 87 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-=
guide/cgroup-v2.rst
index dd84e34bc051..ade40506ab80 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -968,6 +968,13 @@ All cgroup core files are prefixed with "cgroup."
 	killing cgroups is a process directed operation, i.e. it affects
 	the whole thread-group.
=20
+  cgroup.psi
+	A read-write single value file that allowed values are "0" and "1".
+	The default is "1".
+
+	Writing "0" to the file will disable the cgroup PSI stats accounting.
+	Writing "1" to the file will re-enable the cgroup PSI stats accounting.
+
   irq.pressure
 	A read-write nested-keyed file.
=20
diff --git a/include/linux/psi.h b/include/linux/psi.h
index aa168a038242..1138ccffd76b 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -33,6 +33,7 @@ __poll_t psi_trigger_poll(void **trigger_ptr, struct file=
 *file,
 int psi_cgroup_alloc(struct cgroup *cgrp);
 void psi_cgroup_free(struct cgroup *cgrp);
 void cgroup_move_task(struct task_struct *p, struct css_set *to);
+void psi_cgroup_enable(struct psi_group *group, bool enable);
 #endif
=20
 #else /* CONFIG_PSI */
@@ -54,6 +55,7 @@ static inline void cgroup_move_task(struct task_struct *p=
, struct css_set *to)
 {
 	rcu_assign_pointer(p->cgroups, to);
 }
+static inline void psi_cgroup_enable(struct psi_group *group, bool enable)=
 {}
 #endif
=20
 #endif /* CONFIG_PSI */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index 4677655f6ca1..fced39e255aa 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -147,6 +147,8 @@ struct psi_trigger {
 };
=20
 struct psi_group {
+	bool enabled;
+
 	/* Protects data used by the aggregator */
 	struct mutex avgs_lock;
=20
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 91de8ff7fa50..6ba56983b5a5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3709,6 +3709,43 @@ static ssize_t cgroup_irq_pressure_write(struct kern=
fs_open_file *of,
 }
 #endif
=20
+static int cgroup_psi_show(struct seq_file *seq, void *v)
+{
+	struct cgroup *cgrp =3D seq_css(seq)->cgroup;
+	struct psi_group *psi =3D cgroup_ino(cgrp) =3D=3D 1 ? &psi_system : &cgrp=
->psi;
+
+	seq_printf(seq, "%d\n", psi->enabled);
+
+	return 0;
+}
+
+static ssize_t cgroup_psi_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	ssize_t ret;
+	int enable;
+	struct cgroup *cgrp;
+	struct psi_group *psi;
+
+	ret =3D kstrtoint(strstrip(buf), 0, &enable);
+	if (ret)
+		return ret;
+
+	if (enable < 0 || enable > 1)
+		return -ERANGE;
+
+	cgrp =3D cgroup_kn_lock_live(of->kn, false);
+	if (!cgrp)
+		return -ENOENT;
+
+	psi =3D cgroup_ino(cgrp) =3D=3D 1 ? &psi_system : &cgrp->psi;
+	psi_cgroup_enable(psi, enable);
+
+	cgroup_kn_unlock(of->kn);
+
+	return nbytes;
+}
+
 static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
 					  poll_table *pt)
 {
@@ -5115,6 +5152,12 @@ static struct cftype cgroup_base_files[] =3D {
 		.release =3D cgroup_pressure_release,
 	},
 #endif
+	{
+		.name =3D "cgroup.psi",
+		.flags =3D CFTYPE_PRESSURE,
+		.seq_show =3D cgroup_psi_show,
+		.write =3D cgroup_psi_write,
+	},
 #endif /* CONFIG_PSI */
 	{ }	/* terminate */
 };
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 58f8092c938f..9df1686ee02d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -181,6 +181,7 @@ static void group_init(struct psi_group *group)
 {
 	int cpu;
=20
+	group->enabled =3D true;
 	for_each_possible_cpu(cpu)
 		seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
 	group->avg_last_update =3D sched_clock();
@@ -700,17 +701,16 @@ static void psi_group_change(struct psi_group *group,=
 int cpu,
 	groupc =3D per_cpu_ptr(group->pcpu, cpu);
=20
 	/*
-	 * First we assess the aggregate resource states this CPU's
-	 * tasks have been in since the last change, and account any
-	 * SOME and FULL time these may have resulted in.
-	 *
-	 * Then we update the task counts according to the state
+	 * First we update the task counts according to the state
 	 * change requested through the @clear and @set bits.
+	 *
+	 * Then if the cgroup PSI stats accounting enabled, we
+	 * assess the aggregate resource states this CPU's tasks
+	 * have been in since the last change, and account any
+	 * SOME and FULL time these may have resulted in.
 	 */
 	write_seqcount_begin(&groupc->seq);
=20
-	record_times(groupc, now);
-
 	/*
 	 * Start with TSK_ONCPU, which doesn't have a corresponding
 	 * task count - it's just a boolean flag directly encoded in
@@ -750,6 +750,14 @@ static void psi_group_change(struct psi_group *group, =
int cpu,
 		if (set & (1 << t))
 			groupc->tasks[t]++;
=20
+	if (!group->enabled) {
+		if (groupc->state_mask & (1 << PSI_NONIDLE))
+			record_times(groupc, now);
+		groupc->state_mask =3D state_mask;
+		write_seqcount_end(&groupc->seq);
+		return;
+	}
+
 	for (s =3D 0; s < NR_PSI_STATES; s++) {
 		if (test_state(groupc->tasks, s, state_mask & PSI_ONCPU))
 			state_mask |=3D (1 << s);
@@ -766,6 +774,7 @@ static void psi_group_change(struct psi_group *group, i=
nt cpu,
 	if (unlikely((state_mask & PSI_ONCPU) && cpu_curr(cpu)->in_memstall))
 		state_mask |=3D (1 << PSI_MEM_FULL);
=20
+	record_times(groupc, now);
 	groupc->state_mask =3D state_mask;
=20
 	write_seqcount_end(&groupc->seq);
@@ -1088,6 +1097,23 @@ void cgroup_move_task(struct task_struct *task, stru=
ct css_set *to)
=20
 	task_rq_unlock(rq, task, &rf);
 }
+
+void psi_cgroup_enable(struct psi_group *group, bool enable)
+{
+	struct psi_group_cpu *groupc;
+	int cpu;
+	u64 now;
+
+	if (group->enabled =3D=3D enable)
+		return;
+	group->enabled =3D enable;
+
+	for_each_possible_cpu(cpu) {
+		groupc =3D per_cpu_ptr(group->pcpu, cpu);
+		now =3D cpu_clock(cpu);
+		psi_group_change(group, cpu, 0, 0, now, true);
+	}
+}
 #endif /* CONFIG_CGROUPS */
=20
 int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
--=20
2.36.1