From nobody Mon Dec 1 22:02:15 2025
From: Yuri Andriaccio
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider
Cc: linux-kernel@vger.kernel.org, Luca Abeni, Yuri Andriaccio
Subject: [RFC PATCH v4 14/28] sched/rt: Update rt-cgroup schedulability checks
Date: Mon, 1 Dec 2025 13:41:47 +0100
Message-ID: <20251201124205.11169-15-yurand2000@gmail.com>
In-Reply-To: <20251201124205.11169-1-yurand2000@gmail.com>
References: <20251201124205.11169-1-yurand2000@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: luca abeni

Update sched_group_rt_runtime/period and sched_group_set_rt_runtime/period
to use the newly defined data structures and to perform the checks needed
to update both the runtime and the period of a given group.

The set functions call tg_set_rt_bandwidth(), which is also updated:
- Use the newly added HCBS dl_bandwidth structure instead of rt_bandwidth.
- Update __rt_schedulable() to check for numerical issues:
  - Reject a non-zero runtime that is too small: a very small non-zero
    runtime would make the servers behave as if they had zero runtime.
  - Since some computations use signed integers, reject a period so large
    that, read back as a signed integer, it would become negative. If the
    period satisfies this constraint, so does the runtime, since the
    runtime is always less than or equal to the period.
- Update tg_rt_schedulable(), used when walking the cgroup tree to check
  that all invariants are met:
  - Read most of the data from the newly added dl_bandwidth structure.
  - If the task group is the root group, run a total-bandwidth check with
    the newly added dl_check_tg() function.
  - Once all checks succeed, if the changed group is not the root cgroup,
    propagate the new runtime and period to all the local deadline servers.
  - Additionally, use a mutex guard instead of manually locking/unlocking.

Add dl_check_tg(), which performs an admission-control test similar to
__dl_overflow(), except that here we are updating the cgroup's total
bandwidth rather than scheduling a new DEADLINE task or updating a
non-cgroup deadline server.

Finally, prevent the creation of cgroup hierarchies deeper than two levels;
support for deeper hierarchies will be added in a future patch. A depth-two
hierarchy is sufficient for now to test the patchset.

Co-developed-by: Alessio Balsini
Signed-off-by: Alessio Balsini
Co-developed-by: Andrea Parri
Signed-off-by: Andrea Parri
Co-developed-by: Yuri Andriaccio
Signed-off-by: Yuri Andriaccio
Signed-off-by: luca abeni
---
 kernel/sched/core.c     |  6 ++++
 kernel/sched/deadline.c | 46 +++++++++++++++++++++++----
 kernel/sched/rt.c       | 70 +++++++++++++++++++++++-------------------
 kernel/sched/sched.h    |  1 +
 4 files changed, 87 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d7fc83cdae..bdf1bebe52 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9253,6 +9253,12 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return &root_task_group.css;
 	}
 
+	/* Do not allow cpu_cgroup hierarchies with depth greater than 2. */
+#ifdef CONFIG_RT_GROUP_SCHED
+	if (parent != &root_task_group)
+		return ERR_PTR(-EINVAL);
+#endif
+
 	tg = sched_create_group(parent);
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b890fdd4b2..7ed157dfa6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -347,7 +347,47 @@ void cancel_inactive_timer(struct sched_dl_entity *dl_se)
 	cancel_dl_timer(dl_se, &dl_se->inactive_timer);
 }
 
+/*
+ * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
+ * sched_domains_mutex.
+ */
+u64 dl_cookie;
+
 #ifdef CONFIG_RT_GROUP_SCHED
+int dl_check_tg(unsigned long total)
+{
+	unsigned long flags;
+	int which_cpu;
+	int cap;
+	struct dl_bw *dl_b;
+	u64 gen = ++dl_cookie;
+
+	for_each_possible_cpu(which_cpu) {
+		rcu_read_lock_sched();
+
+		if (!dl_bw_visited(which_cpu, gen)) {
+			cap = dl_bw_capacity(which_cpu);
+			dl_b = dl_bw_of(which_cpu);
+
+			raw_spin_lock_irqsave(&dl_b->lock, flags);
+
+			if (dl_b->bw != -1 &&
+			    cap_scale(dl_b->bw, cap) < dl_b->total_bw + cap_scale(total, cap)) {
+				raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+				rcu_read_unlock_sched();
+
+				return 0;
+			}
+
+			raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+		}
+
+		rcu_read_unlock_sched();
+	}
+
+	return 1;
+}
+
 void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
 {
 	struct rq *rq = container_of(dl_se->dl_rq, struct rq, dl);
@@ -3150,12 +3190,6 @@ DEFINE_SCHED_CLASS(dl) = {
 #endif
 };
 
-/*
- * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
- * sched_domains_mutex.
- */
-u64 dl_cookie;
-
 int sched_dl_global_validate(void)
 {
 	u64 runtime = global_rt_runtime();
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2b7c4b7754..b0a6da20b5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2007,11 +2007,6 @@ DEFINE_SCHED_CLASS(rt) = {
 };
 
 #ifdef CONFIG_RT_GROUP_SCHED
-/*
- * Ensure that the real time constraints are schedulable.
- */
-static DEFINE_MUTEX(rt_constraints_mutex);
-
 static inline int tg_has_rt_tasks(struct task_group *tg)
 {
 	struct task_struct *task;
@@ -2045,8 +2040,8 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 	unsigned long total, sum = 0;
 	u64 period, runtime;
 
-	period = ktime_to_ns(tg->rt_bandwidth.rt_period);
-	runtime = tg->rt_bandwidth.rt_runtime;
+	period = tg->dl_bandwidth.dl_period;
+	runtime = tg->dl_bandwidth.dl_runtime;
 
 	if (tg == d->tg) {
 		period = d->rt_period;
@@ -2062,8 +2057,7 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 	/*
 	 * Ensure we don't starve existing RT tasks if runtime turns zero.
 	 */
-	if (rt_bandwidth_enabled() && !runtime &&
-	    tg->rt_bandwidth.rt_runtime && tg_has_rt_tasks(tg))
+	if (dl_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg))
 		return -EBUSY;
 
 	if (WARN_ON(!rt_group_sched_enabled() && tg != &root_task_group))
@@ -2077,12 +2071,17 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 	if (total > to_ratio(global_rt_period(), global_rt_runtime()))
 		return -EINVAL;
 
+	if (tg == &root_task_group) {
+		if (!dl_check_tg(total))
+			return -EBUSY;
+	}
+
 	/*
 	 * The sum of our children's runtime should not exceed our own.
 	 */
 	list_for_each_entry_rcu(child, &tg->children, siblings) {
-		period = ktime_to_ns(child->rt_bandwidth.rt_period);
-		runtime = child->rt_bandwidth.rt_runtime;
+		period = child->dl_bandwidth.dl_period;
+		runtime = child->dl_bandwidth.dl_runtime;
 
 		if (child == d->tg) {
 			period = d->rt_period;
@@ -2108,6 +2107,20 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 		.rt_runtime = runtime,
 	};
 
+	/*
+	 * Since we truncate DL_SCALE bits, make sure we're at least
+	 * that big.
+	 */
+	if (runtime != 0 && runtime < (1ULL << DL_SCALE))
+		return -EINVAL;
+
+	/*
+	 * Since we use the MSB for wrap-around and sign issues, make
+	 * sure it's not set (mind that period can be equal to zero).
+	 */
+	if (period & (1ULL << 63))
+		return -EINVAL;
+
 	rcu_read_lock();
 	ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
 	rcu_read_unlock();
@@ -2118,6 +2131,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
+	static DEFINE_MUTEX(rt_constraints_mutex);
 	int i, err = 0;
 
 	/*
@@ -2137,34 +2151,30 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 	if (rt_runtime != RUNTIME_INF && rt_runtime > max_rt_runtime)
 		return -EINVAL;
 
-	mutex_lock(&rt_constraints_mutex);
+	guard(mutex)(&rt_constraints_mutex);
 	err = __rt_schedulable(tg, rt_period, rt_runtime);
 	if (err)
-		goto unlock;
+		return err;
 
-	raw_spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock);
-	tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period);
-	tg->rt_bandwidth.rt_runtime = rt_runtime;
+	guard(raw_spinlock_irq)(&tg->dl_bandwidth.dl_runtime_lock);
+	tg->dl_bandwidth.dl_period = rt_period;
+	tg->dl_bandwidth.dl_runtime = rt_runtime;
 
-	for_each_possible_cpu(i) {
-		struct rt_rq *rt_rq = tg->rt_rq[i];
+	if (tg == &root_task_group)
+		return 0;
 
-		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		rt_rq->rt_runtime = rt_runtime;
-		raw_spin_unlock(&rt_rq->rt_runtime_lock);
+	for_each_possible_cpu(i) {
+		dl_init_tg(tg->dl_se[i], rt_runtime, rt_period);
 	}
-	raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
-unlock:
-	mutex_unlock(&rt_constraints_mutex);
 
-	return err;
+	return 0;
 }
 
 int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
 {
 	u64 rt_runtime, rt_period;
 
-	rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period);
+	rt_period = tg->dl_bandwidth.dl_period;
 	rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC;
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
@@ -2178,10 +2188,10 @@ long sched_group_rt_runtime(struct task_group *tg)
 {
 	u64 rt_runtime_us;
 
-	if (tg->rt_bandwidth.rt_runtime == RUNTIME_INF)
+	if (tg->dl_bandwidth.dl_runtime == RUNTIME_INF)
 		return -1;
 
-	rt_runtime_us = tg->rt_bandwidth.rt_runtime;
+	rt_runtime_us = tg->dl_bandwidth.dl_runtime;
 	do_div(rt_runtime_us, NSEC_PER_USEC);
 	return rt_runtime_us;
 }
@@ -2194,7 +2204,7 @@ int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us)
 		return -EINVAL;
 
 	rt_period = rt_period_us * NSEC_PER_USEC;
-	rt_runtime = tg->rt_bandwidth.rt_runtime;
+	rt_runtime = tg->dl_bandwidth.dl_runtime;
 
 	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
@@ -2203,7 +2213,7 @@ long sched_group_rt_period(struct task_group *tg)
 {
 	u64 rt_period_us;
 
-	rt_period_us = ktime_to_ns(tg->rt_bandwidth.rt_period);
+	rt_period_us = tg->dl_bandwidth.dl_period;
 	do_div(rt_period_us, NSEC_PER_USEC);
 	return rt_period_us;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bc3ed02e40..334ab6d597 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -419,6 +419,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
 		    struct rq *served_rq,
 		    dl_server_pick_f pick_task);
 extern void sched_init_dl_servers(void);
+extern int dl_check_tg(unsigned long total);
 extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period);
 
 extern void dl_server_update_idle_time(struct rq *rq,
-- 
2.51.0