From nobody Mon Dec 1 22:02:15 2025
From: Yuri Andriaccio
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider
Cc: linux-kernel@vger.kernel.org, Luca Abeni, Yuri Andriaccio
Subject: [RFC PATCH v4 14/28] sched/rt: Update rt-cgroup schedulability checks
Date: Mon, 1 Dec 2025 13:41:47 +0100
Message-ID: <20251201124205.11169-15-yurand2000@gmail.com>
In-Reply-To: <20251201124205.11169-1-yurand2000@gmail.com>
References: <20251201124205.11169-1-yurand2000@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

From: luca abeni

Update sched_group_rt_runtime/period and sched_group_set_rt_runtime/period
to use the newly defined data structures and to perform the checks needed
to update both the runtime and the period of a given group.

The set functions call tg_set_rt_bandwidth(), which is also updated:
- Use the newly added HCBS dl_bandwidth structure instead of rt_bandwidth.
- Update __rt_schedulable() to check for numerical issues:
  - Reject a non-zero runtime that is too small: a very small non-zero
    runtime would make the servers behave as if they had zero runtime.
  - Since some computations use signed integers, reject a period so large
    that, read back as a signed integer, it would become negative. If the
    period satisfies this constraint, so does the runtime, since the
    runtime is always less than or equal to the period.
- Update tg_rt_schedulable(), used when walking the cgroup tree to check
  that all invariants are met:
  - Read most of the data from the newly added dl_bandwidth structure.
  - If the task group is the root group, run a total-bandwidth check with
    the newly added dl_check_tg() function.
  - Once all checks succeed, if the changed group is not the root cgroup,
    propagate the new runtime and period to all the local deadline servers.
  - Additionally, use a mutex guard instead of manually locking/unlocking.

Add dl_check_tg(), which performs an admission-control test similar to
__dl_overflow(), except that here we are updating the cgroup's total
bandwidth rather than scheduling a new DEADLINE task or updating a
non-cgroup deadline server.

Finally, prevent the creation of cgroup hierarchies deeper than two levels;
support for deeper hierarchies will be added in a future patch. A depth-two
hierarchy is sufficient for now to test the patchset.

Co-developed-by: Alessio Balsini
Signed-off-by: Alessio Balsini
Co-developed-by: Andrea Parri
Signed-off-by: Andrea Parri
Co-developed-by: Yuri Andriaccio
Signed-off-by: Yuri Andriaccio
Signed-off-by: luca abeni
---
 kernel/sched/core.c     |  6 ++++
 kernel/sched/deadline.c | 46 +++++++++++++++++++++++----
 kernel/sched/rt.c       | 70 +++++++++++++++++++++++-------------------
 kernel/sched/sched.h    |  1 +
 4 files changed, 87 insertions(+), 36 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d7fc83cdae..bdf1bebe52 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9253,6 +9253,12 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return &root_task_group.css;
 	}
 
+	/* Do not allow cpu_cgroup hierarchies with depth greater than 2. */
+#ifdef CONFIG_RT_GROUP_SCHED
+	if (parent != &root_task_group)
+		return ERR_PTR(-EINVAL);
+#endif
+
 	tg = sched_create_group(parent);
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b890fdd4b2..7ed157dfa6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -347,7 +347,47 @@ void cancel_inactive_timer(struct sched_dl_entity *dl_se)
 	cancel_dl_timer(dl_se, &dl_se->inactive_timer);
 }
 
+/*
+ * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
+ * sched_domains_mutex.
+ */
+u64 dl_cookie;
+
 #ifdef CONFIG_RT_GROUP_SCHED
+int dl_check_tg(unsigned long total)
+{
+	unsigned long flags;
+	int which_cpu;
+	int cap;
+	struct dl_bw *dl_b;
+	u64 gen = ++dl_cookie;
+
+	for_each_possible_cpu(which_cpu) {
+		rcu_read_lock_sched();
+
+		if (!dl_bw_visited(which_cpu, gen)) {
+			cap = dl_bw_capacity(which_cpu);
+			dl_b = dl_bw_of(which_cpu);
+
+			raw_spin_lock_irqsave(&dl_b->lock, flags);
+
+			if (dl_b->bw != -1 &&
+			    cap_scale(dl_b->bw, cap) < dl_b->total_bw + cap_scale(total, cap)) {
+				raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+				rcu_read_unlock_sched();
+
+				return 0;
+			}
+
+			raw_spin_unlock_irqrestore(&dl_b->lock, flags);
+		}
+
+		rcu_read_unlock_sched();
+	}
+
+	return 1;
+}
+
 void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
 {
 	struct rq *rq = container_of(dl_se->dl_rq, struct rq, dl);
@@ -3150,12 +3190,6 @@ DEFINE_SCHED_CLASS(dl) = {
 #endif
 };
 
-/*
- * Used for dl_bw check and update, used under sched_rt_handler()::mutex and
- * sched_domains_mutex.
- */
-u64 dl_cookie;
-
 int sched_dl_global_validate(void)
 {
 	u64 runtime = global_rt_runtime();
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2b7c4b7754..b0a6da20b5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2007,11 +2007,6 @@ DEFINE_SCHED_CLASS(rt) = {
 };
 
 #ifdef CONFIG_RT_GROUP_SCHED
-/*
- * Ensure that the real time constraints are schedulable.
- */
-static DEFINE_MUTEX(rt_constraints_mutex);
-
 static inline int tg_has_rt_tasks(struct task_group *tg)
 {
 	struct task_struct *task;
@@ -2045,8 +2040,8 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 	unsigned long total, sum = 0;
 	u64 period, runtime;
 
-	period = ktime_to_ns(tg->rt_bandwidth.rt_period);
-	runtime = tg->rt_bandwidth.rt_runtime;
+	period = tg->dl_bandwidth.dl_period;
+	runtime = tg->dl_bandwidth.dl_runtime;
 
 	if (tg == d->tg) {
 		period = d->rt_period;
@@ -2062,8 +2057,7 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 	/*
 	 * Ensure we don't starve existing RT tasks if runtime turns zero.
 	 */
-	if (rt_bandwidth_enabled() && !runtime &&
-	    tg->rt_bandwidth.rt_runtime && tg_has_rt_tasks(tg))
+	if (dl_bandwidth_enabled() && !runtime && tg_has_rt_tasks(tg))
 		return -EBUSY;
 
 	if (WARN_ON(!rt_group_sched_enabled() && tg != &root_task_group))
@@ -2077,12 +2071,17 @@ static int tg_rt_schedulable(struct task_group *tg, void *data)
 	if (total > to_ratio(global_rt_period(), global_rt_runtime()))
 		return -EINVAL;
 
+	if (tg == &root_task_group) {
+		if (!dl_check_tg(total))
+			return -EBUSY;
+	}
+
 	/*
 	 * The sum of our children's runtime should not exceed our own.
 	 */
 	list_for_each_entry_rcu(child, &tg->children, siblings) {
-		period = ktime_to_ns(child->rt_bandwidth.rt_period);
-		runtime = child->rt_bandwidth.rt_runtime;
+		period = child->dl_bandwidth.dl_period;
+		runtime = child->dl_bandwidth.dl_runtime;
 
 		if (child == d->tg) {
 			period = d->rt_period;
@@ -2108,6 +2107,20 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 		.rt_runtime = runtime,
 	};
 
+	/*
+	 * Since we truncate DL_SCALE bits, make sure we're at least
+	 * that big.
+	 */
+	if (runtime != 0 && runtime < (1ULL << DL_SCALE))
+		return -EINVAL;
+
+	/*
+	 * Since we use the MSB for wrap-around and sign issues, make
+	 * sure it's not set (mind that period can be equal to zero).
+	 */
+	if (period & (1ULL << 63))
+		return -EINVAL;
+
 	rcu_read_lock();
 	ret = walk_tg_tree(tg_rt_schedulable, tg_nop, &data);
 	rcu_read_unlock();
@@ -2118,6 +2131,7 @@ static int __rt_schedulable(struct task_group *tg, u64 period, u64 runtime)
 static int tg_set_rt_bandwidth(struct task_group *tg,
 		u64 rt_period, u64 rt_runtime)
 {
+	static DEFINE_MUTEX(rt_constraints_mutex);
 	int i, err = 0;
 
 	/*
@@ -2137,34 +2151,30 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 	if (rt_runtime != RUNTIME_INF && rt_runtime > max_rt_runtime)
 		return -EINVAL;
 
-	mutex_lock(&rt_constraints_mutex);
+	guard(mutex)(&rt_constraints_mutex);
 	err = __rt_schedulable(tg, rt_period, rt_runtime);
 	if (err)
-		goto unlock;
+		return err;
 
-	raw_spin_lock_irq(&tg->rt_bandwidth.rt_runtime_lock);
-	tg->rt_bandwidth.rt_period = ns_to_ktime(rt_period);
-	tg->rt_bandwidth.rt_runtime = rt_runtime;
+	guard(raw_spinlock_irq)(&tg->dl_bandwidth.dl_runtime_lock);
+	tg->dl_bandwidth.dl_period = rt_period;
+	tg->dl_bandwidth.dl_runtime = rt_runtime;
 
-	for_each_possible_cpu(i) {
-		struct rt_rq *rt_rq = tg->rt_rq[i];
+	if (tg == &root_task_group)
+		return 0;
 
-		raw_spin_lock(&rt_rq->rt_runtime_lock);
-		rt_rq->rt_runtime = rt_runtime;
-		raw_spin_unlock(&rt_rq->rt_runtime_lock);
+	for_each_possible_cpu(i) {
+		dl_init_tg(tg->dl_se[i], rt_runtime, rt_period);
 	}
-	raw_spin_unlock_irq(&tg->rt_bandwidth.rt_runtime_lock);
-unlock:
-	mutex_unlock(&rt_constraints_mutex);
 
-	return err;
+	return 0;
 }
 
 int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
 {
 	u64 rt_runtime, rt_period;
 
-	rt_period = ktime_to_ns(tg->rt_bandwidth.rt_period);
+	rt_period = tg->dl_bandwidth.dl_period;
 	rt_runtime = (u64)rt_runtime_us * NSEC_PER_USEC;
 	if (rt_runtime_us < 0)
 		rt_runtime = RUNTIME_INF;
@@ -2178,10 +2188,10 @@ long sched_group_rt_runtime(struct task_group *tg)
 {
 	u64 rt_runtime_us;
 
-	if (tg->rt_bandwidth.rt_runtime == RUNTIME_INF)
+	if (tg->dl_bandwidth.dl_runtime == RUNTIME_INF)
 		return -1;
 
-	rt_runtime_us = tg->rt_bandwidth.rt_runtime;
+	rt_runtime_us = tg->dl_bandwidth.dl_runtime;
 	do_div(rt_runtime_us, NSEC_PER_USEC);
 	return rt_runtime_us;
 }
@@ -2194,7 +2204,7 @@ int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us)
 		return -EINVAL;
 
 	rt_period = rt_period_us * NSEC_PER_USEC;
-	rt_runtime = tg->rt_bandwidth.rt_runtime;
+	rt_runtime = tg->dl_bandwidth.dl_runtime;
 
 	return tg_set_rt_bandwidth(tg, rt_period, rt_runtime);
 }
@@ -2203,7 +2213,7 @@ long sched_group_rt_period(struct task_group *tg)
 {
 	u64 rt_period_us;
 
-	rt_period_us = ktime_to_ns(tg->rt_bandwidth.rt_period);
+	rt_period_us = tg->dl_bandwidth.dl_period;
 	do_div(rt_period_us, NSEC_PER_USEC);
 	return rt_period_us;
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bc3ed02e40..334ab6d597 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -419,6 +419,7 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
 		    struct rq *served_rq,
 		    dl_server_pick_f pick_task);
 extern void sched_init_dl_servers(void);
+extern int dl_check_tg(unsigned long total);
 extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period);
 
 extern void dl_server_update_idle_time(struct rq *rq,
-- 
2.51.0