From: Yuri Andriaccio
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider
Cc: linux-kernel@vger.kernel.org, Luca Abeni, Yuri Andriaccio
Subject: [RFC PATCH v4 19/28] sched/deadline: Allow deeper hierarchies of RT cgroups
Date: Mon, 1 Dec 2025 13:41:52 +0100
Message-ID: <20251201124205.11169-20-yurand2000@gmail.com>
X-Mailer: git-send-email 2.51.0
In-Reply-To: <20251201124205.11169-1-yurand2000@gmail.com>
References: <20251201124205.11169-1-yurand2000@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

From: luca abeni

Remove the check that restricted cgroup hierarchies to depth two.

Introduce the concept of live and active groups:
- A group is live if it is a leaf group or if all its children have zero
  runtime.
- A live group with non-zero runtime can be used to schedule tasks.
- A live group with running tasks is deemed active.
- A non-live group cannot be used to run tasks; it is used only for
  bandwidth accounting, i.e. the sum of its children's bandwidth must be
  less than or equal to the bandwidth of the parent.

This change allows cgroups to be used for bandwidth management across
different users.
- While the root cgroup specifies the total allocatable bandwidth of RT
  cgroups, further accounting is performed to keep track of the live
  bandwidth, i.e. the sum of the bandwidth of live groups. The hierarchy
  invariant states that the live bandwidth must always be less than or
  equal to the total allocatable bandwidth.

Add is_live_sched_group() and sched_group_has_live_siblings() in deadline.c.
These utility functions are used by dl_init_tg() to perform updates only
when necessary:
- Only live groups may update the active dl bandwidth of dl entities (call
  to dl_rq_change_utilization()), while non-live groups must not use
  servers, and thus must not change the active dl bandwidth.
- The total bandwidth accounting must be changed to follow the
  live/non-live rules:
  - When disabling (runtime zero) the last child of a group, the parent
    becomes a live group, and so the parent's bandwidth must be accounted
    back.
  - When enabling (runtime non-zero) the first child, the parent becomes a
    non-live group, and so the parent's bandwidth must be removed.

Update free_rt_sched_group() to zero out the runtime only of servers whose
runtime is not already zero. This is also necessary to force the bandwidth
accounting of live groups.

Update tg_set_rt_bandwidth() to change the runtime of a group to a non-zero
value only if its parent is inactive (the root group is exempt from this
check), thus forcing the parent to become non-live if it was previously
live (it would already have been non-live if a sibling cgroup was live).

Update sched_rt_can_attach() to allow attaching only to live groups.

Update dl_init_tg() to take a task_group pointer and a CPU id rather than a
direct pointer to the CPU's deadline server. The task_group pointer is
necessary to check and update the live bandwidth accounting.

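For illustration only, the live/non-live rules and the hierarchy invariant
described above can be modelled in user space. This is a minimal sketch
under stated assumptions, not the kernel implementation: struct group,
group_add_child(), group_is_live(), group_bw_ppm(), live_bw_ppm() and
demo_live_bw_ppm() are hypothetical names, bandwidth is expressed in parts
per million instead of the kernel's to_ratio() fixed point, and the root
group is assumed to contribute only the cap, not live bandwidth.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for a task group; not the kernel's struct task_group. */
struct group {
	uint64_t runtime_us;	/* 0 means the group is disabled */
	uint64_t period_us;
	struct group *children[MAX_CHILDREN];
	size_t nr_children;
};

void group_add_child(struct group *parent, struct group *child)
{
	assert(parent->nr_children < MAX_CHILDREN);
	parent->children[parent->nr_children++] = child;
}

/* Live: a leaf, or a group whose children all have zero runtime. */
bool group_is_live(const struct group *g)
{
	for (size_t i = 0; i < g->nr_children; i++) {
		if (g->children[i]->runtime_us > 0)
			return false;
	}
	return true;
}

/* Bandwidth in parts per million (runtime / period). */
uint64_t group_bw_ppm(const struct group *g)
{
	if (g->period_us == 0)
		return 0;
	return g->runtime_us * 1000000ULL / g->period_us;
}

/* Sum the bandwidth of every live group strictly below the root. */
uint64_t live_bw_ppm(const struct group *g, bool is_root)
{
	uint64_t sum = 0;

	if (!is_root && group_is_live(g))
		sum += group_bw_ppm(g);
	for (size_t i = 0; i < g->nr_children; i++)
		sum += live_bw_ppm(g->children[i], false);
	return sum;
}

/*
 * Build root -> { A -> { A1, A2 }, B } with a 1 s period everywhere and
 * return the live bandwidth for the given A1/A2 runtimes.
 */
uint64_t demo_live_bw_ppm(uint64_t a1_runtime_us, uint64_t a2_runtime_us)
{
	struct group root = { .runtime_us = 950000, .period_us = 1000000 };
	struct group a = { .runtime_us = 400000, .period_us = 1000000 };
	struct group b = { .runtime_us = 300000, .period_us = 1000000 };
	struct group a1 = { .runtime_us = a1_runtime_us, .period_us = 1000000 };
	struct group a2 = { .runtime_us = a2_runtime_us, .period_us = 1000000 };
	uint64_t live;

	group_add_child(&root, &a);
	group_add_child(&root, &b);
	group_add_child(&a, &a1);
	group_add_child(&a, &a2);

	live = live_bw_ppm(&root, true);
	/* Hierarchy invariant: live bandwidth <= total allocatable bandwidth. */
	assert(live <= group_bw_ppm(&root));
	return live;
}
```

In this model, demo_live_bw_ppm(100000, 200000) leaves A non-live, so only
A1, A2 and B are counted (600000 ppm). With demo_live_bw_ppm(0, 0), A
becomes live again and is counted together with B (700000 ppm).
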
Co-developed-by: Yuri Andriaccio
Signed-off-by: Yuri Andriaccio
Signed-off-by: luca abeni
---
 kernel/sched/core.c     |  6 ----
 kernel/sched/deadline.c | 61 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/rt.c       | 16 +++++++++--
 kernel/sched/sched.h    |  3 +-
 4 files changed, 72 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cfb39050a2..983cd1b478 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9253,12 +9253,6 @@ cpu_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return &root_task_group.css;
 	}
 
-	/* Do not allow cpu_cgroup hierachies with depth greater than 2. */
-#ifdef CONFIG_RT_GROUP_SCHED
-	if (parent != &root_task_group)
-		return ERR_PTR(-EINVAL);
-#endif
-
 	tg = sched_create_group(parent);
 	if (IS_ERR(tg))
 		return ERR_PTR(-ENOMEM);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7ed157dfa6..082bccc30b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -388,11 +388,44 @@ int dl_check_tg(unsigned long total)
 	return 1;
 }
 
-void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
+/*
+ * A cgroup is deemed live if:
+ * - It is a leaf cgroup.
+ * - All its children have zero runtime.
+ */
+bool is_live_sched_group(struct task_group *tg)
+{
+	struct task_group *child;
+	bool is_active = 1;
+
+	/* if there are no children, this is a leaf group, thus it is live */
+	list_for_each_entry_rcu(child, &tg->children, siblings) {
+		if (child->dl_bandwidth.dl_runtime > 0)
+			is_active = 0;
+	}
+	return is_active;
+}
+
+static inline bool sched_group_has_live_siblings(struct task_group *tg)
+{
+	struct task_group *child;
+	bool has_active_siblings = 0;
+
+	list_for_each_entry_rcu(child, &tg->parent->children, siblings) {
+		if (child != tg && child->dl_bandwidth.dl_runtime > 0)
+			has_active_siblings = 1;
+	}
+	return has_active_siblings;
+}
+
+void dl_init_tg(struct task_group *tg, int cpu, u64 rt_runtime, u64 rt_period)
 {
+	struct sched_dl_entity *dl_se = tg->dl_se[cpu];
 	struct rq *rq = container_of(dl_se->dl_rq, struct rq, dl);
-	int is_active;
-	u64 new_bw;
+	int is_active, is_live_group;
+	u64 old_runtime, new_bw;
+
+	is_live_group = is_live_sched_group(tg);
 
 	guard(raw_spin_rq_lock_irq)(rq);
 	is_active = dl_se->my_q->rt.rt_nr_running > 0;
@@ -400,8 +433,10 @@ void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
 	update_rq_clock(rq);
 	dl_server_stop(dl_se);
 
+	old_runtime = dl_se->dl_runtime;
 	new_bw = to_ratio(rt_period, rt_runtime);
-	dl_rq_change_utilization(rq, dl_se, new_bw);
+	if (is_live_group)
+		dl_rq_change_utilization(rq, dl_se, new_bw);
 
 	dl_se->dl_runtime = rt_runtime;
 	dl_se->dl_deadline = rt_period;
@@ -413,6 +448,24 @@ void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period)
 	dl_se->dl_bw = new_bw;
 	dl_se->dl_density = new_bw;
 
+	/*
+	 * Handle parent bandwidth accounting when child runtime changes:
+	 * - When disabling the last child, the parent becomes a live group,
+	 *   and so the parent's bandwidth must be accounted back.
+	 * - When enabling the first child, the parent becomes a non-live group,
+	 *   and so the parent's bandwidth must be removed.
+	 * Only live groups (those without active children) have non-zero bandwidth.
+	 */
+	if (tg->parent && tg->parent != &root_task_group) {
+		if (rt_runtime == 0 && old_runtime != 0 &&
+		    !sched_group_has_live_siblings(tg)) {
+			__add_rq_bw(tg->parent->dl_se[cpu]->dl_bw, dl_se->dl_rq);
+		} else if (rt_runtime != 0 && old_runtime == 0 &&
+			   !sched_group_has_live_siblings(tg)) {
+			__sub_rq_bw(tg->parent->dl_se[cpu]->dl_bw, dl_se->dl_rq);
+		}
+	}
+
 	if (is_active)
 		dl_server_start(dl_se);
 }
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 928f53c1b0..a2084e9dc5 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -113,7 +113,8 @@ void free_rt_sched_group(struct task_group *tg)
 		 * Fix this issue by changing the group runtime
 		 * to 0 immediately before freeing it.
 		 */
-		dl_init_tg(tg->dl_se[i], 0, tg->dl_se[i]->dl_period);
+		if (tg->dl_se[i]->dl_runtime)
+			dl_init_tg(tg, i, 0, tg->dl_se[i]->dl_period);
 
 		raw_spin_rq_lock_irqsave(cpu_rq(i), flags);
 		hrtimer_cancel(&tg->dl_se[i]->dl_timer);
@@ -2134,6 +2135,14 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 	static DEFINE_MUTEX(rt_constraints_mutex);
 	int i, err = 0;
 
+	/*
+	 * Do not allow setting an RT runtime > 0 if the parent has RT tasks
+	 * (and is not the root group)
+	 */
+	if (rt_runtime && tg != &root_task_group &&
+	    tg->parent != &root_task_group && tg_has_rt_tasks(tg->parent))
+		return -EINVAL;
+
 	/* No period doesn't make any sense. */
 	if (rt_period == 0)
 		return -EINVAL;
@@ -2157,7 +2166,7 @@ static int tg_set_rt_bandwidth(struct task_group *tg,
 		return 0;
 
 	for_each_possible_cpu(i) {
-		dl_init_tg(tg->dl_se[i], rt_runtime, rt_period);
+		dl_init_tg(tg, i, rt_runtime, rt_period);
 	}
 
 	return 0;
@@ -2228,7 +2237,8 @@ int sched_rt_can_attach(struct task_group *tg)
 	if (rt_group_sched_enabled() && tg->dl_bandwidth.dl_runtime == 0)
 		return 0;
 
-	return 1;
+	/* tasks can be attached only if the taskgroup has no live children. */
+	return (int)is_live_sched_group(tg);
 }
 
 #else /* !CONFIG_RT_GROUP_SCHED */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4b65775ada..6c3fbfe84f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -411,7 +411,8 @@ extern void dl_server_init(struct sched_dl_entity *dl_se, struct dl_rq *dl_rq,
 		    dl_server_pick_f pick_task);
 extern void sched_init_dl_servers(void);
 extern int dl_check_tg(unsigned long total);
-extern void dl_init_tg(struct sched_dl_entity *dl_se, u64 rt_runtime, u64 rt_period);
+extern void dl_init_tg(struct task_group *tg, int cpu, u64 rt_runtime, u64 rt_period);
+extern bool is_live_sched_group(struct task_group *tg);
 
 extern void dl_server_update_idle_time(struct rq *rq,
 		    struct task_struct *p);
-- 
2.51.0
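
For illustration only, the parent accounting that the dl_init_tg() hunk
performs when a child's runtime changes can be modelled in user space. This
is a rough sketch under stated assumptions, not the kernel code: struct
bw_model and the bw_model_*() helpers are hypothetical, a single counter
stands in for the per-runqueue bandwidth touched by __add_rq_bw() and
__sub_rq_bw(), bandwidth is in parts per million, and the model covers one
non-root parent with leaf children on one CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_CHILDREN 2

/* Hypothetical model of one non-root parent and its children on one CPU. */
struct bw_model {
	uint64_t parent_bw_ppm;
	uint64_t child_bw_ppm[NR_CHILDREN];	/* 0 means disabled */
	uint64_t rq_bw_ppm;			/* bandwidth accounted on the runqueue */
};

void bw_model_init(struct bw_model *m, uint64_t parent_bw_ppm)
{
	m->parent_bw_ppm = parent_bw_ppm;
	for (size_t i = 0; i < NR_CHILDREN; i++)
		m->child_bw_ppm[i] = 0;
	/* No child is enabled yet, so the parent is live and accounted. */
	m->rq_bw_ppm = parent_bw_ppm;
}

/* True if any child other than idx currently has non-zero bandwidth. */
bool bw_model_has_enabled_sibling(const struct bw_model *m, size_t idx)
{
	for (size_t i = 0; i < NR_CHILDREN; i++) {
		if (i != idx && m->child_bw_ppm[i] > 0)
			return true;
	}
	return false;
}

void bw_model_set_child(struct bw_model *m, size_t idx, uint64_t new_bw_ppm)
{
	uint64_t old_bw_ppm = m->child_bw_ppm[idx];

	/* The child is a leaf, hence live: replace its own contribution. */
	m->rq_bw_ppm = m->rq_bw_ppm - old_bw_ppm + new_bw_ppm;
	m->child_bw_ppm[idx] = new_bw_ppm;

	if (bw_model_has_enabled_sibling(m, idx))
		return;		/* parent liveness does not change */

	if (new_bw_ppm == 0 && old_bw_ppm != 0)
		m->rq_bw_ppm += m->parent_bw_ppm;	/* last child disabled */
	else if (new_bw_ppm != 0 && old_bw_ppm == 0)
		m->rq_bw_ppm -= m->parent_bw_ppm;	/* first child enabled */
}
```

Starting from a parent with 400000 ppm, enabling the first child at 100000
ppm replaces the parent's share with the child's; enabling or disabling a
second child while the first is still enabled changes only that child's
share; disabling the last enabled child restores the parent's 400000 ppm.
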