From: Yuri Andriaccio
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider
Cc: linux-kernel@vger.kernel.org, Luca Abeni, Yuri Andriaccio
Subject: [PATCH] sched/deadline: Remove fair-servers from real-time task's bandwidth accounting
Date: Mon, 21 Jul 2025 13:11:31 +0200
Message-ID: <20250721111131.309388-1-yurand2000@gmail.com>
X-Mailer: git-send-email 2.50.1

Fair-servers are currently used in place of the old RT_THROTTLING mechanism to
prevent the starvation of SCHED_OTHER (and other lower priority) tasks when
real-time FIFO/RR processes are trying to fully utilize the CPU. To leave room
for the RT_THROTTLING mechanism, the maximum allocatable bandwidth for real-time
tasks has been limited to 95% of the CPU-time.

The RT_THROTTLING mechanism is now removed in favor of fair-servers, which are
currently set to use, as expected, 5% of the CPU-time. Still, they are accounted
against the same bandwidth that is available to real-time tasks, which is still
set to 95% of the total CPU-time. This means that, with the RT_THROTTLING
mechanism removed, the bandwidth remaining for real-time SCHED_DEADLINE tasks
and other dl-servers (FIFO/RR are not affected) is only 90%.

To demonstrate this, I've run the following script on the latest tip/master:
----------
PROCS=$(nproc)

echo "Allocate 95% bw per task/cpu"
for ((i = 0; i < $PROCS; i++))
do
    sleep infinity &
    chrt -d -T 95000000 -P 100000000 -D 100000000 -p 0 $!
done
killall sleep

echo "Allocate 90% bw per task/cpu"
for ((i = 0; i < $PROCS; i++))
do
    sleep infinity &
    chrt -d -T 90000000 -P 100000000 -D 100000000 -p 0 $!
done
killall sleep
----------

First off, we try to fully utilize the 95% rt-bandwidth by allocating #CPU
SCHED_DEADLINE tasks, requesting 95/100ms each. This fails because, as
mentioned, fair-servers are also accounted in the 95% real-time bw. The second
allocation shows that the scheduler instead allows 90% bandwidth. By playing
with the numbers and chrt(s), it is possible to see that the allocatable
bandwidth for SCHED_DEADLINE tasks is exactly 90%. At the same time,
stress tests show that when FIFO/RR tasks hog the CPU (SCHED_DEADLINE tasks are
of course limited by the admission test, which as mentioned fails beyond 90%
total bw utilization), the fair-servers only allocate at most 5% of the CPU-time
to SCHED_OTHER tasks. There is clearly 5% of CPU-time lost somewhere.

This patch reclaims the 5% lost SCHED_DEADLINE CPU-time (FIFO/RR are not
affected, there is no admission test there to perform), by accounting the
fair-server's bandwidth separately. After this patch, the above script runs
successfully also when allocating 95% bw per task/cpu.

Changes:
- Make the fair-servers' bandwidth not be accounted into the total allocated
  bandwidth for real-time tasks.
- Do not account for fair-servers in the GRUB's bandwidth reclaiming mechanism.
- Remove the admission control test when allocating a fair-server, as its
  bandwidth is accounted differently.
- Limit the max bandwidth to (BW_UNIT - max_rt_bw) when changing the parameters
  of a fair-server, preventing overcommitment.
- Add dl_bw_fair, which computes the total allocated bandwidth of the
  fair-servers in the given root-domain.
- Update admission tests (in sched_dl_global_validate) when changing the
  maximum allocatable bandwidth for real-time tasks, preventing overcommitment.

Notes:
Since the fair-server's bandwidth can be changed through debugfs, it has not
been enforced that a fair-server's bw must always be equal to
(BW_UNIT - max_rt_bw); rather, it must be less than or equal to this value.
This allows retaining the fair-servers' settings changed through debugfs when
changing the maximum real-time bandwidth.

This also means that in order to increase the maximum bandwidth for real-time
tasks, the bw of fair-servers must first be decreased through debugfs, otherwise
admission tests will fail. Vice versa, to increase the bw of fair-servers, the
bw of real-time tasks must be reduced beforehand.

Testing:
This patch has been tested with basic regression tests, by checking that it is
not possible to overcommit the bandwidth of fair-servers and that SCHED_OTHER
tasks do use at least the specified amount of bw (also varying the ratio of
rt/non-rt bandwidth).

Additionally, it has also been tested on top of this fix, ensuring that the
warning mentioned in the bug report is not re-triggered:
https://lore.kernel.org/all/aHpf4LfMtB2V9uNb@jlelli-thinkpadt14gen4.remote.csb/

Signed-off-by: Yuri Andriaccio
---
 kernel/sched/deadline.c | 66 ++++++++++++++++++-----------------------
 kernel/sched/sched.h    |  1 -
 kernel/sched/topology.c |  8 -----
 3 files changed, 29 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 1af06e48227..e97a7feb59d 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -141,6 +141,24 @@ static inline int dl_bw_cpus(int i)
 	return cpus;
 }
 
+static inline u64 dl_bw_fair(int i)
+{
+	struct root_domain *rd = cpu_rq(i)->rd;
+	u64 fair_server_bw = 0;
+
+	RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+			 "sched RCU must be held");
+
+	if (cpumask_subset(rd->span, cpu_active_mask))
+		i = cpumask_first(rd->span);
+
+	for_each_cpu_and(i, rd->span, cpu_active_mask) {
+		fair_server_bw += cpu_rq(i)->fair_server.dl_bw;
+	}
+
+	return fair_server_bw;
+}
+
 static inline unsigned long __dl_bw_capacity(const struct cpumask *mask)
 {
 	unsigned long cap = 0;
@@ -1657,25 +1675,9 @@ void sched_init_dl_servers(void)
 	}
 }
 
-void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
-{
-	u64 new_bw = dl_se->dl_bw;
-	int cpu = cpu_of(rq);
-	struct dl_bw *dl_b;
-
-	dl_b = dl_bw_of(cpu_of(rq));
-	guard(raw_spinlock)(&dl_b->lock);
-
-	if (!dl_bw_cpus(cpu))
-		return;
-
-	__dl_add(dl_b, new_bw, dl_bw_cpus(cpu));
-}
-
 int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
 {
-	u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
-	u64 new_bw = to_ratio(period, runtime);
+	u64 max_bw, new_bw = to_ratio(period, runtime);
 	struct rq *rq = dl_se->rq;
 	int cpu = cpu_of(rq);
 	struct dl_bw *dl_b;
@@ -1688,17 +1690,14 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
 
 	cpus = dl_bw_cpus(cpu);
 	cap = dl_bw_capacity(cpu);
+	max_bw = cap_scale(BW_UNIT - dl_b->bw, cap) / cpus;
 
-	if (__dl_overflow(dl_b, cap, old_bw, new_bw))
+	if (new_bw > max_bw)
 		return -EBUSY;
 
 	if (init) {
 		__add_rq_bw(new_bw, &rq->dl);
-		__dl_add(dl_b, new_bw, cpus);
 	} else {
-		__dl_sub(dl_b, dl_se->dl_bw, cpus);
-		__dl_add(dl_b, new_bw, cpus);
-
 		dl_rq_change_utilization(rq, dl_se, new_bw);
 	}
 
@@ -2932,17 +2931,6 @@ void dl_clear_root_domain(struct root_domain *rd)
 	rd->dl_bw.total_bw = 0;
 	for_each_cpu(i, rd->span)
 		cpu_rq(i)->dl.extra_bw = cpu_rq(i)->dl.max_bw;
-
-	/*
-	 * dl_servers are not tasks. Since dl_add_task_root_domain ignores
-	 * them, we need to account for them here explicitly.
-	 */
-	for_each_cpu(i, rd->span) {
-		struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
-
-		if (dl_server(dl_se) && cpu_active(i))
-			__dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i));
-	}
 }
 
 void dl_clear_root_domain_cpu(int cpu)
@@ -3126,9 +3114,10 @@ int sched_dl_global_validate(void)
 	u64 period = global_rt_period();
 	u64 new_bw = to_ratio(period, runtime);
 	u64 cookie = ++dl_cookie;
+	u64 fair_bw;
 	struct dl_bw *dl_b;
-	int cpu, cpus, ret = 0;
-	unsigned long flags;
+	int cpu, ret = 0;
+	unsigned long cap, flags;
 
 	/*
 	 * Here we want to check the bandwidth not being set to some
@@ -3142,10 +3131,13 @@ int sched_dl_global_validate(void)
 			goto next;
 
 		dl_b = dl_bw_of(cpu);
-		cpus = dl_bw_cpus(cpu);
+		cap = dl_bw_capacity(cpu);
+		fair_bw = dl_bw_fair(cpu);
 
 		raw_spin_lock_irqsave(&dl_b->lock, flags);
-		if (new_bw * cpus < dl_b->total_bw)
+		if (cap_scale(new_bw, cap) < dl_b->total_bw)
+			ret = -EBUSY;
+		if (cap_scale(new_bw, cap) + fair_bw > cap_scale(BW_UNIT, cap))
 			ret = -EBUSY;
 		raw_spin_unlock_irqrestore(&dl_b->lock, flags);
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ac953fad8c2..42b5d024dce 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -390,7 +390,6 @@ extern void sched_init_dl_servers(void);
 extern void dl_server_update_idle_time(struct rq *rq,
 		    struct task_struct *p);
 extern void fair_server_init(struct rq *rq);
-extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
 extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
 		    u64 runtime, u64 period, bool init);
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 977e133bb8a..4ea3365984a 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -500,14 +500,6 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
 		set_rq_online(rq);
 
-	/*
-	 * Because the rq is not a task, dl_add_task_root_domain() did not
-	 * move the fair server bw to the rd if it already started.
-	 * Add it now.
-	 */
-	if (rq->fair_server.dl_server)
-		__dl_server_attach_root(&rq->fair_server, rq);
-
 	rq_unlock_irqrestore(rq, &rf);
 
 	if (old_rd)
-- 
2.50.1