kernel/sched/deadline.c | 66 ++++++++++++++++++----------------------- kernel/sched/sched.h | 1 - kernel/sched/topology.c | 8 ----- 3 files changed, 29 insertions(+), 46 deletions(-)
Fair-servers are currently used in place of the old RT_THROTTLING mechanism to
prevent the starvation of SCHED_OTHER (and other lower priority) tasks when
real-time FIFO/RR processes are trying to fully utilize the CPU. To allow the
RT_THROTTLING mechanism, the maximum allocatable bandwidth for real-time tasks
has been limited to 95% of the CPU-time.
The RT_THROTTLING mechanism is now removed in favor of fair-servers, which are
currently set to use, as expected, 5% of the CPU-time. Still, they share the
same bandwidth that allows to run real-time tasks, and which is still set to 95%
of the total CPU-time. This means that by removing the RT_THROTTLING mechanism,
the bandwidth remaining for real-time SCHED_DEADLINE tasks and other dl-servers
(FIFO/RR are not affected) is only 90%.
This patch reclaims the 5% lost CPU-time, which is definitely reserved for
SCHED_OTHER tasks, but should not be accounted together with the other real-time
tasks. More generally, the fair-servers' bandwidth must not be accounted with
other real-time tasks.
Updates:
- Make the fair-servers' bandwidth not be accounted into the total allocated
bandwidth for real-time tasks.
- Remove the admission control test when allocating a fair-server.
- Do not account for fair-servers in the GRUB's bandwidth reclaiming mechanism.
- Limit the max bandwidth to (BW_UNIT - max_rt_bw) when changing the parameters
of a fair-server, preventing overcommitment.
- Add dl_bw_fair, which computes the total allocated bandwidth of the
fair-servers in the given root-domain.
- Update admission tests (in sched_dl_global_validate) when changing the
maximum allocatable bandwidth for real-time tasks, preventing overcommitment.
Since the fair-server's bandwidth can be changed through debugfs, it has not
been enforced that a fair-server's bw must be always equal to (BW_UNIT -
max_rt_bw), rather it must be less than or equal to this value. This allows retaining
the fair-servers' settings changed through the debugfs when changing the
max_rt_bw.
This also means that in order to increase the maximum bandwidth for real-time
tasks, the bw of fair-servers must be first decreased through debugfs otherwise
admission tests will fail, and vice versa, to increase the bw of fair-servers,
the bw of real-time tasks must be reduced beforehand.
This v2 version addresses the compilation error on i386 reported at:
https://lore.kernel.org/oe-kbuild-all/202507220727.BmA1Osdg-lkp@intel.com/
v1: https://lore.kernel.org/all/20250721111131.309388-1-yurand2000@gmail.com/
Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com>
---
kernel/sched/deadline.c | 66 ++++++++++++++++++-----------------------
kernel/sched/sched.h | 1 -
kernel/sched/topology.c | 8 -----
3 files changed, 29 insertions(+), 46 deletions(-)
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index e2d51f4306..8ba6bf3ef6 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -141,6 +141,24 @@ static inline int dl_bw_cpus(int i)
return cpus;
}
+static inline u64 dl_bw_fair(int i)
+{
+ struct root_domain *rd = cpu_rq(i)->rd;
+ u64 fair_server_bw = 0;
+
+ RCU_LOCKDEP_WARN(!rcu_read_lock_sched_held(),
+ "sched RCU must be held");
+
+ if (cpumask_subset(rd->span, cpu_active_mask))
+ i = cpumask_first(rd->span);
+
+ for_each_cpu_and(i, rd->span, cpu_active_mask) {
+ fair_server_bw += cpu_rq(i)->fair_server.dl_bw;
+ }
+
+ return fair_server_bw;
+}
+
static inline unsigned long __dl_bw_capacity(const struct cpumask *mask)
{
unsigned long cap = 0;
@@ -1657,25 +1675,9 @@ void sched_init_dl_servers(void)
}
}
-void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq)
-{
- u64 new_bw = dl_se->dl_bw;
- int cpu = cpu_of(rq);
- struct dl_bw *dl_b;
-
- dl_b = dl_bw_of(cpu_of(rq));
- guard(raw_spinlock)(&dl_b->lock);
-
- if (!dl_bw_cpus(cpu))
- return;
-
- __dl_add(dl_b, new_bw, dl_bw_cpus(cpu));
-}
-
int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 period, bool init)
{
- u64 old_bw = init ? 0 : to_ratio(dl_se->dl_period, dl_se->dl_runtime);
- u64 new_bw = to_ratio(period, runtime);
+ u64 max_bw, new_bw = to_ratio(period, runtime);
struct rq *rq = dl_se->rq;
int cpu = cpu_of(rq);
struct dl_bw *dl_b;
@@ -1688,17 +1690,14 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio
cpus = dl_bw_cpus(cpu);
cap = dl_bw_capacity(cpu);
+ max_bw = div64_ul(cap_scale(BW_UNIT - dl_b->bw, cap), (unsigned long)cpus);
- if (__dl_overflow(dl_b, cap, old_bw, new_bw))
+ if (new_bw > max_bw)
return -EBUSY;
if (init) {
__add_rq_bw(new_bw, &rq->dl);
- __dl_add(dl_b, new_bw, cpus);
} else {
- __dl_sub(dl_b, dl_se->dl_bw, cpus);
- __dl_add(dl_b, new_bw, cpus);
-
dl_rq_change_utilization(rq, dl_se, new_bw);
}
@@ -2939,17 +2938,6 @@ void dl_clear_root_domain(struct root_domain *rd)
rd->dl_bw.total_bw = 0;
for_each_cpu(i, rd->span)
cpu_rq(i)->dl.extra_bw = cpu_rq(i)->dl.max_bw;
-
- /*
- * dl_servers are not tasks. Since dl_add_task_root_domain ignores
- * them, we need to account for them here explicitly.
- */
- for_each_cpu(i, rd->span) {
- struct sched_dl_entity *dl_se = &cpu_rq(i)->fair_server;
-
- if (dl_server(dl_se) && cpu_active(i))
- __dl_add(&rd->dl_bw, dl_se->dl_bw, dl_bw_cpus(i));
- }
}
void dl_clear_root_domain_cpu(int cpu)
@@ -3133,9 +3121,10 @@ int sched_dl_global_validate(void)
u64 period = global_rt_period();
u64 new_bw = to_ratio(period, runtime);
u64 cookie = ++dl_cookie;
+ u64 fair_bw;
struct dl_bw *dl_b;
- int cpu, cpus, ret = 0;
- unsigned long flags;
+ int cpu, ret = 0;
+ unsigned long cap, flags;
/*
* Here we want to check the bandwidth not being set to some
@@ -3149,10 +3138,13 @@ int sched_dl_global_validate(void)
goto next;
dl_b = dl_bw_of(cpu);
- cpus = dl_bw_cpus(cpu);
+ cap = dl_bw_capacity(cpu);
+ fair_bw = dl_bw_fair(cpu);
raw_spin_lock_irqsave(&dl_b->lock, flags);
- if (new_bw * cpus < dl_b->total_bw)
+ if (cap_scale(new_bw, cap) < dl_b->total_bw)
+ ret = -EBUSY;
+ if (cap_scale(new_bw, cap) + fair_bw > cap_scale(BW_UNIT, cap))
ret = -EBUSY;
raw_spin_unlock_irqrestore(&dl_b->lock, flags);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d3f33d10c5..8719ab8a81 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -390,7 +390,6 @@ extern void sched_init_dl_servers(void);
extern void dl_server_update_idle_time(struct rq *rq,
struct task_struct *p);
extern void fair_server_init(struct rq *rq);
-extern void __dl_server_attach_root(struct sched_dl_entity *dl_se, struct rq *rq);
extern int dl_server_apply_params(struct sched_dl_entity *dl_se,
u64 runtime, u64 period, bool init);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 977e133bb8..4ea3365984 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -500,14 +500,6 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
if (cpumask_test_cpu(rq->cpu, cpu_active_mask))
set_rq_online(rq);
- /*
- * Because the rq is not a task, dl_add_task_root_domain() did not
- * move the fair server bw to the rd if it already started.
- * Add it now.
- */
- if (rq->fair_server.dl_server)
- __dl_server_attach_root(&rq->fair_server, rq);
-
rq_unlock_irqrestore(rq, &rf);
if (old_rd)
base-commit: ee90c3bb525e6ea0845e5b70f0beef0abc8f2373
--
2.50.1
Hi Yuri, On 25/07/25 18:44, Yuri Andriaccio wrote: > Fair-servers are currently used in place of the old RT_THROTTLING mechanism to > prevent the starvation of SCHED_OTHER (and other lower priority) tasks when > real-time FIFO/RR processes are trying to fully utilize the CPU. To allow the > RT_THROTTLING mechanism, the maximum allocatable bandwidth for real-time tasks > has been limited to 95% of the CPU-time. > > The RT_THROTTLING mechanism is now removed in favor of fair-servers, which are > currently set to use, as expected, 5% of the CPU-time. Still, they share the > same bandwidth that allows to run real-time tasks, and which is still set to 95% > of the total CPU-time. This means that by removing the RT_THROTTLING mechanism, > the bandwidth remaning for real-time SCHED_DEADLINE tasks and other dl-servers > (FIFO/RR are not affected) is only 90%. > > This patch reclaims the 5% lost CPU-time, which is definitely reserved for > SCHED_OTHER tasks, but should not be accounted togheter with the other real-time > tasks. More generally, the fair-servers' bandwidth must not be accounted with > other real-time tasks. > > Updates: > - Make the fair-servers' bandwidth not be accounted into the total allocated > bandwidth for real-time tasks. > - Remove the admission control test when allocating a fair-server. > - Do not account for fair-servers in the GRUB's bandwidth reclaiming mechanism. > - Limit the max bandwidth to (BW_UNIT - max_rt_bw) when changing the parameters > of a fair-server, preventing overcommitment. > - Add dl_bw_fair, which computes the total allocated bandwidth of the > fair-servers in the given root-domain. > - Update admission tests (in sched_dl_global_validate) when changing the > maximum allocatable bandwidth for real-time tasks, preventing overcommitment. 
> > Since the fair-server's bandwidth can be changed through debugfs, it has not > been enforced that a fair-server's bw must be always equal to (BW_UNIT - > max_rt_bw), rather it must be less or equal to this value. This allows retaining > the fair-servers' settings changed through the debugfs when chaning the > max_rt_bw. > > This also means that in order to increase the maximum bandwidth for real-time > tasks, the bw of fair-servers must be first decreased through debugfs otherwise > admission tests will fail, and viceversa, to increase the bw of fair-servers, > the bw of real-time tasks must be reduced beforehand. > > This v2 version addresses the compilation error on i386 reported at: > https://lore.kernel.org/oe-kbuild-all/202507220727.BmA1Osdg-lkp@intel.com/ > > v1: https://lore.kernel.org/all/20250721111131.309388-1-yurand2000@gmail.com/ > > Signed-off-by: Yuri Andriaccio <yurand2000@gmail.com> > --- Thanks for this. I have been testing it and it looks good. Just a couple of comments below. ... > @@ -1688,17 +1690,14 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio > > cpus = dl_bw_cpus(cpu); > cap = dl_bw_capacity(cpu); > + max_bw = div64_ul(cap_scale(BW_UNIT - dl_b->bw, cap), (unsigned long)cpus); fc975cfb3639 ("sched/deadline: Fix dl_server runtime calculation formula") essentially removed cap/freq scaling for dl-servers. Should we rather not scale max_bw here as well? > - if (__dl_overflow(dl_b, cap, old_bw, new_bw)) > + if (new_bw > max_bw) > return -EBUSY; > > if (init) { > __add_rq_bw(new_bw, &rq->dl); > - __dl_add(dl_b, new_bw, cpus); > } else { > - __dl_sub(dl_b, dl_se->dl_bw, cpus); > - __dl_add(dl_b, new_bw, cpus); > - > dl_rq_change_utilization(rq, dl_se, new_bw); > } ... 
> @@ -3149,10 +3138,13 @@ int sched_dl_global_validate(void) > goto next; > > dl_b = dl_bw_of(cpu); > - cpus = dl_bw_cpus(cpu); > + cap = dl_bw_capacity(cpu); > + fair_bw = dl_bw_fair(cpu); > > raw_spin_lock_irqsave(&dl_b->lock, flags); > - if (new_bw * cpus < dl_b->total_bw) > + if (cap_scale(new_bw, cap) < dl_b->total_bw) > + ret = -EBUSY; It's kind of a minor one, but can't we return early at this point already? > + if (cap_scale(new_bw, cap) + fair_bw > cap_scale(BW_UNIT, cap)) > ret = -EBUSY; > raw_spin_unlock_irqrestore(&dl_b->lock, flags); Thanks! Juri
Hi, thanks for reviewing the patch. > > @@ -1688,17 +1690,14 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio > > > > cpus = dl_bw_cpus(cpu); > > cap = dl_bw_capacity(cpu); > > + max_bw = div64_ul(cap_scale(BW_UNIT - dl_b->bw, cap), (unsigned long)cpus); > > fc975cfb3639 ("sched/deadline: Fix dl_server runtime calculation > formula") essentially removed cap/freq scaling for dl-servers. Should we > rather not scale max_bw here as well? Now that I think about it, you are correct. Since the fair-servers' rate is fixed (i.e. by default 50ms every second), the bandwidth must be scaled for both the CPU and the server, or equally, neither needs scaling for the check in question. ... > > @@ -3149,10 +3138,13 @@ int sched_dl_global_validate(void) > > goto next; > > > > dl_b = dl_bw_of(cpu); > > - cpus = dl_bw_cpus(cpu); > > + cap = dl_bw_capacity(cpu); > > + fair_bw = dl_bw_fair(cpu); > > > > raw_spin_lock_irqsave(&dl_b->lock, flags); > > - if (new_bw * cpus < dl_b->total_bw) > > + if (cap_scale(new_bw, cap) < dl_b->total_bw) > > + ret = -EBUSY; > > It's kind of a minor one, but can't we return early at this point already? Yes, I suppose so. I'll update the patch to return as soon as the error condition is met. Additionally, I'll also update some of the checks in the above function to reflect the aforementioned fixed rate behaviour for fair-servers. Have a nice day, Yuri
Hey, On 01/08/25 18:03, Yuri Andriaccio wrote: > Hi, > > thanks for reviewing the patch. > > > > @@ -1688,17 +1690,14 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio > > > > > > cpus = dl_bw_cpus(cpu); > > > cap = dl_bw_capacity(cpu); > > > + max_bw = div64_ul(cap_scale(BW_UNIT - dl_b->bw, cap), (unsigned long)cpus); > > > > fc975cfb3639 ("sched/deadline: Fix dl_server runtime calculation > > formula") essentially removed cap/freq scaling for dl-servers. Should we > > rather not scale max_bw here as well? > > Now that I think about it, you are correct. Since the fair-servers' rate is > fixed (i.e. by default 50ms every second), the bandwidth must be scaled for both > the CPU and the server, or equally, neither needs scaling for the check in > question. > > ... > > > > @@ -3149,10 +3138,13 @@ int sched_dl_global_validate(void) > > > goto next; > > > > > > dl_b = dl_bw_of(cpu); > > > - cpus = dl_bw_cpus(cpu); > > > + cap = dl_bw_capacity(cpu); > > > + fair_bw = dl_bw_fair(cpu); > > > > > > raw_spin_lock_irqsave(&dl_b->lock, flags); > > > - if (new_bw * cpus < dl_b->total_bw) > > > + if (cap_scale(new_bw, cap) < dl_b->total_bw) > > > + ret = -EBUSY; > > > > It's kind of a minor one, but can't we return early at this point already? > > Yes, I suppose so. I'll update the patch to return as soon as the error > condition is met. > > Additionally, I'll also update some of the checks in the above function to > reflect the aforementioned fixed rate behaviour for fair-servers. Don't think you had a chance to send a new version yet, no worries! But, I just noticed that this seems to regress cpu hotplug. With this applied, offlining of cpus fails with device or resource busy on my test system. Can you please double check? Thanks! Juri
Hi Yuri, On Fri, 25 Jul 2025 18:44:12 +0200, Yuri Andriaccio <yurand2000@gmail.com> wrote: > ... > @@ -1688,17 +1690,14 @@ int dl_server_apply_params(struct sched_dl_entity *dl_se, u64 runtime, u64 perio > > cpus = dl_bw_cpus(cpu); > cap = dl_bw_capacity(cpu); > + max_bw = div64_ul(cap_scale(BW_UNIT - dl_b->bw, cap), (unsigned long)cpus); This line exceeds 80 characters width. Perhaps it needs to be split. > > - if (__dl_overflow(dl_b, cap, old_bw, new_bw)) > + if (new_bw > max_bw) > return -EBUSY; > ... Beside that minor note, I retested the v2 of this patch with the same tests I ran for v1 [1]. I confirm that stress-ng and runtime variations commands provide the same results. Also no warning is produced anymore as I also applied Juri's patch as you suggested [2]. Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk> [1]: https://lore.kernel.org/all/86013fcc38e582ab89b9b7e4864cc1bd@codethink.co.uk/ [2]: https://lore.kernel.org/all/20250725152804.14224-1-yurand2000@gmail.com/
© 2016 - 2025 Red Hat, Inc.