[PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP

Vineeth Pillai posted 5 patches 2 years, 9 months ago
Posted by Vineeth Pillai 2 years, 9 months ago
In a multi-processor system, bandwidth usage is divided equally among
all cpus. This causes issues with reclaiming free bandwidth on a cpu:
"Uextra" is the same on all cpus in a root domain, while running_bw
differs based on the reserved bandwidth of the tasks running on each
cpu. This causes disproportionate reclaiming - a task with less
bandwidth reclaims less even if it is the only task running on that cpu.

The following is a small test with three tasks with reservations (8,10),
(1,10) and (1,100). These three tasks run on different cpus. But since
the reclamation logic calculates available bandwidth as a fraction of
the globally available bandwidth, tasks with less bandwidth reclaim
only a little compared to higher-bandwidth tasks, even if the cpu has
free bandwidth available to be reclaimed.

TID[730]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.05
TID[731]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 31.34
TID[732]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 3.16

Fix: use the available bandwidth on each cpu to calculate the
reclaimable bandwidth. Admission control takes care of the total
bandwidth, and hence using the available bandwidth on a specific cpu
does not break the deadline guarantees.

With this fix, the above test behaves as follows:
TID[586]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.24
TID[585]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 95.01
TID[584]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.01

Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
---
 kernel/sched/deadline.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 91451c1c7e52..85902c4c484b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1272,7 +1272,7 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
  *	Umax:		Max usable bandwidth for DL. Currently
  *			= sched_rt_runtime_us / sched_rt_period_us
  *	Uextra:		Extra bandwidth not reserved:
- *			= Umax - \Sum(u_i / #cpus in the root domain)
+ *			= Umax - this_bw
  *	u_i:		Bandwidth of an admitted dl task in the
  *			root domain.
  *
@@ -1286,22 +1286,14 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
  */
 static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
 {
-	u64 u_act;
-	u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
-
 	/*
-	 * Instead of computing max{u, (rq->dl.max_bw - u_inact - u_extra)},
-	 * we compare u_inact + rq->dl.extra_bw with
-	 * rq->dl.max_bw - u, because u_inact + rq->dl.extra_bw can be larger
-	 * than rq->dl.max_bw (so, rq->dl.max_bw - u_inact - rq->dl.extra_bw
-	 * would be negative leading to wrong results)
+	 * max{u, Umax - Uinact - Uextra}
+	 * = max{u, max_bw - (this_bw - running_bw) - (max_bw - this_bw)}
+	 * = max{u, running_bw} = running_bw
+	 * So dq = -(max{u, Umax - Uinact - Uextra} / Umax) dt
+	 *       = -(running_bw / max_bw) dt
 	 */
-	if (u_inact + rq->dl.extra_bw > rq->dl.max_bw - dl_se->dl_bw)
-		u_act = dl_se->dl_bw;
-	else
-		u_act = rq->dl.max_bw - u_inact - rq->dl.extra_bw;
-
-	return div64_u64(delta * u_act, rq->dl.max_bw);
+	return div64_u64(delta * rq->dl.running_bw, rq->dl.max_bw);
 }
 
 /*
-- 
2.40.1
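For reference, the before/after computation can be modeled in user space (a stand-alone sketch, not kernel code; plain 64-bit division stands in for div64_u64(), and the example numbers in the test are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define BW_SHIFT 20			/* bandwidths are Q20 fixed point */

/* Old formula: u_act depends on the root-domain-wide extra_bw, so a
 * small-bandwidth task can reclaim little even on an otherwise idle cpu. */
static uint64_t grub_reclaim_old(uint64_t delta, uint64_t dl_bw,
				 uint64_t this_bw, uint64_t running_bw,
				 uint64_t extra_bw, uint64_t max_bw)
{
	uint64_t u_inact = this_bw - running_bw;	/* Utot - Uact */
	uint64_t u_act;

	if (u_inact + extra_bw > max_bw - dl_bw)
		u_act = dl_bw;
	else
		u_act = max_bw - u_inact - extra_bw;

	return delta * u_act / max_bw;
}

/* Patched formula: dq = -(running_bw / max_bw) dt, purely per-cpu. */
static uint64_t grub_reclaim_new(uint64_t delta, uint64_t running_bw,
				 uint64_t max_bw)
{
	return delta * running_bw / max_bw;
}
```

With a lone (1,10) task (dl_bw = running_bw = (1 << 20) / 10 = 104857) and max_bw = (95 << 20) / 100 = 996147, the patched formula depletes runtime with factor ~0.105, so the task runs ~9.5ms of every 10ms period; the old formula's factor depends on the root-domain extra_bw and is generally larger, so the runtime depletes faster and less time is reclaimed.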
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by Dietmar Eggemann 2 years, 8 months ago
Hi Vineeth,

On 15/05/2023 04:57, Vineeth Pillai wrote:
> In a multi-processor system, bandwidth usage is divided equally to
> all cpus. This causes issues with reclaiming free bandwidth on a cpu.
> "Uextra" is same on all cpus in a root domain and running_bw would be
> different based on the reserved bandwidth of tasks running on the cpu.
> This causes disproportionate reclaiming - task with lesser bandwidth
> reclaims less even if its the only task running on that cpu.
> 
> Following is a small test with three tasks with reservations (8,10)
> (1,10) and (1, 100). These three tasks run on different cpus. But
> since the reclamation logic calculates available bandwidth as a factor
> of globally available bandwidth, tasks with lesser bandwidth reclaims
> only little compared to higher bandwidth even if cpu has free and
> available bandwidth to be reclaimed.
> 
> TID[730]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.05
> TID[731]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 31.34
> TID[732]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 3.16

What does this 'Util: X' value stand for? I assume it's the utilization
of the task? How do you obtain it?

I see that e.g. TID[731] should run 1ms each 10ms w/o grub and with grub
the runtime could be potentially longer since 'scaled_delta_exec < delta'.

I don't get this comment in update_curr_dl():

1325    /*
1326     * For tasks that participate in GRUB, we implement GRUB-PA: the
1327     * spare reclaimed bandwidth is used to clock down frequency.
1328     *

It looks like dl_se->runtime is affected and with 'scaled_delta_exec <
delta' the task runs longer than dl_se->dl_runtime?

> Fix: use the available bandwidth on each cpu to calculate reclaimable
> bandwidth. Admission control takes care of total bandwidth and hence
> using the available bandwidth on a specific cpu would not break the
> deadline guarentees.
> 
> With this fix, the above test behaves as follows:
> TID[586]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 95.24
> TID[585]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 95.01
> TID[584]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.01
> 
> Signed-off-by: Vineeth Pillai (Google) <vineeth@bitbyteword.org>
> ---
>  kernel/sched/deadline.c | 22 +++++++---------------
>  1 file changed, 7 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 91451c1c7e52..85902c4c484b 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1272,7 +1272,7 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
>   *	Umax:		Max usable bandwidth for DL. Currently
>   *			= sched_rt_runtime_us / sched_rt_period_us
>   *	Uextra:		Extra bandwidth not reserved:
> - *			= Umax - \Sum(u_i / #cpus in the root domain)
> + *			= Umax - this_bw
>   *	u_i:		Bandwidth of an admitted dl task in the
>   *			root domain.
>   *
> @@ -1286,22 +1286,14 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
>   */
>  static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
>  {
> -	u64 u_act;
> -	u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
> -
>  	/*
> -	 * Instead of computing max{u, (rq->dl.max_bw - u_inact - u_extra)},
> -	 * we compare u_inact + rq->dl.extra_bw with
> -	 * rq->dl.max_bw - u, because u_inact + rq->dl.extra_bw can be larger
> -	 * than rq->dl.max_bw (so, rq->dl.max_bw - u_inact - rq->dl.extra_bw
> -	 * would be negative leading to wrong results)
> +	 * max{u, Umax - Uinact - Uextra}
> +	 * = max{u, max_bw - (this_bw - running_bw) - (max_bw - this_bw)}
> +	 * = max{u, running_bw} = running_bw
> +	 * So dq = -(max{u, Umax - Uinact - Uextra} / Umax) dt
> +	 *       = -(running_bw / max_bw) dt
>  	 */
> -	if (u_inact + rq->dl.extra_bw > rq->dl.max_bw - dl_se->dl_bw)
> -		u_act = dl_se->dl_bw;
> -	else
> -		u_act = rq->dl.max_bw - u_inact - rq->dl.extra_bw;
> -
> -	return div64_u64(delta * u_act, rq->dl.max_bw);
> +	return div64_u64(delta * rq->dl.running_bw, rq->dl.max_bw);

I did the test discussed later in this thread with:

3 [3/100] tasks (dl_se->dl_bw = (3 << 20)/100 = 31457) on 3 CPUs

factor = scaled_delta_exec/delta

- existing grub

rq->dl.bw_ratio = ( 100 << 8 ) / 95 = 269
rq->dl.extra_bw = ( 95 << 20 ) / 100 = 996147

cpu=2 curr->[thread0-2 1715] delta=2140100 this_bw=31457
running_bw=31457 extra_bw=894788 u_inact=0 u_act_min=33054 u_act=153788
scaled_delta_exec=313874 factor=0.14

- your solution patch [1-2]

cpu=2 curr->[thread0-0 1676] delta=157020 running_bw=31457 max_bw=996147
res=4958 factor=0.03
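(For readers following the fixed-point arithmetic: these constants can be double-checked in user space. BW_SHIFT = 20 for bandwidths and RATIO_SHIFT = 8 for bw_ratio, as in kernel/sched/deadline.c.)

```c
#include <assert.h>
#include <stdint.h>

/* bw_ratio: period/runtime ratio in Q8 fixed point */
static uint64_t to_ratio_q8(uint64_t period, uint64_t runtime)
{
	return (period << 8) / runtime;
}

/* bandwidth: runtime/period ratio in Q20 fixed point */
static uint64_t to_bw_q20(uint64_t runtime, uint64_t period)
{
	return (runtime << 20) / period;
}
```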

You say that GRUB calculation is inaccurate and that this inaccuracy
gets larger as the bandwidth of tasks becomes smaller.

Could you explain this inaccuracy on this example?
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by Vineeth Remanan Pillai 2 years, 8 months ago
Hi Dietmar,

On Fri, May 19, 2023 at 1:56 PM Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:

> > TID[730]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.05
> > TID[731]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 31.34
> > TID[732]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 3.16
>
> What does this 'Util: X' value stand for? I assume it's the utilization
> of the task? How do you obtain it?
>
Yes, it is the utilization of the task. I calculate it by dividing the
cputime by the elapsed time (using clock_gettime(2)).

> I see that e.g. TID[731] should run 1ms each 10ms w/o grub and with grub
> the runtime could be potentially longer since 'scaled_delta_exec < delta'.
>
Yes, correct. The GRUB (Greedy Reclamation of Unused Bandwidth)
algorithm is used here for deadline tasks that need to run longer than
their runtime when needed. sched_setattr allows a flag,
SCHED_FLAG_RECLAIM, to indicate that the task would like to reclaim
unused bandwidth of a cpu if available. For those tasks, 'runtime' is
depleted using the GRUB formula, which allows them to run longer and
reclaim the free bandwidth of the cpu. The GRUB implementation in
Linux allows a task to reclaim up to the RT capacity (95%), depending
on the free bandwidth of the cpu. So TID[731] should theoretically run
95% of the time as it is the only task on the cpu, but it doesn't get
to run that long.
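That depletion can be sketched in user space (a hypothetical model, not kernel code; a periodic accounting tick and the per-cpu factor running_bw/max_bw from this patch are assumed):

```c
#include <assert.h>
#include <stdint.h>

/* Deplete a runtime budget by the GRUB-scaled delta each tick and report
 * how much wall-clock time passes before the task is throttled. */
static uint64_t wall_ns_until_throttle(uint64_t runtime_ns, uint64_t tick_ns,
				       uint64_t running_bw, uint64_t max_bw)
{
	uint64_t wall = 0;

	while (runtime_ns > 0) {
		/* scaled_delta_exec, as computed by grub_reclaim() */
		uint64_t dec = tick_ns * running_bw / max_bw;

		runtime_ns = runtime_ns > dec ? runtime_ns - dec : 0;
		wall += tick_ns;
	}
	return wall;
}
```

For a lone (1,10) task, running_bw = (1 << 20) / 10 = 104857 and max_bw = 996147, so the 1ms budget lasts about 9.5ms of each 10ms period, i.e. ~95% utilization.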

> I don't get this comment in update_curr_dl():
>
> 1325    /*
> 1326     * For tasks that participate in GRUB, we implement GRUB-PA: the
> 1327     * spare reclaimed bandwidth is used to clock down frequency.
> 1328     *
>
> It looks like dl_se->runtime is affected and with 'scaled_delta_exec <
> delta' the task runs longer than dl_se->dl_runtime?
>
Yes. As mentioned above, GRUB allows the task to run longer by slowing
down the depletion of "dl_se->runtime". scaled_delta_exec is
calculated by the GRUB formula explained in the papers [1] & [2].

> I did the test discussed later in this thread with:
>
> 3 [3/100] tasks (dl_se->dl_bw = (3 << 20)/100 = 31457) on 3 CPUs
>
> factor = scaled_delta_exec/delta
>
> - existing grub
>
> rq->dl.bw_ratio = ( 100 << 8 ) / 95 = 269
> rq->dl.extra_bw = ( 95 << 20 ) / 100 = 996147
>
> cpu=2 curr->[thread0-2 1715] delta=2140100 this_bw=31457
> running_bw=31457 extra_bw=894788 u_inact=0 u_act_min=33054 u_act=153788
> scaled_delta_exec=313874 factor=0.14
>
> - your solution patch [1-2]
>
> cpu=2 curr->[thread0-0 1676] delta=157020 running_bw=31457 max_bw=996147
> res=4958 factor=0.03
>
> You say that GRUB calculation is inaccurate and that this inaccuracy
> gets larger as the bandwidth of tasks becomes smaller.
>
> Could you explain this inaccuracy on this example?
>
According to GRUB, we should be able to reclaim the unused bandwidth
for the running task up to the RT limit (95%). In this example we have
a task with 3ms runtime and 100ms period on a cpu. So it is supposed
to run for 95ms before it is throttled.
The existing implementation's factor is 0.14, and the 3ms runtime is
depleted by this factor. So it gets to run for "3 / 0.14 ~= 22ms".
This is the inaccuracy that the patch is trying to solve. With the
patch, the factor is 0.03166 and runtime = "3 / 0.03166 ~= 95ms".

Hope this clarifies.

Thanks,
Vineeth

[1]: Abeni, Luca & Lipari, Giuseppe & Parri, Andrea & Sun, Youcheng.
     (2015). Parallel and sequential reclaiming in multicore
     real-time global scheduling.

[2]: G. Lipari and S. Baruah, "Greedy reclamation of unused bandwidth
     in constant-bandwidth servers," Proceedings 12th Euromicro
     Conference on Real-Time Systems. Euromicro RTS 2000, Stockholm,
     Sweden, 2000, pp. 193-200, doi: 10.1109/EMRTS.2000.854007.
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by Dietmar Eggemann 2 years, 8 months ago
Hi Vineeth,

On 20/05/2023 04:15, Vineeth Remanan Pillai wrote:
> Hi Dietmar,
> 
> On Fri, May 19, 2023 at 1:56 PM Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
> 
>>> TID[730]: RECLAIM=1, (r=8ms, d=10ms, p=10ms), Util: 95.05
>>> TID[731]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 31.34
>>> TID[732]: RECLAIM=1, (r=1ms, d=100ms, p=100ms), Util: 3.16
>>
>> What does this 'Util: X' value stand for? I assume it's the utilization
>> of the task? How do you obtain it?
>>
> Yes, it is the utilization of the task. I calculate it by dividing the
> cputime with elapsed time(using clock_gettime(2)).

Makes sense. I guess what I missed here in the first place is the fact
that those DL tasks want to run 100% of the time.

>> I see that e.g. TID[731] should run 1ms each 10ms w/o grub and with grub
>> the runtime could be potentially longer since 'scaled_delta_exec < delta'.
>>
> Yes correct. GRUB(Greedy Reclamation of Unused Bandwidth) algorithm
> is used here for deadline tasks that needs to run longer than their
> runtime when needed. sched_setattr allows a flag SCHED_FLAG_RECLAIM
> to indicate that the task would like to reclaim unused bandwidth of a
> cpu if available. For those tasks, 'runtime' is depreciated using the
> GRUB formula and it allows it to run for longer and reclaim the free
> bandwidth of the cpu. The GRUB implementation in linux allows a task
> to reclaim upto RT capacity(95%) and depends on the free bandwidth
> of the cpu. So TID[731] theoretically should run for 95ms as it is
> the only task in the cpu, but it doesn't get to run that long.

Correct.

>> I don't get this comment in update_curr_dl():
>>
>> 1325    /*
>> 1326     * For tasks that participate in GRUB, we implement GRUB-PA: the
>> 1327     * spare reclaimed bandwidth is used to clock down frequency.
>> 1328     *
>>
>> It looks like dl_se->runtime is affected and with 'scaled_delta_exec <
>> delta' the task runs longer than dl_se->dl_runtime?
>>
> Yes. As mentioned above, GRUB allows the task to run longer by slowing
> down the depreciation of "dl_se->dl_runtime". scaled_delta_exec is
> calculated by the GRUB formula explained in the paper [1] & [2].

What I didn't understand was this `GRUB-PA` and `the spare reclaimed
bandwidth is used to clock down frequency` in relation to GRUB task
runtime depletion.

But now I think I get it. `GRUB-PA` means that in case we run with the
schedutil CPUfreq governor, the CPU frequency is influenced by Uact
(rq->dl.running_bw) via:

sugov_get_util() -> effective_cpu_util() -> cpu_bw_dl() ->

      return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;

and on top of this we do GRUB reclaiming for those SCHED_FLAG_RECLAIM
tasks, i.e. task runtime depletion.
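In other words (a user-space restatement of that hook, assuming SCHED_CAPACITY_SCALE = 1024 and BW_SHIFT = 20):

```c
#include <assert.h>
#include <stdint.h>

/* Uact mapped onto the 0..1024 capacity scale, as cpu_bw_dl() does;
 * schedutil uses this as the DL contribution when picking a frequency. */
static uint64_t cpu_bw_dl(uint64_t running_bw)
{
	return (running_bw * 1024) >> 20;
}
```

So a cpu whose running_bw is the full 95% (996147 in Q20) requests ~972/1024 of capacity, while one running only the lone (1,10) task (104857) requests ~102/1024 - the reclaimed time is run at the correspondingly lower frequency.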

>> I did the test discussed later in this thread with:
>>
>> 3 [3/100] tasks (dl_se->dl_bw = (3 << 20)/100 = 31457) on 3 CPUs
>>
>> factor = scaled_delta_exec/delta
>>
>> - existing grub
>>
>> rq->dl.bw_ratio = ( 100 << 8 ) / 95 = 269
>> rq->dl.extra_bw = ( 95 << 20 ) / 100 = 996147
>>
>> cpu=2 curr->[thread0-2 1715] delta=2140100 this_bw=31457
>> running_bw=31457 extra_bw=894788 u_inact=0 u_act_min=33054 u_act=153788
>> scaled_delta_exec=313874 factor=0.14
>>
>> - your solution patch [1-2]
>>
>> cpu=2 curr->[thread0-0 1676] delta=157020 running_bw=31457 max_bw=996147
>> res=4958 factor=0.03
>>
>> You say that GRUB calculation is inaccurate and that this inaccuracy
>> gets larger as the bandwidth of tasks becomes smaller.
>>
>> Could you explain this inaccuracy on this example?
>>
> According to GRUB, we should be able to reclaim the unused bandwidth
> for the running task upto RT limits(95%). In this example we have a
> task with 3ms runtime and 100ms runtime on a cpu. So it is supposed
> to run for 95ms before it is throttled.

Correct.

> Existing implementation's factor = 0.14 and 3ms is depreciated by
> this factor. So it gets to run for "3 / 0.14 ~= 22ms". This is the
> inaccuracy that the patch is trying to solve. With the patch, the
> factor is .03166 and runtime = "3 / 0.03166 ~= 95ms"

My tests were wrong since I was using a DL task with dl_runtime=3ms
and dl_period=100ms whose actual runtime was only 3ms, whereas your
tasks probably want to run 100% of the time.

> Hope this clarifies.

yes, it did, thanks!

Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by luca abeni 2 years, 9 months ago
Hi,

this patch is giving me some headaches:

On Sun, 14 May 2023 22:57:13 -0400
Vineeth Pillai <vineeth@bitbyteword.org> wrote:
[...]
>   *	Uextra:		Extra bandwidth not reserved:
> - *			= Umax - \Sum(u_i / #cpus in the root domain)
> + *			= Umax - this_bw

While I agree that this setting should be OK, it ends up with
	dq = -Uact / Umax * dt
which I remember I originally tried, and it gave some issues
(I do not remember the details, but I think if you try N
identical reclaiming tasks, with N > M, where M is the number
of cpus, the reclaimed time is not distributed equally among them?)

I need to think a little bit more about this...


		Luca

>   *	u_i:		Bandwidth of an admitted dl task in the
>   *			root domain.
>   *
> @@ -1286,22 +1286,14 @@ int dl_runtime_exceeded(struct sched_dl_entity *dl_se)
>   */
>  static u64 grub_reclaim(u64 delta, struct rq *rq, struct sched_dl_entity *dl_se)
>  {
> -	u64 u_act;
> -	u64 u_inact = rq->dl.this_bw - rq->dl.running_bw; /* Utot - Uact */
> -
>  	/*
> -	 * Instead of computing max{u, (rq->dl.max_bw - u_inact - u_extra)},
> -	 * we compare u_inact + rq->dl.extra_bw with
> -	 * rq->dl.max_bw - u, because u_inact + rq->dl.extra_bw can be larger
> -	 * than rq->dl.max_bw (so, rq->dl.max_bw - u_inact - rq->dl.extra_bw
> -	 * would be negative leading to wrong results)
> +	 * max{u, Umax - Uinact - Uextra}
> +	 * = max{u, max_bw - (this_bw - running_bw) - (max_bw - this_bw)}
> +	 * = max{u, running_bw} = running_bw
> +	 * So dq = -(max{u, Umax - Uinact - Uextra} / Umax) dt
> +	 *       = -(running_bw / max_bw) dt
>  	 */
> -	if (u_inact + rq->dl.extra_bw > rq->dl.max_bw - dl_se->dl_bw)
> -		u_act = dl_se->dl_bw;
> -	else
> -		u_act = rq->dl.max_bw - u_inact - rq->dl.extra_bw;
> -
> -	return div64_u64(delta * u_act, rq->dl.max_bw);
> +	return div64_u64(delta * rq->dl.running_bw, rq->dl.max_bw);
>  }
>  
>  /*
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by Vineeth Remanan Pillai 2 years, 8 months ago
Hi Luca,

On Mon, May 15, 2023 at 4:06 AM luca abeni <luca.abeni@santannapisa.it> wrote:

>
> this patch is giving me some headaches:
>
Sorry about that.. I was also stressing out on how to get the
reclaiming done right for the past couple of days ;-)

> Vineeth Pillai <vineeth@bitbyteword.org> wrote:
> [...]
> >   *   Uextra:         Extra bandwidth not reserved:
> > - *                   = Umax - \Sum(u_i / #cpus in the root domain)
> > + *                   = Umax - this_bw
>
> While I agree that this setting should be OK, it ends up with
>         dq = -Uact / Umax * dt
> which I remember I originally tried, and gave some issues
> (I do not remember the details, but I think if you try N
> identical reclaiming tasks, with N > M, the reclaimed time
> is not distributed equally among them?)
>
I have noticed this behaviour where the reclaimed time is not equally
distributed when we have more tasks than available processors. But it
depended on where the task was scheduled. Within the same cpu, the
distribution seemed to be proportional. But the tasks migrated often
and then depending on whether the task got a whole cpu for its
runtime or not, the reclaimed bandwidth differed. I thought that
should be okay as it depended upon where the task landed.

One other problem I saw was cpu usage spiking above max_bw, leading to
a system hang sometimes. I thought stopping reclaiming when running_bw
gets larger than max_bw (in the 4th patch) fixed this, but when I ran
the tests long enough, I still saw the hang.

> I need to think a little bit more about this...
>
Thanks for looking into this.. I have a basic idea why tasks with less
bandwidth reclaim less in SMP when number of tasks is less than number
of cpus, but do not yet have a verifiable fix for it.

If patches 1 and 4 look good to you, we shall drop 2 and 3 and fix the
SMP issue with varying bandwidth separately. Patch 4 would differ a
bit when I remove 2 and 3, so as to use the formula:
 "dq = -(max{u, (Umax_reclaim - Uinact - Uextra)} / Umax_reclaim) dt"

Thanks for your patience with all these brainstorming:-)

Thanks,
Vineeth
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by luca abeni 2 years, 8 months ago
On Mon, 15 May 2023 21:47:03 -0400
Vineeth Remanan Pillai <vineeth@bitbyteword.org> wrote:

> Hi Luca,
> 
> On Mon, May 15, 2023 at 4:06 AM luca abeni
> <luca.abeni@santannapisa.it> wrote:
> 
> >
> > this patch is giving me some headaches:
> >  
> Sorry about that.. I was also stressing out on how to get the
> reclaiming done right for the past couple of days ;-)

Well, this math is hard... :)

> > Vineeth Pillai <vineeth@bitbyteword.org> wrote:
> > [...]  
> > >   *   Uextra:         Extra bandwidth not reserved:
> > > - *                   = Umax - \Sum(u_i / #cpus in the root
> > > domain)
> > > + *                   = Umax - this_bw  
> >
> > While I agree that this setting should be OK, it ends up with
> >         dq = -Uact / Umax * dt
> > which I remember I originally tried, and gave some issues
> > (I do not remember the details, but I think if you try N
> > identical reclaiming tasks, with N > M, the reclaimed time
> > is not distributed equally among them?)
> >  
> I have noticed this behaviour where the reclaimed time is not equally
> distributed when we have more tasks than available processors. But it
> depended on where the task was scheduled. Within the same cpu, the
> distribution seemed to be proportional.

Yes, as far as I remember it is due to migrations. IIRC, the problem is
related to the fact that using "dq = -Uact / Umax * dt" a task running
on a core might end up trying to reclaim some idle time from other
cores (which is obviously not possible).
This is why m-GRUB used "1 - Uinact" instead of "Uact".

[...]
> > I need to think a little bit more about this...
> >  
> Thanks for looking into this.. I have a basic idea why tasks with less
> bandwidth reclaim less in SMP when number of tasks is less than number
> of cpus, but do not yet have a verifiable fix for it.

I think I can now understand at least part of the problem. In my
understanding, the problem is due to using
	dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt

It should really be
	dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt

(since we divide by Umax, using "Umax - ..." will lead to reclaiming up
to "Umax / Umax" = 1)

Did you try this equation?

I'll write more about this later... And thanks for coping with all my
comments!


				Luca
> 
> If patches 1 and 4 looks good to you, we shall drop 2 and 3 and fix
> the SMP issue with varying bandwidth separately.. Patch 4 would
> differ a bit when I remove 2 and 3 so as to use the formula:
>  "dq = -(max{u, (Umax_reclaim - Uinact - Uextra)} / Umax_reclaim) dt"
> 
> Thanks for your patience with all these brainstorming:-)
> 
> Thanks,
> Vineeth
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by Vineeth Remanan Pillai 2 years, 8 months ago
Hi Luca,

On Tue, May 16, 2023 at 3:37 AM luca abeni <luca.abeni@santannapisa.it> wrote:
> > I have noticed this behaviour where the reclaimed time is not equally
> > distributed when we have more tasks than available processors. But it
> > depended on where the task was scheduled. Within the same cpu, the
> > distribution seemed to be proportional.
>
> Yes, as far as I remember it is due to migrations. IIRC, the problem is
> related to the fact that using "dq = -Uact / Umax * dt" a task running
> on a core might end up trying to reclaim some idle time from other
> cores (which is obviously not possible).
> This is why m-GRUB used "1 - Uinact" instead of "Uact"
>
This is what I was a little confused about. In "-Uact / Umax", all
the variables are per-cpu and it should only be reclaiming what is
free on the cpu, right? And when migration happens, Uact changes
and the reclaiming adapts itself. I was thinking it should probably
be okay for tasks to reclaim differently based on what free bw is
left on the cpu they are running on. For example: if cpu 1 has two
tasks of bw .3 each, each task can reclaim "(.95 - .6) / 2", and
another cpu with only one task (.3 bandwidth) reclaims "(.95 - .3)".
So both cpus' utilization is .95 and tasks reclaim what is available
on the cpu.

With "1 - Uinact", where Uinact accounts for a portion of global free
bandwidth, tasks reclaim proportionately to the global free bandwidth
and this causes tasks with lesser bandwidth to reclaim lesser when
compared to higher bandwidth tasks even if they don't share the cpu.
This is what I was seeing in practice.

But with "Uact / Umax", Uact can be greater than Umax, and this caused
some issues like tasks not getting their reserved bandwidth. For
example: with 4 tasks of (7,10) on a 3-cpu system, one cpu can have a
Uact of 1.4 and scaled_delta greater than delta. This causes runtime
to deplete faster until one task is migrated. But after migration, the
target cpu will have this problem. So "Uact / Umax" was not working
in close-to-overcommit situations.

In summary, "1 - Uinact" reclaims much less, while "Uact / Umax" has
issues when overcommitting tasks with high bandwidth. This is what I
understood from experiments and reading.
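The overcommit hazard can be checked numerically (a user-space sketch in Q20 fixed point; the scenario of two of the four (7,10) tasks packed on one cpu is assumed from the example above): once Uact exceeds Umax, the scaled delta exceeds the wall-clock delta, so the runtime budget depletes faster than real time.

```c
#include <assert.h>
#include <stdint.h>

#define BW_UNIT (1ULL << 20)	/* Q20 fixed point */

/* scaled_delta under "dq = -(Uact / Umax) dt" */
static uint64_t scaled_delta(uint64_t delta, uint64_t running_bw,
			     uint64_t max_bw)
{
	return delta * running_bw / max_bw;
}
```

With two (7,10) tasks on one cpu, Uact = 2 * 734003 (~1.4 in Q20), above max_bw = 996147, so each tick depletes ~1.47x its wall-clock length and a task can be throttled before receiving its reserved runtime.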

> [...]
> > > I need to think a little bit more about this...
> > >
> > Thanks for looking into this.. I have a basic idea why tasks with less
> > bandwidth reclaim less in SMP when number of tasks is less than number
> > of cpus, but do not yet have a verifiable fix for it.
>
> I think I can now understand at least part of the problem. In my
> understanding, the problem is due to using
>         dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt
>
> It should really be
>         dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt
>
> (since we divide by Umax, using "Umax - ..." will lead to reclaiming up
> to "Umax / Umax" = 1)
>
> Did you try this equation?
>
I had tested this and it was reclaiming much less compared to the first one.
I had 3 tasks with reservation (3,100) and 3 cpus.

With dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt (1)
TID[636]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.08
TID[635]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.07
TID[637]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.06

With dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt (2)
TID[601]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
TID[600]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
TID[602]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65

As the task bandwidth goes higher, equation (2) reclaims more, but
equation (1) reclaims a constant 95% as long as the number of tasks
is less than the number of cpus. If the number of tasks is more than
the cpus, eq (2) fares better in reclaiming than eq (1)

eq (1) with 5 tasks (3,100)
TID[627]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.64
TID[626]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.64
TID[629]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.62
TID[628]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 29.00
TID[630]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.99

Here top shows 3 cpus in the range ~45 to 50% util

eq (2) with 5 tasks (3,100)
TID[667]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.20
TID[670]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.79
TID[668]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.11
TID[666]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 56.34
TID[669]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 55.82

And here top shows all 3 cpus with 95% util

> I'll write more about this later... And thanks for coping with all my
> comments!
>
Thanks :-)

Vineeth
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by luca abeni 2 years, 8 months ago
Hi,

sorry for returning on this discussion, but there is something I still
do not understand:

On Tue, 16 May 2023 11:08:18 -0400
Vineeth Remanan Pillai <vineeth@bitbyteword.org> wrote:
[...]  
> I had tested this and it was reclaiming much less compared to the
> first one. I had 3 tasks with reservation (3,100) and 3 cpus.

So, just to confirm: here you have only 3 SCHED_DEADLINE tasks,
scheduled on a root domain containing only 3 CPUs (dl_bw_cpus() return
3)... Right?
So, the utilization of each task is 3/100 = 0.03 and Uextra is
1 - (0.03 * 3) / 3 = 0.97.
And since all the tasks are always active, Uinact = 0...
Is this understanding right?

If so:
> With dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt (1)
> TID[636]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.08
> TID[635]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.07
> TID[637]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.06
> 
> With dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt (2)
> TID[601]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> TID[600]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> TID[602]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65

Here, we should have
	dq = -(max{0.03, (1 - 0 - 0.97)} / Umax) * dt
	   = -(0.03 / Umax) * dt
which reclaims up to Umax... So, the utilization should be 95%
Since you measured 35.65%, it means that (1-Uextra) is much larger
than 0.97... So, maybe you found some bug in the Uextra computation?

Can you try printing the extra_bw value, to check what happened?



			Thanks,
				Luca

> 
> As the task bandwidth goes higher, equation (2) reclaims more, but
> equation (2) is a constant of 95% as long as number of tasks less
> than cpus. If the number of tasks is more than cpus, eq (2) fares
> better in reclaiming than eq (1)
> 
> eq (1) with 5 tasks (3,100)
> TID[627]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.64
> TID[626]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.64
> TID[629]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.62
> TID[628]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 29.00
> TID[630]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.99
> 
> Here top shows 3 cpus in the range ~45 to 50% util
> 
> eq (2) with 5 tasks (3,100)
> TID[667]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.20
> TID[670]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.79
> TID[668]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.11
> TID[666]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 56.34
> TID[669]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 55.82
> 
> And here top shows all 3 cpus with 95% util
> 
> > I'll write more about this later... And thanks for coping with all
> > my comments!
> >  
> Thanks :-)
> 
> Vineeth
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by luca abeni 2 years, 8 months ago
On Fri, 19 May 2023 11:56:21 +0200
luca abeni <luca.abeni@santannapisa.it> wrote:

> Hi,
> 
> sorry for returning on this discussion, but there is something I still
> do not understand:
> 
> On Tue, 16 May 2023 11:08:18 -0400
> Vineeth Remanan Pillai <vineeth@bitbyteword.org> wrote:
> [...]  
> > I had tested this and it was reclaiming much less compared to the
> > first one. I had 3 tasks with reservation (3,100) and 3 cpus.  
> 
> So, just to confirm: here you have only 3 SCHED_DEADLINE tasks,
> scheduled on a root domain containing only 3 CPUs (dl_bw_cpus() return
> 3)... Right?
> So, the utilization of each task is 3/100 = 0.03 and Uextra is
> 1 - (0.03 * 3) / 3 = 0.97.

OK, sorry again... I found my error immediately after sending the email.
Uextra is computed as "Umax - ...", not "1 - ...".
So, I now understand where the 35% comes from.

I now _suspect_ the correct equation should be
	dq = -(max{u_i / Umax, (Umax - Uinact - Uextra)}) * dt
but I want to test it before wasting your time again; I'll write more
after performing some more tests.


			Luca

> And since all the tasks are always active, Uinact = 0...
> Is this understanding right?
> 
> If so:
> > With dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt (1)
> > TID[636]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.08
> > TID[635]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.07
> > TID[637]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.06
> > 
> > With dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt (2)
> > TID[601]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> > TID[600]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> > TID[602]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65  
> 
> Here, we should have
> 	dq = -(max{0.03, (1 - 0 - 0.97)} / Umax) * dt
> 	   = -(0.03 / Umax) * dt
> which reclaims up to Umax... So, the utilization should be 95%
> Since you measured 35.65%, it means that (1-Uextra) is much larger
> than 0.97... So, maybe you found some bug in the Uextra computation?
> 
> Can you try printing the extra_bw value, to check what happened?
> 
> 
> 
> 			Thanks,
> 				Luca
> 
> > 
> > As the task bandwidth goes higher, equation (2) reclaims more, but
> > equation (1) reclaims a constant 95% as long as the number of tasks
> > is less than the number of cpus. If the number of tasks is more than
> > the cpus, eq (2) fares better in reclaiming than eq (1)
> > 
> > eq (1) with 5 tasks (3,100)
> > TID[627]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.64
> > TID[626]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.64
> > TID[629]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.62
> > TID[628]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 29.00
> > TID[630]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 28.99
> > 
> > Here top shows 3 cpus in the range ~45 to 50% util
> > 
> > eq (2) with 5 tasks (3,100)
> > TID[667]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.20
> > TID[670]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.79
> > TID[668]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 57.11
> > TID[666]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 56.34
> > TID[669]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 55.82
> > 
> > And here top shows all 3 cpus with 95% util
> >   
> > > I'll write more about this later... And thanks for coping with all
> > > my comments!
> > >    
> > Thanks :-)
> > 
> > Vineeth  
>
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by Vineeth Remanan Pillai 2 years, 8 months ago
Hi Luca,

On Fri, May 19, 2023 at 6:18 AM luca abeni <luca.abeni@santannapisa.it> wrote:
>
> On Fri, 19 May 2023 11:56:21 +0200
> luca abeni <luca.abeni@santannapisa.it> wrote:
> [...]
>
> OK, sorry again... I found my error immediately after sending the email.
> Uextra is computed as "Umax - ...", not "1 - ...".
> So, I now understand where the 35% comes from.
>
Thanks for debugging this, it makes sense now!

> I now _suspect_ the correct equation should be
>         dq = -(max{u_i / Umax, (Umax - Uinact - Uextra)}) * dt
> but I want to test it before wasting your time again; I'll write more
> after performing some more tests.
>
I tried this equation and it fixes the above issue. But I am a little
confused as to why we should not be limiting the second term to Umax.
In my testing, I was seeing the issue solved with this equation as
well:
 "dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt"

With both these equations, it doesn't solve a couple of other issues
we had discussed before:
- tasks with different bandwidths reclaim differently even when #tasks
  is less than #cpus.
- cpu util may go to 100% when we have tasks with large bandwidth close
  to Umax

As an example of issue 1, consider three tasks - (7,10), (3,10) and (1,10):
TID[590]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.20
TID[591]: RECLAIM=1, (r=3ms, d=10ms, p=10ms), Util: 81.94
TID[592]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 27.19

Re. issue 2, with four tasks having the same reservation (7,10), the
tasks try to reclaim up to 100% cpu usage on all three cpus, which
leads to a system hang.

I was trying to understand the issue and it looks like the static
values of Uextra and Umax are causing inaccuracies. Uextra is
calculated based on global bandwidth, but based on the local state of
the cpu, we could reclaim more or less than (u_inact +
rq->dl.extra_bw). Similarly, Umax is a local max for each cpu and we
should not be reclaiming up to Umax unconditionally. If the global
load is high, reclaiming up to Umax would cause problems when a
migrating task lands on the cpu.

I was trying an idea to dynamically decide Uextra and Umax based on
global and local load.

The crux of the idea is as below

+	if (rq->dl.running_bw > rq->dl.max_bw)
+		return delta;
+
+	max_bw = rq->dl.this_bw + rq->dl.extra_bw;
+	extra_bw = rq->dl.extra_bw;
+	if (rq->dl.this_bw < rq->dl.extra_bw || max_bw > rq->dl.max_bw) {
+		extra_bw = rq->dl.max_bw - rq->dl.this_bw;
+		max_bw = rq->dl.max_bw;
+	}
+
+

And use this max_bw and extra_bw in the equation:
 "dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt"

The reasoning for the above changes is:
- running_bw can be greater than max_bw on SMP and we should not be
  reclaiming if that's the case.
- Initially we assume max_bw and extra_bw based on the global load.
  If this_bw < extra_bw, it means that the cpus are not heavily loaded
  and incoming migrations can be satisfied with the local cpu's
  available bandwidth, so we can reclaim up to rq->dl.max_bw and use
  "rq->dl.max_bw - rq->dl.this_bw" as extra_bw. Also, we should limit
  max_bw for any cpu to a maximum of rq->dl.max_bw.

This does not consider a mix of normal and SCHED_FLAG_RECLAIM tasks,
and minor changes on top of this are needed to cover that scenario.
This seems to fix all the issues that I have encountered.

The above fix doesn't work on equation:
  "dq = -(max{u_i / Umax, (Umax - Uinact - Uextra)}) * dt"

cpu still spikes to 100% with 4 tasks of (7,10). I am not sure, but I
guess it might be because we are not limiting the second term to Umax.

Please let me know your thoughts on this.

Thanks again,
Vineeth
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by luca abeni 2 years, 8 months ago
Hi again,

On Fri, 19 May 2023 12:12:50 -0400
Vineeth Remanan Pillai <vineeth@bitbyteword.org> wrote:
[...]
> - cpu util may go to 100% when we have tasks with large bandwidth
> close to Umax
> 
> As an eg. for issue 1, three tasks - (7,10) (3,10) and (1,10):
> TID[590]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.20
> TID[591]: RECLAIM=1, (r=3ms, d=10ms, p=10ms), Util: 81.94
> TID[592]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 27.19
> 
> re. issue 2, four tasks with same reservation (7,10), tasks tries
> to reclaim leading to 100% cpu usage on all three cpus and leads to
> system hang.

I just tried to repeat this test on a VM with 3 CPUs, and I can
reproduce the stall (100% of CPU time reclaimed by SCHED_DEADLINE
tasks, with no possibility for the other tasks to execute) when I use
	dq = -(max{u_i / Umax, (Umax - Uinact - Uextra)}) * dt

But when I use
	dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt
everything works as expected, the 4 tasks reclaim 95% of the CPU
time and my shell is still active...
(so, I cannot reproduce the starvation issue with this equation)

So, I now think the second one is the correct equation to be used.



				Luca
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by Vineeth Remanan Pillai 2 years, 8 months ago
Hi Luca,

Merging the last two mails in this reply :-)

> So, we are wasting 181.3 - 95 = 86.3% of CPU time, which 590 cannot
> reclaim (because it cannot execute simultaneously on 2 CPUs).
>
Correct. Thanks for explaining it in detail, I was tracing the scheduler
and verified this pattern you explained.

> Now that the problem is more clear to me, I am trying to understand a
> possible solution (as you mention, moving some extra bandwidth from the
> 590's CPU will fix this problem... But I am not sure if this dynamic
> extra bandwidth migration is feasible in practice without introducing
> too much overhead)
>
> I'll look better at your new proposal.
>
The idea that I mentioned tries to solve this problem in a best effort
way: If global load is high, use the global "Uextra = rq->dl.extra_bw"
and "Umax = rq->dl.this_bw + rq->dl.extra_bw". Otherwise use the local
values "Umax= rq->dl.max_bw", "Uextra= rq->dl.max_bw - rq->dl.this_bw".
This is still not perfect, but it reclaims very close to the maximum
allowed limit almost always.

Please have a look when you get a chance :-).

>
> I just tried to repeat this test on a VM with 3 CPUs, and I can
> reproduce the stall (100% of CPU time reclaimed by SCHED_DEADLINE
> tasks, with no possibility for the other tasks to execute) when I use
>         dq = -(max{u_i / Umax, (Umax - Uinact - Uextra)}) * dt
>
> But when I use
>         dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt
> everything works as expected, the 4 tasks reclaim 95% of the CPU
> time and my shell is still active...
> (so, I cannot reproduce the starvation issue with this equation)
>
Sorry about this confusion, yes you are right, there is no stall with
this equation. The only issue is less reclaiming when the load is
low and tasks have different bandwidth requirements.

> So, I now think the second one is the correct equation to be used.
>
Thanks for confirming.

I think it probably makes sense to get the fix for the equation to go
in as a first step and then we can investigate more about the second
issue (less reclaiming with less load and different bandwidth) and
fix it separately. What do you think? I shall send the next iteration
with the fix for the equation alone if it's okay with you.

Thanks,
Vineeth
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by luca abeni 2 years, 8 months ago
Hi,

sorry for the late reply.

On Mon, 22 May 2023 15:22:52 -0400
Vineeth Remanan Pillai <vineeth@bitbyteword.org> wrote:
[...]
> > But when I use
> >         dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt
> > everything works as expected, the 4 tasks reclaim 95% of the CPU
> > time and my shell is still active...
> > (so, I cannot reproduce the starvation issue with this equation)
> >  
> Sorry about this confusion, yes you are right, there is no stall with
> this equation. The only issue is the lesser reclaim when the load is
> less and tasks have different bandwidth requirements.
> 
> > So, I now think the second one is the correct equation to be used.
> >  
> Thanks for confirming.
> 
> I think it probably makes sense to get the fix for the equation to go
> in as a first step and then we can investigate more about the second
> issue (less reclaiming with less load and different bandwidth) and
> fix it separately. What do you think?

I fully agree. If you split this change in a first patch, IMHO it
can be applied.

BTW, I tried changing the equation without introducing div64, and it
seems to me that it works well... So, if removing the bw_ratio
approximation is needed, I think you can do it in a second patch (so,
the first patch changes the reclaiming equation, and the second one
introduces div64)


			Thanks,
				Luca

> I shall send the next iteration
> with the fix for the equation alone if its okay with you.
> 
> Thanks,
> Vineeth
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by Vineeth Remanan Pillai 2 years, 8 months ago
Hi Luca,

> >
> > I think it probably makes sense to get the fix for the equation to go
> > in as a first step and then we can investigate more about the second
> > issue (less reclaiming with less load and different bandwidth) and
> > fix it separately. What do you think?
>
> I fully agree. If you split this change in a first patch, IMHO it
> can be applied.
>
Thanks, I shall have it as the first patch in the next iteration.

> BTW, I tried changing the equation without introducing div64, and it
> seems to me that it works well... So, if removing the bw_ratio
> approximation is needed, I think you can do it in a second patch (so,
> the first patch changes the reclaiming equation, and the second one
> introduces div64)
>
There was one more reason for removing the bw_ratio. The second patch,
fixing the issue with a mix of normal and SCHED_FLAG_RECLAIM tasks,
does not have a constant Umax, so the bw_ratio cannot be calculated in
advance and has to be calculated in grub_reclaim().

As you suggested, I shall remove the bw_ratio as the second patch and
then have the third patch to fix the case of SCHED_FLAG_RECLAIM and
normal deadline tasks. For the rest of the issues, I shall work on it as
a separate patch series once this goes in.

Will send it out tomorrow after a bit more testing.

Thanks again,
Vineeth
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by Vineeth Remanan Pillai 2 years, 8 months ago
Hi Luca,

On Tue, May 23, 2023 at 10:11 PM Vineeth Remanan Pillai
<vineeth@bitbyteword.org> wrote:
>
> Hi Luca,
>
> [...]
>
> Will send it out tomorrow after a bit more testing.
>
Sorry for the delay. I was trying to fix the inaccuracy when normal
and SCHED_FLAG_RECLAIM tasks are present. It's a bit more complicated
than I initially thought, as Umax_reclaim and Uextra need to be
computed at the root domain level and this would mean more locking. I
have a patch ready, but I think I need to do more testing to
understand the performance implications and it is better to batch it
as a separate series.

So, I am just sending out the GRUB equation fix, which is well tested
and which I am confident about. I will send out the rest of the fixes
later as a separate series.

Thanks,
Vineeth
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by luca abeni 2 years, 8 months ago
On Fri, 26 May 2023 10:54:09 -0400
Vineeth Remanan Pillai <vineeth@bitbyteword.org> wrote:
[...]
> Sorry for the delay. I was trying to fix the inaccuracy when normal
> and SCHED_FLAG_RECLAIM tasks are present. Its a bit complicated than
> I initially thought as Umax_reclaim and Uextra needs to be computed
> at the root domain level and this would mean more locking. I have a
> patch ready but I think I need to do more testing to understand the
> performance implications and it is better to batch it as a separate
> series.
> 
> So, I am just sending out the GRUB equation fix which is tested well
> and I am confident about. Will send out rest of the fixes later as
> a separate series.

Looks good to me. And... Thanks again for taking care of this issue!


			Luca
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by luca abeni 2 years, 8 months ago
Hi,

On Fri, 19 May 2023 12:12:50 -0400
Vineeth Remanan Pillai <vineeth@bitbyteword.org> wrote:
[...]
> With both these equations, it doesn't solve couple of other issues we
> had discussed before:
> - tasks with different bandwidth reclaims differently even when #tasks
>   is less than #cpus.

I think I now understand this issue (see below)


> - cpu util may go to 100% when we have tasks with large bandwidth
> close to Umax

This one is still not clear to me... I'll do some more analysis.


> As an eg. for issue 1, three tasks - (7,10) (3,10) and (1,10):
> TID[590]: RECLAIM=1, (r=7ms, d=10ms, p=10ms), Util: 95.20
> TID[591]: RECLAIM=1, (r=3ms, d=10ms, p=10ms), Util: 81.94
> TID[592]: RECLAIM=1, (r=1ms, d=10ms, p=10ms), Util: 27.19

So, the issue here is that GRUB tries to assign the reclaimed
utilization proportionally to the task's utilizations... So, 591 should
execute 3 times the amount of time executed by 592, and 590 should
execute 7 times the amount of time executed by 592. Task 592 is then
supposed to execute for 95 / (1 + 3 + 7) = 95 / 11 = 8.63% of the CPU
time; task 591 is supposed to execute for 8.63 * 3 = 25.9% of the CPU
time; task 590 is supposed to execute for 8.63 * 7 = 60.64% of the CPU
time.

So, 592 executes for 8.63 * 3 = 25.9% of the time on one single CPU
(you measured 27.19, but this is close), task 591 executes for
25.9 * 3 = 77.72% of the time on one single CPU (again, this is
close to what you measured) and task 590 should execute for...
60.64 * 3 = 181.3% of the time on one single CPU! Which is clearly not
possible... And the "max{}" rule cuts this to 95%.

So, we are wasting 181.3 - 95 = 86.3% of CPU time, which 590 cannot
reclaim (because it cannot execute simultaneously on 2 CPUs).

And this is close to the amount of CPU time not reclaimed in the test
you cite above (95 - 81 + 95 - 27)


Now that the problem is more clear to me, I am trying to understand a
possible solution (as you mention, moving some extra bandwidth from the
590's CPU will fix this problem... But I am not sure if this dynamic
extra bandwidth migration is feasible in practice without introducing
too much overhead)

I'll look better at your new proposal.


			Thanks,
				Luca
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by luca abeni 2 years, 8 months ago
Hi,

On Tue, 16 May 2023 11:08:18 -0400
Vineeth Remanan Pillai <vineeth@bitbyteword.org> wrote:

> Hi Luca,
> 
> On Tue, May 16, 2023 at 3:37 AM luca abeni
> <luca.abeni@santannapisa.it> wrote:
> > > I have noticed this behaviour where the reclaimed time is not
> > > equally distributed when we have more tasks than available
> > > processors. But it depended on where the task was scheduled.
> > > Within the same cpu, the distribution seemed to be proportional.  
> >
> > Yes, as far as I remember it is due to migrations. IIRC, the
> > problem is related to the fact that using "dq = -Uact / Umax * dt"
> > a task running on a core might end up trying to reclaim some idle
> > time from other cores (which is obviously not possible).
> > This is why m-GRUB used "1 - Uinact" instead of "Uact"
> >  
> This is what I was a little confused about. In "-Uact / Umax", all
> the variables are per-cpu and it should only be reclaiming what is
> free on the cpu right? And when migration happens, Uact changes
> and the reclaiming adapts itself.

Sorry, I do not remember the details... But I think the problem is in
the transient when a task migrates from a core to a different one.
I am trying to search through my old notes to see if I find some more
details.


> I was thinking it should probably
> be okay for tasks to reclaim differently based on what free bw is
> left on the cpu it is running. For eg: if cpu 1 has two tasks of bw
> .3 each, each task can reclaim "(.95 - .6) / 2" and another cpu with
> only one task(.3 bandwidth) reclaims (.95 - .3). So both cpus
> utilization is .95 and tasks reclaim what is available on the cpu.

I suspect (but I am not sure) this only works if tasks do not migrate.


> With "1 - Uinact", where Uinact accounts for a portion of global free
> bandwidth, tasks reclaim proportionately to the global free bandwidth
> and this causes tasks with lesser bandwidth to reclaim lesser when
> compared to higher bandwidth tasks even if they don't share the cpu.
> This is what I was seeing in practice.

Just to be sure: is this with the "original" Uextra setting, or with
your new "Uextra = Umax - this_bw" setting?
(I am not sure, but I suspect that "1 - Uinact - Uextra" with your new
definition of Uextra should work well...)


[...]
> > I think I can now understand at least part of the problem. In my
> > understanding, the problem is due to using
> >         dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt
> >
> > It should really be
> >         dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt
> >
> > (since we divide by Umax, using "Umax - ..." will lead to
> > reclaiming up to "Umax / Umax" = 1)
> >
> > Did you try this equation?
> >  
> I had tested this and it was reclaiming much less compared to the
> first one. I had 3 tasks with reservation (3,100) and 3 cpus.
> 
> With dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt (1)
> TID[636]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.08
> TID[635]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.07
> TID[637]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.06
> 
> With dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt (2)
> TID[601]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> TID[600]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> TID[602]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65

Maybe I am missing something and I am misunderstanding the situation,
but my impression was that this is the effect of setting
	Umax - \Sum(u_i / #cpus in the root domain)
I was hoping that with your new Umax setting this problem could be
fixed... I am going to double-check my reasoning.


			Thanks,
				Luca
Re: [PATCH v3 2/5] sched/deadline: Fix reclaim inaccuracy with SMP
Posted by Vineeth Remanan Pillai 2 years, 8 months ago
Hi Luca,

On Tue, May 16, 2023 at 12:19 PM luca abeni <luca.abeni@santannapisa.it> wrote:
>
> > I was thinking it should probably
> > be okay for tasks to reclaim differently based on what free bw is
> > left on the cpu it is running. For eg: if cpu 1 has two tasks of bw
> > .3 each, each task can reclaim "(.95 - .6) / 2" and another cpu with
> > only one task(.3 bandwidth) reclaims (.95 - .3). So both cpus
> > utilization is .95 and tasks reclaim what is available on the cpu.
>
> I suspect (but I am not sure) this only works if tasks do not migrate.
>
From what I am seeing, if the reserved bandwidth of all tasks on a cpu
is less than Umax, then this works. Even with migration, if the task
lands on another cpu where the new running_bw < Umax, then it runs and
reclaims the free bandwidth. But this breaks if running_bw > Umax,
which can happen when total_bw is within limits but a cpu is overloaded.
For eg: four tasks with reservation (7, 10) on a three cpu system.
Here two cpus will have running_bw = .7 but third cpu will be 1.4
even though total_bw = 2.80 which is less than the limit of 2.85.

>
> > With "1 - Uinact", where Uinact accounts for a portion of global free
> > bandwidth, tasks reclaim proportionately to the global free bandwidth
> > and this causes tasks with lesser bandwidth to reclaim lesser when
> > compared to higher bandwidth tasks even if they don't share the cpu.
> > This is what I was seeing in practice.
>
> Just to be sure: is this with the "original" Uextra setting, or with
> your new "Uextra = Umax - this_bw" setting?
> (I am not sure, but I suspect that "1 - Uinact - Uextra" with your new
> definition of Uextra should work well...)
>
I am seeing this with original Uextra setting where the global bandwidth
is accounted. With "Uextra = Umax - this_bw", reclaiming seems to be
correct and I think it is because it considers local bandwidth only.

> > With dq = -(max{u_i, (Umax - Uinact - Uextra)} / Umax) * dt (1)
> > TID[636]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.08
> > TID[635]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.07
> > TID[637]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 95.06
> >
> > With dq = -(max{u_i, (1 - Uinact - Uextra)} / Umax) * dt (2)
> > TID[601]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> > TID[600]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
> > TID[602]: RECLAIM=1, (r=3ms, d=100ms, p=100ms), Util: 35.65
>
> Maybe I am missing something and I am misunderstanding the situation,
> but my impression was that this is the effect of setting
>         Umax - \Sum(u_i / #cpus in the root domain)
> I was hoping that with your new Umax setting this problem could be
> fixed... I am going to double-check my reasoning.
>
Even with the Umax_reclaim changes, equation (1) is the one which
reclaims up to 95% when the number of tasks is less than the number of
cpus. With more tasks than cpus, eq (1) still reclaims more than
eq (2) and cpu utilization caps at 95%. I also need to dig more to
understand the reason behind this.

Thanks for looking into this, I will also study more on this and
keep you posted.

Thanks,
Vineeth