[PATCH V2] sched: Forward deadline for early tick
Posted by zihan zhou 1 year, 1 month ago
Due to tick error, the eevdf scheduler exhibits unexpected behavior.
For example, a machine with sysctl_sched_base_slice=3ms and
CONFIG_HZ=1000 should trigger a tick every 1ms, and a sched_entity (se)
with the default weight 1024 should theoretically reach its deadline on
the third tick.

However, the tick often arrives a little earlier than expected. In
that case, the se has to wait until the next tick to be considered to
have reached its deadline, and so runs about 1ms longer.

vruntime + sysctl_sched_base_slice =     deadline
        |-----------|-----------|-----------|-----------|
             1ms          1ms         1ms         1ms
                   ^           ^           ^           ^
                 tick1       tick2       tick3       tick4(nearly 4ms)

Here is a simple example of this scenario,
where sysctl_sched_base_slice=3ms, CONFIG_HZ=1000,
the CPU is Intel(R) Xeon(R) Platinum 8338C CPU @ 2.60GHz,
and "while :; do :; done &" is run twice with pids 72112 and 72113.
According to the design of eevdf, each should run for 3ms at a time,
but they often run for 4ms.

         time    cpu  task name      wait time  sch delay   run time
                      [tid/pid]         (msec)     (msec)     (msec)
------------- ------  -------------  ---------  ---------  ---------
 56696.846136 [0001]  perf[72368]        0.000      0.000      0.000
 56696.849378 [0001]  bash[72112]        0.000      0.000      3.241
 56696.852379 [0001]  bash[72113]        0.000      0.000      3.000
 56696.852964 [0001]  sleep[72369]       0.000      6.261      0.584
 56696.856378 [0001]  bash[72112]        3.585      0.000      3.414
 56696.860379 [0001]  bash[72113]        3.999      0.000      4.000 <
 56696.864379 [0001]  bash[72112]        4.000      0.000      4.000 <
 56696.868377 [0001]  bash[72113]        4.000      0.000      3.997 <
 56696.871378 [0001]  bash[72112]        3.997      0.000      3.000
 56696.874377 [0001]  bash[72113]        3.000      0.000      2.999
 56696.877377 [0001]  bash[72112]        2.999      0.000      2.999
 56696.881377 [0001]  bash[72113]        2.999      0.000      3.999 <

There are two sources of tick error: clockevent precision and
CONFIG_IRQ_TIME_ACCOUNTING. With CONFIG_IRQ_TIME_ACCOUNTING, every tick
accounts for less than 1ms of task time, but even without it, ticks are
still often slightly shorter than 1ms because of clockevent precision.
The system above does not have that config enabled, yet the tasks still
often run for more than 3ms.

To solve this problem, add a sched feature FORWARD_DEADLINE that
forwards the deadline appropriately: when vruntime is very close to the
deadline and the task is ineligible, the task is considered due for a
resched. The tolerance is set to min(vslice/128, tick/2).
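
As a rough worked example for the configuration above (base slice 3ms,
default weight, HZ=1000), with illustrative numbers only:

    vslice      = calc_delta_fair(3ms, se) = 3000000ns (nice-0 weight)
    vslice >> 7 = 23437ns (~23us)
    TICK_NSEC/2 = 500000ns (500us)
    tolerance   = min(23437ns, 500000ns) = ~23us

so a tick that fires within ~23us of the deadline is treated as having
reached it, instead of letting the task run for one more 1ms tick.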

With FORWARD_DEADLINE enabled, each task runs once every 3ms, as
designed by eevdf:

       time    cpu  task name         wait time  sch delay   run time
                    [tid/pid]            (msec)     (msec)     (msec)
----------- ------  ----------------  ---------  ---------  ---------
 207.207293 [0001]  bash[1699]            3.998      0.000      3.000 
 207.210294 [0001]  bash[1694]            3.000      0.000      3.000 
 207.213300 [0001]  bash[1699]            3.000      0.000      3.006 
 207.216293 [0001]  bash[1694]            3.006      0.000      2.993 
 207.219293 [0001]  bash[1699]            2.993      0.000      3.000 
 207.222293 [0001]  bash[1694]            3.000      0.000      2.999 
 207.225293 [0001]  bash[1699]            2.999      0.000      3.000 
 207.228293 [0001]  bash[1694]            3.000      0.000      3.000 
 207.231293 [0001]  bash[1699]            3.000      0.000      3.000 
 207.234293 [0001]  bash[1694]            3.000      0.000      2.999 
 207.237292 [0001]  bash[1699]            2.999      0.000      2.999 
 207.240293 [0001]  bash[1694]            2.999      0.000      3.000 
 207.243293 [0001]  bash[1699]            3.000      0.000      3.000 
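
For testing, FORWARD_DEADLINE can be toggled at runtime through the
sched features debugfs interface (usage example, assuming debugfs is
mounted at /sys/kernel/debug):

    echo FORWARD_DEADLINE    > /sys/kernel/debug/sched/features
    echo NO_FORWARD_DEADLINE > /sys/kernel/debug/sched/features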


Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
Signed-off-by: yaozhenguo <yaozhenguo@jd.com>
Signed-off-by: yaowenchao <yaowenchao@jd.com>
---
v2: 1. Only forward the deadline for ineligible tasks.
    2. For an ineligible task, vd_i = vd_i + r_i / w_i; for an eligible
       task, vd_i = ve_i + r_i / w_i, which is the same as before.
    3. tolerance = min(vslice>>7, TICK_NSEC/2), to prevent the
       scheduling error from growing when vslice is large relative to
       the tick.
---
 kernel/sched/fair.c     | 42 +++++++++++++++++++++++++++++++++++++----
 kernel/sched/features.h |  7 +++++++
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d16c8545c71..9cc52f632bb1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1006,8 +1006,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
  */
 static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if ((s64)(se->vruntime - se->deadline) < 0)
-		return false;
+
+	u64 vslice;
+	u64 tolerance = 0;
+	u64 next_deadline;
 
 	/*
 	 * For EEVDF the virtual time slope is determined by w_i (iow.
@@ -1016,11 +1018,43 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	 */
 	if (!se->custom_slice)
 		se->slice = sysctl_sched_base_slice;
+	vslice = calc_delta_fair(se->slice, se);
+
+	/*
+	 * vd_i = ve_i + r_i / w_i
+	 */
+	next_deadline = se->vruntime + vslice;
+
+	if (sched_feat(FORWARD_DEADLINE))
+		tolerance = min(vslice>>7, TICK_NSEC/2);
+
+	if ((s64)(se->vruntime + tolerance - se->deadline) < 0)
+		return false;
 
 	/*
-	 * EEVDF: vd_i = ve_i + r_i / w_i
+	 * When se->vruntime + tolerance - se->deadline >= 0 but
+	 * se->vruntime - se->deadline < 0, there are two cases,
+	 * depending on whether the entity is eligible.
+	 * If the entity is not eligible, we do not need to wait for the
+	 * deadline, because eevdf does not guarantee that an ineligible
+	 * entity can execute its requested time in one go.
+	 * If the entity is eligible, just let it run, which is the same
+	 * processing logic as before.
 	 */
-	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+
+	if (sched_feat(FORWARD_DEADLINE) && (s64)(se->vruntime - se->deadline) < 0) {
+
+		if (entity_eligible(cfs_rq, se))
+			return false;
+
+		/*
+		 * vd_i = vd_i + r_i / w_i
+		 */
+		next_deadline = se->deadline + vslice;
+	}
+
+
+	se->deadline = next_deadline;
 
 	/*
 	 * The task has consumed its request, reschedule.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 290874079f60..5c74deec7209 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -24,6 +24,13 @@ SCHED_FEAT(RUN_TO_PARITY, true)
  */
 SCHED_FEAT(PREEMPT_SHORT, true)
 
+/*
+ * For some cases where the tick is faster than expected,
+ * move the deadline forward
+ */
+SCHED_FEAT(FORWARD_DEADLINE, true)
+
+
 /*
  * Prefer to schedule the task we woke last (assuming it failed
  * wakeup-preemption), since its likely going to consume data we
-- 
2.39.3 (Apple Git-146)
Re: [PATCH V2] sched: Forward deadline for early tick
Posted by Vincent Guittot 1 year, 1 month ago
On Wed, 25 Dec 2024 at 06:41, zihan zhou <15645113830zzh@gmail.com> wrote:
>
> Due to the tick error, the eevdf scheduler exhibits unexpected behavior.
> For example, a machine with sysctl_sched_base_slice=3ms, CONFIG_HZ=1000
>  should trigger a tick every 1ms. A se (sched_entity) with default weight
>  1024 should theoretically reach its deadline on the third tick.
>
> However, the tick often arrives a little faster than expected. In this
>  case, the se can only wait until the next tick to consider that it has
>  reached its deadline, and will run 1ms longer.
>
> vruntime + sysctl_sched_base_slice =     deadline
>         |-----------|-----------|-----------|-----------|
>              1ms          1ms         1ms         1ms
>                    ^           ^           ^           ^
>                  tick1       tick2       tick3       tick4(nearly 4ms)
>
> Here is a simple example of this scenario,
> where sysctl_sched_base_slice=3ms, CONFIG_HZ=1000,
> the CPU is Intel(R) Xeon(R) Platinum 8338C CPU @ 2.60GHz,
> and "while :; do :; done &" is run twice with pids 72112 and 72113.
> According to the design of eevdf,
> both should run for 3ms each, but often they run for 4ms.
>
>          time    cpu  task name      wait time  sch delay   run time
>                       [tid/pid]         (msec)     (msec)     (msec)
> ------------- ------  -------------  ---------  ---------  ---------
>  56696.846136 [0001]  perf[72368]        0.000      0.000      0.000
>  56696.849378 [0001]  bash[72112]        0.000      0.000      3.241
>  56696.852379 [0001]  bash[72113]        0.000      0.000      3.000
>  56696.852964 [0001]  sleep[72369]       0.000      6.261      0.584
>  56696.856378 [0001]  bash[72112]        3.585      0.000      3.414
>  56696.860379 [0001]  bash[72113]        3.999      0.000      4.000 <
>  56696.864379 [0001]  bash[72112]        4.000      0.000      4.000 <
>  56696.868377 [0001]  bash[72113]        4.000      0.000      3.997 <
>  56696.871378 [0001]  bash[72112]        3.997      0.000      3.000
>  56696.874377 [0001]  bash[72113]        3.000      0.000      2.999
>  56696.877377 [0001]  bash[72112]        2.999      0.000      2.999
>  56696.881377 [0001]  bash[72113]        2.999      0.000      3.999 <
>
> There are two reasons for tick error: clockevent precision and the
> CONFIG_IRQ_TIME_ACCOUNTING. with CONFIG_IRQ_TIME_ACCOUNTING every tick
>  will less than 1ms, but even without it, because of clockevent precision,
> tick still often less than 1ms. In the system above, there is no such
>  config, but the task still often takes more than 3ms.
>
> To solve this problem, we add a sched feature FORWARD_DEADLINE,
> consider forwarding the deadline appropriately. When vruntime is very
> close to the deadline, and the task is ineligible, we consider that task
>  should be resched, the tolerance is set to min(vslice/128, tick/2).

I'm worried about this approximation, because the task didn't get the
slice it requested due to time stolen by irqs or a shorter tick
duration.

>
> when open FORWARD_DEADLINE,
>  the task will run once every 3ms as designed by eevdf:
>
>        time    cpu  task name         wait time  sch delay   run time
>                     [tid/pid]            (msec)     (msec)     (msec)
> ----------- ------  ----------------  ---------  ---------  ---------
>  207.207293 [0001]  bash[1699]            3.998      0.000      3.000
>  207.210294 [0001]  bash[1694]            3.000      0.000      3.000
>  207.213300 [0001]  bash[1699]            3.000      0.000      3.006
>  207.216293 [0001]  bash[1694]            3.006      0.000      2.993
>  207.219293 [0001]  bash[1699]            2.993      0.000      3.000
>  207.222293 [0001]  bash[1694]            3.000      0.000      2.999
>  207.225293 [0001]  bash[1699]            2.999      0.000      3.000
>  207.228293 [0001]  bash[1694]            3.000      0.000      3.000
>  207.231293 [0001]  bash[1699]            3.000      0.000      3.000
>  207.234293 [0001]  bash[1694]            3.000      0.000      2.999
>  207.237292 [0001]  bash[1699]            2.999      0.000      2.999
>  207.240293 [0001]  bash[1694]            2.999      0.000      3.000
>  207.243293 [0001]  bash[1699]            3.000      0.000      3.000
>
>
> Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
> Signed-off-by: yaozhenguo <yaozhenguo@jd.com>
> Signed-off-by: yaowenchao <yaowenchao@jd.com>
> ---
> v2: 1. just forward deadline for ineligible task.
>     2. for ineligible task, vd_i = vd_i + r_i / w_i, but for eligible task,
>        vd_i = ve_i + r_i / w_i, which is the same as before.
>     3. tolerance = min(vslice>>7, TICK_NSEC/2), prevent scheduling errors
>        from increasing when vslice is too large relative to tick.
> ---
>  kernel/sched/fair.c     | 42 +++++++++++++++++++++++++++++++++++++----
>  kernel/sched/features.h |  7 +++++++
>  2 files changed, 45 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2d16c8545c71..9cc52f632bb1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1006,8 +1006,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
>   */
>  static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       if ((s64)(se->vruntime - se->deadline) < 0)
> -               return false;
> +
> +       u64 vslice;
> +       u64 tolerance = 0;
> +       u64 next_deadline;
>
>         /*
>          * For EEVDF the virtual time slope is determined by w_i (iow.
> @@ -1016,11 +1018,43 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>          */
>         if (!se->custom_slice)
>                 se->slice = sysctl_sched_base_slice;
> +       vslice = calc_delta_fair(se->slice, se);
> +
> +       /*
> +        * vd_i = ve_i + r_i / w_i
> +        */
> +       next_deadline = se->vruntime + vslice;
> +
> +       if (sched_feat(FORWARD_DEADLINE))
> +               tolerance = min(vslice>>7, TICK_NSEC/2);
> +
> +       if ((s64)(se->vruntime + tolerance - se->deadline) < 0)
> +               return false;
>
>         /*
> -        * EEVDF: vd_i = ve_i + r_i / w_i
> +        * when se->vruntime + tolerance - se->deadline >= 0
> +        * but se->vruntime - se->deadline < 0,
> +        * there is two case: if entity is eligible?
> +        * if entity is not eligible, we don't need wait deadline, because
> +        * eevdf don't guarantee
> +        * an ineligible entity can exec its request time in one go.

Yes but it also didn't say that you can move forward its deadline
before it has consumed its slice request. A task is only ensured to
get a discrete time quanta but can be preempted after each quanta even
if its slice has not elapsed.

What you want, it's to trigger a need_resched after a minimum time
quanta has elapsed but not to update the deadline before the slice has
elapsed.

Now the question is what is the minimum time quanta for us. Should it
be a tick whatever it's real duration for the task ? Should it be
longer ?
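
A minimal sketch of that direction, purely to illustrate the idea
rather than a proposed implementation (min_quantum and the call site
are open questions; entity_eligible(), sum_exec_runtime and
prev_sum_exec_runtime exist in fair.c, while the helper itself is
hypothetical, and the eligibility check just mirrors the v2 patch):

	static bool resched_after_quantum(struct cfs_rq *cfs_rq,
					  struct sched_entity *se,
					  u64 min_quantum)
	{
		/* wall-clock runtime since this entity was last picked */
		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;

		/* slice fully consumed: reschedule, as today */
		if ((s64)(se->vruntime - se->deadline) >= 0)
			return true;

		/* still eligible: keep running, the deadline is untouched */
		if (entity_eligible(cfs_rq, se))
			return false;

		/* ineligible: allow preemption once a quantum has elapsed */
		return ran >= min_quantum;
	}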


> +        * but when entity eligible, just let it run, which is the
> +        * same processing logic as before.
>          */
> -       se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> +
> +       if (sched_feat(FORWARD_DEADLINE) && (s64)(se->vruntime - se->deadline) < 0) {
> +
> +               if (entity_eligible(cfs_rq, se))
> +                       return false;
> +
> +               /*
> +                * vd_i = vd_i + r_i / w_i
> +                */
> +               next_deadline = se->deadline + vslice;
> +       }
> +
> +
> +       se->deadline = next_deadline;
>
>         /*
>          * The task has consumed its request, reschedule.
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 290874079f60..5c74deec7209 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -24,6 +24,13 @@ SCHED_FEAT(RUN_TO_PARITY, true)
>   */
>  SCHED_FEAT(PREEMPT_SHORT, true)
>
> +/*
> + * For some cases where the tick is faster than expected,
> + * move the deadline forward
> + */
> +SCHED_FEAT(FORWARD_DEADLINE, true)
> +
> +
>  /*
>   * Prefer to schedule the task we woke last (assuming it failed
>   * wakeup-preemption), since its likely going to consume data we
> --
> 2.39.3 (Apple Git-146)
>
Re: [PATCH V2] sched: Forward deadline for early tick
Posted by zihan zhou 1 year ago
Thank you for your reply!

> > There are two reasons for tick error: clockevent precision and the
> > CONFIG_IRQ_TIME_ACCOUNTING. with CONFIG_IRQ_TIME_ACCOUNTING every tick
> >  will less than 1ms, but even without it, because of clockevent precision,
> > tick still often less than 1ms. In the system above, there is no such
> >  config, but the task still often takes more than 3ms.
> >
> > To solve this problem, we add a sched feature FORWARD_DEADLINE,
> > consider forwarding the deadline appropriately. When vruntime is very
> > close to the deadline, and the task is ineligible, we consider that task
> >  should be resched, the tolerance is set to min(vslice/128, tick/2).
> 
> I'm worried with this approximation because the task didn't get the
> slice it has requested because of the stolen time by irq or a shorter
> tick duration

Yes, you are right. Forwarding the deadline is not a good approach:
although the error is small, the task gets less exec time.


> Yes but it also didn't say that you can move forward its deadline
> before it has consumed its slice request. A task is only ensured to
> get a discrete time quanta but can be preempted after each quanta even
> if its slice has not elapsed.
> 
> What you want, it's to trigger a need_resched after a minimum time
> quanta has elapsed but not to update the deadline before the slice has
> elapsed.
> 
> Now the question is what is the minimum time quanta for us. Should it
> be a tick whatever it's real duration for the task ? Should it be
> longer ?

I have also been thinking about this question: what is the appropriate
minimum time quantum? I think neither the tick nor the slice is a good
choice. A task should have an atomic runtime greater than a tick to
avoid frequent switching, but if that quantum is the slice, it is not
flexible: tasks often use up the requested slice in one go without
regard for whether they are still eligible. So can we add a new kernel
parameter, like sysctl_sched_min_granularity? Adding a min_granularity
between the tick and the slice seems like a good choice.

We could let a task execute for min_granularity at a time, and only
then consider whether the task is still eligible or whether its
deadline needs to be updated.


But I don't know how to handle wakeup preemption more appropriately.
Is it necessary to wait for the preempted task to complete an atomic
quantum?
Re: [PATCH V2] sched: Forward deadline for early tick
Posted by Vincent Guittot 1 year ago
On Thu, 9 Jan 2025 at 06:49, zihan zhou <15645113830zzh@gmail.com> wrote:
>
> Thank you for your reply!
>
> > > There are two reasons for tick error: clockevent precision and the
> > > CONFIG_IRQ_TIME_ACCOUNTING. with CONFIG_IRQ_TIME_ACCOUNTING every tick
> > >  will less than 1ms, but even without it, because of clockevent precision,
> > > tick still often less than 1ms. In the system above, there is no such
> > >  config, but the task still often takes more than 3ms.
> > >
> > > To solve this problem, we add a sched feature FORWARD_DEADLINE,
> > > consider forwarding the deadline appropriately. When vruntime is very
> > > close to the deadline, and the task is ineligible, we consider that task
> > >  should be resched, the tolerance is set to min(vslice/128, tick/2).
> >
> > I'm worried with this approximation because the task didn't get the
> > slice it has requested because of the stolen time by irq or a shorter
> > tick duration
>
> Yes, you are right. forward deadline is not a good way, although the error
> is small, the task will have less exec time.
>
>
> > Yes but it also didn't say that you can move forward its deadline
> > before it has consumed its slice request. A task is only ensured to
> > get a discrete time quanta but can be preempted after each quanta even
> > if its slice has not elapsed.
> >
> > What you want, it's to trigger a need_resched after a minimum time
> > quanta has elapsed but not to update the deadline before the slice has
> > elapsed.
> >
> > Now the question is what is the minimum time quanta for us. Should it
> > be a tick whatever it's real duration for the task ? Should it be
> > longer ?
>
> I have also been thinking about this question: what is the appropriate
>  minimum time quanta? I think neither tick nor slice is a good choice.
> The task should have an atomic runtime greater than tick to avoid frequent
>  switching, but if the time size is silce, it is not flexible, tasks often
> use up the requested slice at once without caring about if they are

The current implementation tries to minimize the number of context
switches by letting current exhaust its slice unless a waking eligible
task has a shorter deadline.

>  eligible. So can we add a new kernel parameter, like
> sysctl_sched_min_granularity? Adding a min_granularity between tick and
>  slice seems like a good choice.

We try to avoid adding more knobs.
We are back to changing the default slice so that it is not a multiple
of the tick, to leave room for interrupt context and others.

default slice = 0.75 msec * (1 + ilog(ncpus)) with ncpus capped at 8
which means that we have a default slice of
0.75 for 1 cpu
1.50 up to 3 cpus
2.25 up to 7 cpus
3.00 for 8 cpus and above

For HZ=250 and HZ=100, all values are "ok". By "ok", I mean that tasks
will not get an extra tick even though their runtime remains far higher
than their slice. The only config that has an issue is HZ=1000 with 8
cpus or more.

Using 0.70 instead of 0.75 should not change much for other configs
and would fix this config too, with a default slice equal to 2.8ms.
0.70 for 1 cpu
1.40 up to 3 cpus
2.10 up to 7 cpus
2.80 for 8 cpus and above

That being said, the problem remains the same if a task sets a custom
slice that is a multiple of the tick, or if the time stolen by
interrupts is higher than 66us per tick on average.
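
A small userspace sketch of that arithmetic, only to illustrate which
combinations land exactly on a tick boundary at HZ=1000 (factor() below
reimplements the 1 + ilog(ncpus) rule quoted above; it is not kernel
code):

	#include <stdio.h>

	/* factor = 1 + ilog2(ncpus), with ncpus capped at 8 */
	static unsigned int factor(unsigned int ncpus)
	{
		unsigned int f = 1, n = ncpus > 8 ? 8 : ncpus;

		while (n >>= 1)
			f++;
		return f;
	}

	int main(void)
	{
		const unsigned int base_ns[] = { 750000, 700000 };
		const unsigned int tick_ns = 1000000;	/* HZ=1000 */

		for (int b = 0; b < 2; b++) {
			for (unsigned int c = 1; c <= 8; c *= 2) {
				unsigned int slice = base_ns[b] * factor(c);

				printf("base=%uns ncpus=%u slice=%uns%s\n",
				       base_ns[b], c, slice,
				       slice % tick_ns ? "" : "  <- tick multiple");
			}
		}
		return 0;
	}

Only the 0.75ms base with 8 or more cpus prints the "tick multiple"
marker, which matches the HZ=1000 case described above.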

>
> We can let a task exec min_granularity time at once, then it is needed to
>  consider whether the task is eligible or whether the deadline needs
>  to be updated.
>
>
> But I don't know how to handle wake up preemption more appropriately.
> Is it necessary to wait for the preempted task to complete an atomic time?
>
>
Re: [PATCH V2] sched: Forward deadline for early tick
Posted by zihan zhou 1 year ago
Thanks for your reply!

> default slice = 0.75 msec * (1 + ilog(ncpus)) with ncpus capped at 8
> which means that we have a default slice of
> 0.75 for 1 cpu
> 1.50 up to 3 cpus
> 2.25 up to 7 cpus
> 3.00 for 8 cpus and above
> 
> For HZ=250 and HZ=100, all values are "ok". By "ok", I mean that task
> will not get an extra tick but their runtime remains far higher than
> their slice. The only config that has an issue is HZ=1000 with 8 cpus
> or more.
> 
> Using 0.70 instead of 0.75 should not change much for other configs
> and would fix this config too with a default slice equals 2.8ms.
> 0.70 for 1 cpu
> 1.40 up to 3 cpus
> 2.10 up to 7 cpus
> 2.8 for 8 cpus and above

It seems that changing the default slice is a good idea that won't
have too much impact and solves the main problem. I strongly agree
with it; it is more reasonable than forwarding the deadline.

> That being said the problem remains the same if a task sets its custom
> slice being a multiple of ticks or the time stolen by interrupts is
> higher than 66us per tick in average

Additionally, can we add some warnings on top of this? We would not
prevent slices from being integer multiples of the tick, but we would
not recommend doing it.

Here is the patch I have written based on the above logic, looking forward
to your feedback.


Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
---
 kernel/sched/debug.c    | 48 ++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/fair.c     |  6 +++---
 kernel/sched/syscalls.c |  4 ++++
 3 files changed, 54 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index a1be00a988bf..da17d19082ba 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -166,6 +166,52 @@ static const struct file_operations sched_feat_fops = {
 	.release	= single_release,
 };
 
+
+static ssize_t sched_base_slice_write(struct file *filp, const char __user *ubuf,
+				   size_t cnt, loff_t *ppos)
+{
+	char buf[16];
+	unsigned int base_slice;
+
+	if (cnt > 15)
+		cnt = 15;
+
+	if (copy_from_user(&buf, ubuf, cnt))
+		return -EFAULT;
+	buf[cnt] = '\0';
+
+	if (kstrtouint(buf, 10, &base_slice))
+		return -EINVAL;
+
+
+	sysctl_sched_base_slice = base_slice;
+
+	if (sysctl_sched_base_slice % TICK_NSEC == 0)
+		pr_warn("It is not recommended to set the slice to an integer multiple of the tick\n");
+
+	*ppos += cnt;
+	return cnt;
+}
+
+static int sched_base_slice_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", sysctl_sched_base_slice);
+	return 0;
+}
+
+static int sched_base_slice_open(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, sched_base_slice_show, NULL);
+}
+
+static const struct file_operations sched_base_slice_fops = {
+	.open		= sched_base_slice_open,
+	.write		= sched_base_slice_write,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
 #ifdef CONFIG_SMP
 
 static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
@@ -505,7 +551,7 @@ static __init int sched_init_debug(void)
 	debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
 #endif
 
-	debugfs_create_u32("base_slice_ns", 0644, debugfs_sched, &sysctl_sched_base_slice);
+	debugfs_create_file("base_slice_ns", 0644, debugfs_sched, NULL, &sched_base_slice_fops);
 
 	debugfs_create_u32("latency_warn_ms", 0644, debugfs_sched, &sysctl_resched_latency_warn_ms);
 	debugfs_create_u32("latency_warn_once", 0644, debugfs_sched, &sysctl_resched_latency_warn_once);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9ca38512de..27f7cf205741 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -71,10 +71,10 @@ unsigned int sysctl_sched_tunable_scaling = SCHED_TUNABLESCALING_LOG;
 /*
  * Minimal preemption granularity for CPU-bound tasks:
  *
- * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
+ * (default: 0.70 msec * (1 + ilog(ncpus)), units: nanoseconds)
  */
-unsigned int sysctl_sched_base_slice			= 750000ULL;
-static unsigned int normalized_sysctl_sched_base_slice	= 750000ULL;
+unsigned int sysctl_sched_base_slice			= 700000ULL;
+static unsigned int normalized_sysctl_sched_base_slice	= 700000ULL;
 
 const_debug unsigned int sysctl_sched_migration_cost	= 500000UL;
 
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index ff0e5ab4e37c..d0f90d61398f 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -309,6 +309,10 @@ static void __setscheduler_params(struct task_struct *p,
 			p->se.slice = clamp_t(u64, attr->sched_runtime,
 					      NSEC_PER_MSEC/10,   /* HZ=1000 * 10 */
 					      NSEC_PER_MSEC*100); /* HZ=100  / 10 */
+
+			if (p->se.slice % TICK_NSEC == 0)
+				pr_warn("It is not recommended to set the slice to an integer multiple of the tick\n");
+
 		} else {
 			p->se.custom_slice = 0;
 			p->se.slice = sysctl_sched_base_slice;
-- 
2.33.0
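
With this patch applied, the value could be read and written like this
(usage example, assuming debugfs is mounted at /sys/kernel/debug):

    cat /sys/kernel/debug/sched/base_slice_ns
    echo 2800000 > /sys/kernel/debug/sched/base_slice_ns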
Re: [PATCH V2] sched: Forward deadline for early tick
Posted by Vincent Guittot 1 year ago
On Thu, 16 Jan 2025 at 09:13, zihan zhou <15645113830zzh@gmail.com> wrote:
>
> Thanks for your reply!
>
> > default slice = 0.75 msec * (1 + ilog(ncpus)) with ncpus capped at 8
> > which means that we have a default slice of
> > 0.75 for 1 cpu
> > 1.50 up to 3 cpus
> > 2.25 up to 7 cpus
> > 3.00 for 8 cpus and above
> >
> > For HZ=250 and HZ=100, all values are "ok". By "ok", I mean that task
> > will not get an extra tick but their runtime remains far higher than
> > their slice. The only config that has an issue is HZ=1000 with 8 cpus
> > or more.
> >
> > Using 0.70 instead of 0.75 should not change much for other configs
> > and would fix this config too with a default slice equals 2.8ms.
> > 0.70 for 1 cpu
> > 1.40 up to 3 cpus
> > 2.10 up to 7 cpus
> > 2.8 for 8 cpus and above
>
> It seems that changing the default slice is a good idea that won't have
> too much impact and solves the main problem. I strongly agree with it,
> which is more reasonable than the forward deadline.
>
> > That being said the problem remains the same if a task sets its custom
> > slice being a multiple of ticks or the time stolen by interrupts is
> > higher than 66us per tick in average
>
> Additionally, can we add some warnings on this basis? We do not prevent
> slices from being integer multiples of the tick, but we do not recommend
> doing it.

For sure we should provide a recommendation, but I'm not sure it's
worth a warning; better to add documentation for users who want to set
their slice.

>
> Here is the patch I have written based on the above logic, looking forward
> to your feedback.
>
>
> Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
> ---
>  kernel/sched/debug.c    | 48 ++++++++++++++++++++++++++++++++++++++++-
>  kernel/sched/fair.c     |  6 +++---
>  kernel/sched/syscalls.c |  4 ++++
>  3 files changed, 54 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index a1be00a988bf..da17d19082ba 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -166,6 +166,52 @@ static const struct file_operations sched_feat_fops = {
>         .release        = single_release,
>  };
>
> +
> +static ssize_t sched_base_slice_write(struct file *filp, const char __user *ubuf,
> +                                  size_t cnt, loff_t *ppos)
> +{
> +       char buf[16];
> +       unsigned int base_slice;
> +
> +       if (cnt > 15)
> +               cnt = 15;
> +
> +       if (copy_from_user(&buf, ubuf, cnt))
> +               return -EFAULT;
> +       buf[cnt] = '\0';
> +
> +       if (kstrtouint(buf, 10, &base_slice))
> +               return -EINVAL;
> +
> +
> +       sysctl_sched_base_slice = base_slice;

sysctl_sched_base_slice is updated by update_sysctl() on each cpu
hotplug, so you can't really set it directly.

Could you create a patch that only updates sysctl_sched_base_slice
and normalized_sysctl_sched_base_slice to 700000UL, with figures and
an explanation?
You can then try a 2nd patch which updates it with debugfs, but I'm
not convinced at all by the benefit, because it would have to update
normalized_sysctl_sched_base_slice, which is really meaningless for
the user.


> +
> +       if (sysctl_sched_base_slice % TICK_NSEC == 0)
> +               pr_warn("It is not recommended to set the slice to an integer multiple of the tick\n");
> +
> +       *ppos += cnt;
> +       return cnt;
> +}
> +
> +static int sched_base_slice_show(struct seq_file *m, void *v)
> +{
> +       seq_printf(m, "%d\n", sysctl_sched_base_slice);
> +       return 0;
> +}
> +
> +static int sched_base_slice_open(struct inode *inode, struct file *filp)
> +{
> +       return single_open(filp, sched_base_slice_show, NULL);
> +}
> +
> +static const struct file_operations sched_base_slice_fops = {
> +       .open           = sched_base_slice_open,
> +       .write          = sched_base_slice_write,
> +       .read           = seq_read,
> +       .llseek         = seq_lseek,
> +       .release        = single_release,
> +};
> +
>  #ifdef CONFIG_SMP
>
>  static ssize_t sched_scaling_write(struct file *filp, const char __user *ubuf,
> @@ -505,7 +551,7 @@ static __init int sched_init_debug(void)
>         debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
>  #endif
>
> -       debugfs_create_u32("base_slice_ns", 0644, debugfs_sched, &sysctl_sched_base_slice);
> +       debugfs_create_file("base_slice_ns", 0644, debugfs_sched, NULL, &sched_base_slice_fops);
>
>         debugfs_create_u32("latency_warn_ms", 0644, debugfs_sched, &sysctl_resched_latency_warn_ms);
>         debugfs_create_u32("latency_warn_once", 0644, debugfs_sched, &sysctl_resched_latency_warn_once);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3e9ca38512de..27f7cf205741 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -71,10 +71,10 @@ unsigned int sysctl_sched_tunable_scaling = SCHED_TUNABLESCALING_LOG;
>  /*
>   * Minimal preemption granularity for CPU-bound tasks:
>   *
> - * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
> + * (default: 0.70 msec * (1 + ilog(ncpus)), units: nanoseconds)
>   */
> -unsigned int sysctl_sched_base_slice                   = 750000ULL;
> -static unsigned int normalized_sysctl_sched_base_slice = 750000ULL;
> +unsigned int sysctl_sched_base_slice                   = 700000ULL;
> +static unsigned int normalized_sysctl_sched_base_slice = 700000ULL;
>
>  const_debug unsigned int sysctl_sched_migration_cost   = 500000UL;
>
> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> index ff0e5ab4e37c..d0f90d61398f 100644
> --- a/kernel/sched/syscalls.c
> +++ b/kernel/sched/syscalls.c
> @@ -309,6 +309,10 @@ static void __setscheduler_params(struct task_struct *p,
>                         p->se.slice = clamp_t(u64, attr->sched_runtime,
>                                               NSEC_PER_MSEC/10,   /* HZ=1000 * 10 */
>                                               NSEC_PER_MSEC*100); /* HZ=100  / 10 */
> +
> +                       if (p->se.slice % TICK_NSEC == 0)
> +                               pr_warn("It is not recommended to set the slice to an integer multiple of the tick\n");

Better to add documentation than a warning.

Also, this code has moved into fair.c now.

> +
>                 } else {
>                         p->se.custom_slice = 0;
>                         p->se.slice = sysctl_sched_base_slice;
> --
> 2.33.0
>
Re: [PATCH V2] sched: Forward deadline for early tick
Posted by zihan zhou 1 year ago
Thanks for your reply!

> For sure we should provide recommendation but I'm not sure it's worth
> a warning but better a documentation for user who wants to set their
> slice

> Could you create a patch that only updates sysctl_sched_base_slice
> and normalized_sysctl_sched_base_slice to 700000UL with figures and
> explanation

Thank you for your suggestion! Adding a warning is not worth it. I
will submit a new patch soon, which changes the default value of the
slice and adds code comments about the change. I am not sure whether I
should change the files in the Documentation directory. If I should, I
will move the content of the comments to the Documentation directory.

> sysctl_sched_base_slice is updated by update_sysctl() for each cpu
> hotplug so you can't really set it directly
> 
> You can then try a 2nd patch which updates it with debugfs but I'm not
> convince at all by this benefit because it should update the
> normalized_sysctl_sched_base_slice which is really meaningless for
> user

There is indeed a big problem with this, which I overlooked. Thank you
for the correction. I actually intended to add a numerical range check
for the slice through sched_base_slice_fops, because I noticed that
syscalls.c has a range requirement (NSEC_PER_MSEC/10, NSEC_PER_MSEC*100)
for modifying the slice. However, it seems that the existence of this
range is not important.

> > diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> > index ff0e5ab4e37c..d0f90d61398f 100644
> > --- a/kernel/sched/syscalls.c
> > +++ b/kernel/sched/syscalls.c
> > @@ -309,6 +309,10 @@ static void __setscheduler_params(struct task_struct *p,
> >                         p->se.slice = clamp_t(u64, attr->sched_runtime,
> >                                               NSEC_PER_MSEC/10,   /* HZ=1000 * 10 */
> >                                               NSEC_PER_MSEC*100); /* HZ=100  / 10 */
> > +
> > +                       if (p->se.slice % TICK_NSEC == 0)
> > +                               pr_warn("It is not recommended to set the slice to an integer multiple of the tick\n");
> 
> You should better add documentation than a warning
> 
> Also, this code has moved in fair.c now

I'm a bit confused here. Has this code been moved from syscalls.c to
fair.c? I haven't seen this change in the master kernel. Is there
another code repository?
Re: [PATCH V2] sched: Forward deadline for early tick
Posted by Honglei Wang 1 year ago
Hi Zihan,

On 2025/1/20 15:38, zihan zhou wrote:
> Thanks for your reply!
> 
>> For sure we should provide recommendation but I'm not sure it's worth
>> a warning but better a documentation for user who wants to set their
>> slice
> 
>> Could you create a patch that only updates sysctl_sched_base_slice
>> and normalized_sysctl_sched_base_slice to 700000UL with figures and
>> explanation
> 
> Thank you for your suggestion! Adding a warning is not worth it. I will
> submit a new patch soon, which changes the default value of slice
> and add the code comments about this change. I am not sure if I
> should change the files in the Documentation directory. If I should,
> I will move the content in the comments to the Documentation directory.
> 
>> sysctl_sched_base_slice is updated by update_sysctl() for each cpu
>> hotplug so you can't really set it directly
>>
>> You can then try a 2nd patch which updates it with debugfs but I'm not
>> convince at all by this benefit because it should update the
>> normalized_sysctl_sched_base_slice which is really meaningless for
>> user
> 
> There is indeed a big problem with this, and I overlooked it. Thank you
> for your corrections. I actually intended to add a numerical range
> requirement to slice through sched_base_slice_fops, because I noticed that
> syscalls has a range requirement (NSEC_PER_MSEC/10, NSEC_PER_MSEC*100)
> for modifying slice. However, it seems that the existence of this range is
> not important.
> 
>>> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
>>> index ff0e5ab4e37c..d0f90d61398f 100644
>>> --- a/kernel/sched/syscalls.c
>>> +++ b/kernel/sched/syscalls.c
>>> @@ -309,6 +309,10 @@ static void __setscheduler_params(struct task_struct *p,
>>>                         p->se.slice = clamp_t(u64, attr->sched_runtime,
>>>                                               NSEC_PER_MSEC/10,   /* HZ=1000 * 10 */
>>>                                               NSEC_PER_MSEC*100); /* HZ=100  / 10 */
>>> +
>>> +                       if (p->se.slice % TICK_NSEC == 0)
>>> +                               pr_warn("It is not recommended to set the slice to an integer multiple of the tick\n");
>>
>> You should better add documentation than a warning
>>
>> Also, this code has moved in fair.c now
> 
> I'm a bit confused here. Has this code been moved from syscalls.c to
> fair.c file? I haven't seen this change on the master kernel.
> Is there any other code repository?
> 

Just FYI. I think Vincent meant this one 2cf9ac40073d ("sched/fair:
Encapsulate set custom slice in a __setparam_fair() function"). You can
check it at the sched/core branch here:
https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/core&id=2cf9ac40073ddb6a70dd5ef94f1cf45501b696ed

Thanks,
Honglei
Re: [PATCH V2] sched: Forward deadline for early tick
Posted by Vincent Guittot 1 year ago
On Mon, 20 Jan 2025 at 12:44, Honglei Wang <jameshongleiwang@126.com> wrote:
>
> Hi Zihan,
>
> On 2025/1/20 15:38, zihan zhou wrote:
> > Thanks for your reply!
> >
> >> For sure we should provide recommendation but I'm not sure it's worth
> >> a warning but better a documentation for user who wants to set their
> >> slice
> >
> >> Could you create a patch that only updates sysctl_sched_base_slice
> >> and normalized_sysctl_sched_base_slice to 700000UL with figures and
> >> explanation
> >
> > Thank you for your suggestion! Adding a warning is not worth it. I will
> > submit a new patch soon, which changes the default value of slice
> > and add the code comments about this change. I am not sure if I
> > should change the files in the Documentation directory. If I should,
> > I will move the content in the comments to the Documentation directory.
> >
> >> sysctl_sched_base_slice is updated by update_sysctl() for each cpu
> >> hotplug so you can't really set it directly
> >>
> >> You can then try a 2nd patch which updates it with debugfs but I'm not
> >> convince at all by this benefit because it should update the
> >> normalized_sysctl_sched_base_slice which is really meaningless for
> >> user
> >
> > There is indeed a big problem with this, and I overlooked it. Thank you
> > for your corrections. I actually intended to add a numerical range
> > requirement to slice through sched_base_slice_fops, because I noticed that
> > syscalls has a range requirement (NSEC_PER_MSEC/10, NSEC_PER_MSEC*100)
> > for modifying slice. However, it seems that the existence of this range is
> > not important.
> >
> >>> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> >>> index ff0e5ab4e37c..d0f90d61398f 100644
> >>> --- a/kernel/sched/syscalls.c
> >>> +++ b/kernel/sched/syscalls.c
> >>> @@ -309,6 +309,10 @@ static void __setscheduler_params(struct task_struct *p,
> >>>                         p->se.slice = clamp_t(u64, attr->sched_runtime,
> >>>                                               NSEC_PER_MSEC/10,   /* HZ=1000 * 10 */
> >>>                                               NSEC_PER_MSEC*100); /* HZ=100  / 10 */
> >>> +
> >>> +                       if (p->se.slice % TICK_NSEC == 0)
> >>> +                               pr_warn("It is not recommended to set the slice to an integer multiple of the tick\n");
> >>
> >> You should better add documentation than a warning
> >>
> >> Also, this code has moved in fair.c now
> >
> > I'm a bit confused here. Has this code been moved from syscalls.c to
> > fair.c file? I haven't seen this change on the master kernel.
> > Is there any other code repository?
> >
>
> Just FYI. I think Vincent meant this one 2cf9ac40073d ("sched/fair:
> Encapsulate set custom slice in a __setparam_fair() function"). You can
> check it at the sched/core branch here:
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/core&id=2cf9ac40073ddb6a70dd5ef94f1cf45501b696ed

You can find the information in the MAINTAINERS file, in the scheduler
section. Actually it's:

T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core

>
> Thanks,
> Honglei
>
Re: [PATCH V2] sched: Forward deadline for early tick
Posted by zihan zhou 1 year ago
> > Just FYI. I think Vincent meant this one 2cf9ac40073d ("sched/fair:
> > Encapsulate set custom slice in a __setparam_fair() function"). You can
> > check it at the sched/core branch here:
> > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/core&id=2cf9ac40073ddb6a70dd5ef94f1cf45501b696ed
> 
> You can find the information in MAINTAINERS file at the scheduler
> section. Actually it's
> 
> T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core

Thank you Vincent, thank you Honglei, I have learned it!