Due to tick timing inaccuracy,
the EEVDF scheduler exhibits unexpected behavior.
For example, a machine with sysctl_sched_base_slice=3ms and CONFIG_HZ=1000
should trigger a tick every 1ms.
A se (sched_entity) with the default weight of 1024 should
theoretically reach its deadline on the third tick.
However, ticks often arrive slightly earlier than expected.
When that happens, the se must wait until the next tick before it is
considered to have reached its deadline, and so runs 1ms longer.
vruntime + sysctl_sched_base_slice = deadline
|-----------|-----------|-----------|-----------|
1ms 1ms 1ms 1ms
^ ^ ^ ^
tick1 tick2 tick3 tick4(nearly 4ms)
Here is a simple example of this scenario, where
sysctl_sched_base_slice=3ms, CONFIG_HZ=1000, the CPU is an Intel(R)
Xeon(R) Platinum 8338C CPU @ 2.60GHz, and "while :; do :; done &" is
run twice, with pids 72112 and 72113. By the design of EEVDF, each
task should run for 3ms at a time, but they often run for 4ms.
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
------------- ------ ------------- --------- --------- ---------
56696.846136 [0001] perf[72368] 0.000 0.000 0.000
56696.849378 [0001] bash[72112] 0.000 0.000 3.241
56696.852379 [0001] bash[72113] 0.000 0.000 3.000
56696.852964 [0001] sleep[72369] 0.000 6.261 0.584
56696.856378 [0001] bash[72112] 3.585 0.000 3.414
56696.860379 [0001] bash[72113] 3.999 0.000 4.000
56696.864379 [0001] bash[72112] 4.000 0.000 4.000
56696.868377 [0001] bash[72113] 4.000 0.000 3.997
56696.871378 [0001] bash[72112] 3.997 0.000 3.000
56696.874377 [0001] bash[72113] 3.000 0.000 2.999
56696.877377 [0001] bash[72112] 2.999 0.000 2.999
56696.881377 [0001] bash[72113] 2.999 0.000 3.999
The reason for this problem is that
the actual length of each tick is less than 1ms.
We believe there are two causes:
Reason 1:
Hardware error.
An ordinary computer's clock cannot guarantee perfectly accurate timing.
Reason 2:
Calculation error.
Many clocks derive time indirectly from a cycle count; this conversion
inevitably has error and rounds down, so the result is smaller than the
actual value. The relevant kernel code is:
	clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
	dev->set_next_event((unsigned long) clc, dev);
To solve this problem,
we add a sched feature, FORWARD_DEADLINE,
which moves the deadline forward slightly:
when vruntime is very close to the deadline,
the deadline is considered to have been reached.
The tolerance is set to vslice/128.
On our machine the tick error never exceeds this tolerance,
and an error of less than 1%
should not affect the scheduler's fairness.
With FORWARD_DEADLINE enabled, each task runs for 3ms at a time,
as EEVDF intends:
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
------------- ------ ---------------- --------- --------- ---------
57110.032369 [0001] perf[72752] 0.000 0.000 0.000
57110.035805 [0001] bash[72112] 0.000 0.000 3.436
57110.036400 [0001] sleep[72755] 0.000 3.456 0.594
57110.039806 [0001] bash[72113] 0.000 0.000 3.405
57110.042805 [0001] bash[72112] 4.000 0.000 2.999
57110.045811 [0001] bash[72113] 2.999 0.000 3.005
57110.045816 [0001] ksoftirqd/1[98] 0.000 0.001 0.005
57110.048804 [0001] bash[72112] 3.010 0.000 2.987
57110.051805 [0001] bash[72113] 2.993 0.000 3.001
57110.054804 [0001] bash[72112] 3.001 0.000 2.998
57110.057805 [0001] bash[72113] 2.998 0.000 3.000
57110.060804 [0001] bash[72112] 3.000 0.000 2.999
57110.063805 [0001] bash[72113] 2.999 0.000 3.001
57110.066804 [0001] bash[72112] 3.001 0.000 2.999
57110.069805 [0001] bash[72113] 2.999 0.000 3.000
57110.072804 [0001] bash[72112] 3.000 0.000 2.999
---
kernel/sched/fair.c | 21 ++++++++++++++++++---
kernel/sched/features.h | 7 +++++++
2 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d16c8545c71..ea342ef8db26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1006,8 +1006,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
*/
static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- if ((s64)(se->vruntime - se->deadline) < 0)
- return false;
+
+ u64 vslice;
+ u64 tolerance = 0;
/*
* For EEVDF the virtual time slope is determined by w_i (iow.
@@ -1016,11 +1017,25 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
if (!se->custom_slice)
se->slice = sysctl_sched_base_slice;
+ vslice = calc_delta_fair(se->slice, se);
+
+
+	/*
+	 * When vruntime is within vslice/128 of the deadline,
+	 * consider the deadline to have been reached.
+	 */
+ if (sched_feat(FORWARD_DEADLINE))
+ tolerance = vslice>>7;
+
+
+
+ if ((s64)(se->vruntime + tolerance - se->deadline) < 0)
+ return false;
/*
* EEVDF: vd_i = ve_i + r_i / w_i
*/
- se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+ se->deadline = se->vruntime + vslice;
/*
* The task has consumed its request, reschedule.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 290874079f60..fad8e2bbf4ed 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -24,6 +24,13 @@ SCHED_FEAT(RUN_TO_PARITY, true)
*/
SCHED_FEAT(PREEMPT_SHORT, true)
+/*
+ * For some cases where the tick is faster than expected,
+ * move the deadline forward.
+ */
+SCHED_FEAT(FORWARD_DEADLINE, true)
+
+
/*
* Prefer to schedule the task we woke last (assuming it failed
* wakeup-preemption), since its likely going to consume data we
--
2.39.3 (Apple Git-146)
Signed-off-by: zhouzihan30 <zhouzihan30@jd.com>
Signed-off-by: yaozhenguo <yaozhenguo@jd.com>
On Tue, Nov 26, 2024 at 02:23:50PM +0800, zhouzihan30 wrote:
> Due to the problem of tick time accuracy,
> the eevdf scheduler exhibits unexpected behavior.
> For example, a machine with sysctl_sched_base_slice=3ms, CONFIG_HZ=1000
> should trigger a tick every 1ms.
> A se (sched_entity) with default weight 1024 should
> theoretically reach its deadline on the third tick.
> However, the tick often arrives a little faster than expected.
> In this case, the se can only wait until the next tick to
> consider that it has reached its deadline, and will run 1ms longer.
>
> vruntime + sysctl_sched_base_slice = deadline
> |-----------|-----------|-----------|-----------|
> 1ms 1ms 1ms 1ms
> ^ ^ ^ ^
> tick1 tick2 tick3 tick4(nearly 4ms)
>
> Here is a simple example of this scenario,
> where sysctl_sched_base_slice=3ms, CONFIG_HZ=1000,
> the CPU is Intel(R) Xeon(R) Platinum 8338C CPU @ 2.60GHz,
> and "while :; do :; done &" is run twice with pids 72112 and 72113.
> According to the design of eevdf,
> both should run for 3ms each, but often they run for 4ms.
>
> The reason for this problem is that
> the actual time of each tick is less than 1ms.
> We believe there are two reasons:
> Reason 1:
> Hardware error.
> The clock of an ordinary computer cannot guarantee its own accurate time.
> Reason 2:
> Calculation error.
> Many clocks calculate time indirectly through the number of cycles,
> which will definitely have errors and be smaller than the actual value,
> the kernel code is:
>
> clc= ((unsignedlonglong) delta*dev->mult) >>dev->shift;
> dev->set_next_event((unsignedlong) clc, dev);
>
> To solve this problem,
> we add a sched feature FORWARD_DEADLINE,
> consider forwarding the deadline appropriately.
> When vruntime is very close to the deadline,
> we consider that the deadline has been reached.
> This tolerance is set to vslice/128.
> On our machine, the tick error will not be greater than this tolerance,
> and an error of less than 1%
> should not affect the fairness of the scheduler.
*sigh*
So yeah, discrete system, we get to quantize things. Timing has an
error proportional to the quantum chosen.
Current setup is the trivial whole integer or floor function. As such
the absolute error e is 0<e<q -- for a reason, see below.
You then introduce an additional error of: r/128 in order to minimize
the total error, except you made it worse. Consider case where r/128>q.
The only semi sane thing to do here is to replace floor with rounding,
then the absolute error changes to 0<e<q/2.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbdca89c677f..d01b73e3f5aa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1006,7 +1006,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
*/
static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- if ((s64)(se->vruntime - se->deadline) < 0)
+ s64 v_half_tick = calc_delta_fair(TICK_NSEC/2, se);
+
+ if ((s64)(se->vruntime - se->deadline) < -v_half_tick)
return false;
/*
Now the consequence of doing this are that you will end up pushing a
task's (virtual) deadline into the future before it will have had time
to consume its full request, meaning its effective priority drops before
completion.
While it does not harm fairness, it does harm to the guarantees provided
by EEVDF such as they are.
Additionally, since update_deadline() is not proper (see its comment),
you're making the cumulative error significantly worse. A better version
would be something like:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbdca89c677f..377e61be2cf6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1000,13 +1000,12 @@ int sched_update_scaling(void)
static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
-/*
- * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
- * this is probably good enough.
- */
static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- if ((s64)(se->vruntime - se->deadline) < 0)
+ s64 v_half_tick = calc_delta_fair(TICK_NSEC/2, se);
+ s64 v_slice;
+
+ if ((s64)(se->vruntime - se->deadline) < -v_half_tick)
return false;
/*
@@ -1017,10 +1016,14 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (!se->custom_slice)
se->slice = sysctl_sched_base_slice;
+ v_slice = calc_delta_fair(se->slice, se);
+
/*
- * EEVDF: vd_i = ve_i + r_i / w_i
+ * EEVDF: vd_i += r_i / w_i
*/
- se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+ do {
+ se->deadline += v_slice;
+ } while ((s64)(se->vruntime - se->deadline) < v_half_tick);
/*
* The task has consumed its request, reschedule.
Now, if all this is worth it I've no idea.
Since you clearly do not care about the request size, your simplest
solution is to simply decrease it.
Also, none of this has been near a compiler, and please double check the
math if you want to proceed with any of this, I've given that about as
much thought as compile time :-)
From: zhouzihan <zhouzihan30@jd.com>
Thank you very much for your reply, which has given me a lot to
think about.
I have reconsidered this issue and believe the root cause is that
implementing ideal EEVDF in the kernel is both difficult and
unnecessary on real hardware:
for example,
ideal EEVDF could cause frequent context switches, whose cost the
kernel cannot really afford.
So I see that the kernel has a very concise and clever implementation,
update_deadline, which lets each task use up its request size
in one go as much as possible.
Here the kernel already slightly violates EEVDF: we no longer care,
for the time being, whether a task is eligible.
The previous patch mentioned that due to tick errors, some tasks
run longer than requested. So we can do this: when a task's vruntime
approaches its deadline, check whether the task is eligible.
If it is eligible, we follow the previous logic and do not reschedule it.
If it is ineligible, we reschedule it, because EEVDF has the
responsibility not to run ineligible tasks.
In other words, the kernel has granted the task a "benefit" based on
the actual situation, and we retain the right to revoke that benefit.
This actually brings the kernel closer to ideal EEVDF,
and at the same time, your reply made me realize my mistake:
the deadline should be updated with the following formula,
which is more reasonable:
vd_i += r_i / w_i
With it, the scheduler is still fair,
and each task obtains time according to its weight.
For the tolerance, I think min(vslice/128, tick/2) is better:
as your reply notes, vslice may be too large, so cap it.
However, there is still a new issue here:
if se 1 has its deadline pushed prematurely because it is ineligible,
and then se 2 runs, se 1 may become eligible after se 2 finishes,
but its deadline has already been pushed back,
so it is no longer the earliest eligible entity
and cannot run. This does not strictly comply with EEVDF.
But I think this is acceptable. Previously, the logic let a task
run an extra tick (about 1ms), meaning other tasks had to wait an
extra tick. Now a task may run slightly less (microseconds), but it
will be compensated in the next scheduling round. In terms of overall
accuracy, I think this is an improvement.
By the way, we could try to solve this by delaying the deadline update:
schedule the task away without updating its deadline,
and update it the next time it runs.
I am not sure such complex logic is necessary here.
I think it is not,
because ideal EEVDF may not suit the kernel,
and it is not worth spending so much to eliminate such a small error.
A truly accurate system should have no tick error
and can simply disable the FORWARD_DEADLINE feature.
Of course, if you think it is necessary, I am willing to try to implement it.
Signed-off-by: zhouzihan30 <zhouzihan30@jd.com>
---
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
kernel/sched/features.h | 7 +++++++
2 files changed, 44 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d16c8545c71..e6e58c7d6d4c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1006,8 +1006,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
*/
static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- if ((s64)(se->vruntime - se->deadline) < 0)
- return false;
+
+ u64 vslice;
+ u64 tolerance = 0;
+ u64 next_deadline;
/*
* For EEVDF the virtual time slope is determined by w_i (iow.
@@ -1016,11 +1018,42 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
if (!se->custom_slice)
se->slice = sysctl_sched_base_slice;
+ vslice = calc_delta_fair(se->slice, se);
+
+ next_deadline = se->vruntime + vslice;
+
+ if (sched_feat(FORWARD_DEADLINE))
+ tolerance = min(vslice>>7, TICK_NSEC/2);
+
+ if ((s64)(se->vruntime + tolerance - se->deadline) < 0)
+ return false;
/*
- * EEVDF: vd_i = ve_i + r_i / w_i
+	 * When se->vruntime + tolerance - se->deadline >= 0
+	 * but se->vruntime - se->deadline < 0,
+	 * there are two cases, depending on eligibility:
+	 * if the entity is not eligible, we need not wait for the
+	 * deadline, because EEVDF does not guarantee that an ineligible
+	 * entity can consume its whole request in one go;
+	 * but when the entity is eligible, just let it run, which is
+	 * the same processing logic as before.
*/
- se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+
+ if (sched_feat(FORWARD_DEADLINE) && (s64)(se->vruntime - se->deadline) < 0) {
+ if (entity_eligible(cfs_rq, se)) {
+ return false;
+ } else {
+ /*
+			 * In this case the entity has not fully consumed
+			 * its request, but since it is not eligible we need
+			 * not run it; let vd_i += r_i / w_i for fairness.
+ */
+ next_deadline = se->deadline + vslice;
+ }
+ }
+
+
+ se->deadline = next_deadline;
/*
* The task has consumed its request, reschedule.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 290874079f60..5c74deec7209 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -24,6 +24,13 @@ SCHED_FEAT(RUN_TO_PARITY, true)
*/
SCHED_FEAT(PREEMPT_SHORT, true)
+/*
+ * For some cases where the tick is faster than expected,
+ * move the deadline forward
+ */
+SCHED_FEAT(FORWARD_DEADLINE, true)
+
+
/*
* Prefer to schedule the task we woke last (assuming it failed
* wakeup-preemption), since its likely going to consume data we
--
2.39.3 (Apple Git-146)
On Wed, 27 Nov 2024 at 08:06, zhouzihan30 <15645113830zzh@gmail.com> wrote:
>
> From: zhouzihan <zhouzihan30@jd.com>
>
> Thank you very much for your reply, which has brought me lots
> of thoughts.
>
> I have reconsidered this issue and believe that the root cause is that
> the kernel is difficult and unnecessary to implement an ideal eevdf
> due to real hardware:
> for example,
> an ideal eevdf may bring frequent switching, its cost makes kernel can't
> really do it.
>
> So I see that the kernel has a very concise and clever implementation:
> update_deadline, which allows each task to use up the request size
> as much as possible in one go.
>
> Here, the kernel has actually slightly violated eevdf: we are no longer
> concerned about whether a task is eligible for the time being.
>
> In the prev patch, it was mentioned that due to tick errors, some tasks
> run longer than requested. So if we can do this: when a task vruntime
Could you give more details about the tick error ?
Could it be that you have irq accounting enabled ? In this case the
delta_exec will always be lower than tick because of the time spent in
the irq context. As an example, the delta of rq_clock_task is always
less than 1ms on my system but the delta of rq_clock is sometimes
above and sometime below 1ms
This means that the task didn't effectively get its slice because of
time spent in IRQ context. Would it be better to set a default slice
slightly lower than an integer number of tick
> approaches the deadline, we check if the task is eligible.
> If it is eligible, we follow the previous logic and do not schedule it.
> However, if it is ineligible, we schedule it because eevdf has the
> responsibility to not exec ineligible task.
>
> In other words, the kernel has given the task a "benefit" based on the
> actual situation, and we still have the right to revoke this benefit.
>
> In this way, it actually brings the kernel closer to an ideal eevdf,
> and at the same time, your reply made me realize my mistake:
> The deadline update should be updated using the following function,
> which is more reasonable:
> vd_i += r_i / w_i
> By using it, our scheduler is still fair,
> and each task can obtain its own time according to its weight.
>
> About tolerance, I think min(vslice/128, tick/2) is better,
> as your reply, vslice maybe too big, so use it.
>
> However, there is still a new issue here:
> If a se 1 terminates its deadline prematurely due to ineligible,
> and then a se 2 runs, after the end of the run,
> se 1 becomes eligible, but its deadline has already been pushed back,
> which is not earliest eligible,
> so se 1 can't run. This seems to not comply with eevdf specifications.
>
> But I think this is acceptable. In the past, the logic causes a task to
> run an extra tick (ms), which means other tasks have to wait for an
> extra tick. Now, a task maybe run less time (us), but it will be
> compensated for in the next scheduling. In terms of overall accuracy,
> I think it has improved.
>
> By the way, we may try to solve this by delaying the deadline update,
> which means we schedule a task but not update its deadline,
> util next exec it.
> I am not sure if it is necessary to implement such complex logic here.
> I think it is actually unnecessary,
> because the ideal eevdf may not be suitable for the kernel.
> It is no need to spend so much to solve the error of few time.
> If there is a truly completely accurate system, it should not have
> tick error, and just closes the FORWARD_DEADLINE feature.
> Of course, if you think it is necessary, I am willing try to implement it.
>
> Signed-off-by: zhouzihan30 <zhouzihan30@jd.com>
> ---
> kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
> kernel/sched/features.h | 7 +++++++
> 2 files changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2d16c8545c71..e6e58c7d6d4c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1006,8 +1006,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
> */
> static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> - if ((s64)(se->vruntime - se->deadline) < 0)
> - return false;
> +
> + u64 vslice;
> + u64 tolerance = 0;
> + u64 next_deadline;
>
> /*
> * For EEVDF the virtual time slope is determined by w_i (iow.
> @@ -1016,11 +1018,42 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> */
> if (!se->custom_slice)
> se->slice = sysctl_sched_base_slice;
> + vslice = calc_delta_fair(se->slice, se);
> +
> + next_deadline = se->vruntime + vslice;
> +
> + if (sched_feat(FORWARD_DEADLINE))
> + tolerance = min(vslice>>7, TICK_NSEC/2);
> +
> + if ((s64)(se->vruntime + tolerance - se->deadline) < 0)
> + return false;
>
> /*
> - * EEVDF: vd_i = ve_i + r_i / w_i
> + * when se->vruntime + tolerance - se->deadline >= 0
> + * but se->vruntime - se->deadline < 0,
> + * there is two case: if entity is eligible?
> + * if entity is not eligible, we don't need wait deadline, because
> + * eevdf don't guarantee
> + * an ineligible entity can exec its request time in one go.
> + * but when entity eligible, just let it run, which is the
> + * same processing logic as before.
> */
> - se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> +
> + if (sched_feat(FORWARD_DEADLINE) && (s64)(se->vruntime - se->deadline) < 0) {
> + if (entity_eligible(cfs_rq, se)) {
> + return false;
> + } else {
> + /*
> + * in this case, entity's request size does not use light,
> + * but considering it is not eligible, we don't need exec it.
> + * and we let vd_i += r_i / w_i, make scheduler fairness.
> + */
> + next_deadline = se->deadline + vslice;
> + }
> + }
> +
> +
> + se->deadline = next_deadline;
>
> /*
> * The task has consumed its request, reschedule.
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 290874079f60..5c74deec7209 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -24,6 +24,13 @@ SCHED_FEAT(RUN_TO_PARITY, true)
> */
> SCHED_FEAT(PREEMPT_SHORT, true)
>
> +/*
> + * For some cases where the tick is faster than expected,
> + * move the deadline forward
> + */
> +SCHED_FEAT(FORWARD_DEADLINE, true)
> +
> +
> /*
> * Prefer to schedule the task we woke last (assuming it failed
> * wakeup-preemption), since its likely going to consume data we
> --
> 2.39.3 (Apple Git-146)
>
Thank you, Vincent Guittot, for resolving my confusion about the tick
error: why it is always less than 1ms on some machines.
It is normal for a tick not to equal exactly 1ms due to software or
hardware, but on some machines the tick is always less than 1ms, which
is a bit strange. I had not found a good explanation for it, but now
I know the reason: the root cause is CONFIG_IRQ_TIME_ACCOUNTING.
I used bpftrace to monitor changes in the rq clock (task) in the system:
kprobe:update_rq_clock_task /pid == 6388/
{
@rq = (struct rq *)arg0;
$delta = (int64)arg1;
@clock_pre = @rq->clock_task;
printf("rq clock delta is %llu\n", $delta);
}
kretprobe:update_rq_clock_task /pid == 6388/
{
$clock_post = @rq->clock_task;
printf("rq clock task delta: %llu\n", $clock_post - @clock_pre);
}
result:
rq clock delta is 999994
rq clock task delta: 996616
rq clock delta is 1000026
rq clock task delta: 996550
rq clock delta is 1000047
rq clock task delta: 996716
rq clock delta is 999995
rq clock task delta: 996454
rq clock delta is 1000058
rq clock task delta: 996621
rq clock delta is 999987
rq clock task delta: 996457
rq clock delta is 1000047
rq clock task delta: 996621
rq clock delta is 999966
rq clock task delta: 996594
rq clock delta is 1000071
rq clock task delta: 996470
rq clock delta is 1000073
rq clock task delta: 996586
rq clock delta is 999958
rq clock task delta: 996446
rq clock delta is 1000018
rq clock task delta: 996574
rq clock delta is 999993
rq clock task delta: 996908
rq clock delta is 1000037
rq clock task delta: 996547
As Vincent Guittot said:
> the delta of rq_clock_task is always
> less than 1ms on my system but the delta of rq_clock is sometimes
> above and sometime below 1ms
According to the kernel function update_rq_clock_task, both
CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_PARAVIRT_TIME_ACCOUNTING often
make the delta of rq_clock_task lower than 1ms. I counted 13016 delta
samples: in the end, 47% of the rq_clock deltas were less than 1ms,
but every rq_clock_task delta was less than 1ms.
To conduct a comparative experiment, I turned those configs off and
re-checked the clock deltas. The values of rq clock and rq clock task
became completely consistent. However, according to perf, there are
still tick errors (slice=3ms):
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
---------- ------ ------------ --------- --------- ---------
110.436513 [0001] perf[1414] 0.000 0.000 0.000
110.440490 [0001] bash[1341] 0.000 0.000 3.977
110.441490 [0001] bash[1344] 0.000 0.000 0.999
110.441548 [0001] perf[1414] 4.976 0.000 0.058
110.445491 [0001] bash[1344] 0.058 0.000 3.942
110.449490 [0001] bash[1341] 5.000 0.000 3.999
110.452490 [0001] bash[1344] 3.999 0.000 2.999
110.456491 [0001] bash[1341] 2.999 0.000 4.000
110.460489 [0001] bash[1344] 4.000 0.000 3.998
110.463490 [0001] bash[1341] 3.998 0.000 3.001
110.467493 [0001] bash[1344] 3.001 0.000 4.002
110.471490 [0001] bash[1341] 4.002 0.000 3.996
110.474489 [0001] bash[1344] 3.996 0.000 2.999
110.477490 [0001] bash[1341] 2.999 0.000 3.000
It seems that, with or without CONFIG_IRQ_TIME_ACCOUNTING, tick
errors can cause runtime to vary randomly between 3 and 4ms.
> This means that the task didn't effectively get its slice because of
> time spent in IRQ context. Would it be better to set a default slice
> slightly lower than an integer number of tick
We once considered subtracting a little from the slice when setting
it; for example, if someone sets 3ms, we could subtract 0.1ms and use
2.9ms. But this is not a good solution. If someone sets 3.1ms, should
we use 2.9ms or 3ms? There seems to be no particularly good option,
and it may lead to even greater system errors.
Changing the default value is a simple solution; in fact, we did this
on the old kernel we used (we simply set it to 2.9ms. On our old 6.6
kernel, the tick error caused processes with the same weight to get
different runtimes; the new kernel does not have this problem, but we
still submitted this patch because we thought unexpected behavior
might occur in other scenarios). However, beyond the kernel's default
value, different OSes seem to behave differently, and the default is
often an integer number of ticks, so we still hope to solve this
problem in the kernel.
On Tue, 17 Dec 2024 at 07:14, zhouzihan30 <15645113830zzh@gmail.com> wrote:
>
>
> Thank you Vincent Guittot for solving my confusion about tick error: why
> is it always less than 1ms on some machines.
>
> It is normal for tick not to be equal to 1ms due to software or hardware,
> but on some machines, tick is always less than 1ms, which is a bit strange.
> I have not provided a good explanation for it, but now I know the reason.
> The root cause is CONFIG_IRQ_TIME_ACCOUNTING.
>
> I used bpftrace to monitor changes in the rq clock (task) in the system:
>
> kprobe:update_rq_clock_task /pid == 6388/
> {
> @rq = (struct rq *)arg0;
> $delta = (int64)arg1;
> @clock_pre = @rq->clock_task;
> printf("rq clock delta is %llu\n", $delta);
> }
>
> kretprobe:update_rq_clock_task /pid == 6388/
> {
> $clock_post = @rq->clock_task;
> printf("rq clock task delta: %llu\n", $clock_post - @clock_pre);
> }
>
>
> result:
> rq clock delta is 999994
> rq clock task delta: 996616
> rq clock delta is 1000026
> rq clock task delta: 996550
> rq clock delta is 1000047
> rq clock task delta: 996716
> rq clock delta is 999995
> rq clock task delta: 996454
> rq clock delta is 1000058
> rq clock task delta: 996621
> rq clock delta is 999987
> rq clock task delta: 996457
> rq clock delta is 1000047
> rq clock task delta: 996621
> rq clock delta is 999966
> rq clock task delta: 996594
> rq clock delta is 1000071
> rq clock task delta: 996470
> rq clock delta is 1000073
> rq clock task delta: 996586
> rq clock delta is 999958
> rq clock task delta: 996446
> rq clock delta is 1000018
> rq clock task delta: 996574
> rq clock delta is 999993
> rq clock task delta: 996908
> rq clock delta is 1000037
> rq clock task delta: 996547
>
>
> As Vincent Guittot said:
>
> < the delta of rq_clock_task is always
> < less than 1ms on my system but the delta of rq_clock is sometimes
> < above and sometime below 1ms
>
> According to the kernel function: update_rq_clock_task, Both
> CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_PARAVIRT_TIME_ACCOUNTING often
> result in the delta of rq_clock_task being lower than 1ms. I counted
> 13016 delta cases, and in the end, 47% of the delta of rq_clock was
> less than 1ms, but all of the delta of rq_clock_task is always less
Having delta of rq_clock toggling above or below 1ms is normal because
of the clockevent precision, if the previous delta was longer than 1ms
then the next one will be shorter. But the average of several ticks
remains 1ms like in your trace above
> than 1ms
>
> In order to conduct a comparative experiment, I turned off those CONFIG
> and re checked the changes in clock, It is found that the values of
> rq clock and rq clock task become completely consistent, However,
> according to the information from perf, there are still errors in tick
> (slice=3ms) :
Did you check that the whole tick was accounted for the task ?
According to your trace of rq clock delta and rq clock task delta
above, most of the sum of 3 consecutives tick is greater than 3ms for
rq clock delta so I would assume that the sum of delta_exec would be
greater than 3ms as well after 3 ticks
>
> time cpu task name wait time sch delay run time
> [tid/pid] (msec) (msec) (msec)
> ---------- ------ ------------ --------- --------- ---------
> 110.436513 [0001] perf[1414] 0.000 0.000 0.000
> 110.440490 [0001] bash[1341] 0.000 0.000 3.977
> 110.441490 [0001] bash[1344] 0.000 0.000 0.999
> 110.441548 [0001] perf[1414] 4.976 0.000 0.058
> 110.445491 [0001] bash[1344] 0.058 0.000 3.942
> 110.449490 [0001] bash[1341] 5.000 0.000 3.999
> 110.452490 [0001] bash[1344] 3.999 0.000 2.999
> 110.456491 [0001] bash[1341] 2.999 0.000 4.000
> 110.460489 [0001] bash[1344] 4.000 0.000 3.998
> 110.463490 [0001] bash[1341] 3.998 0.000 3.001
> 110.467493 [0001] bash[1344] 3.001 0.000 4.002
> 110.471490 [0001] bash[1341] 4.002 0.000 3.996
> 110.474489 [0001] bash[1344] 3.996 0.000 2.999
> 110.477490 [0001] bash[1341] 2.999 0.000 3.000
>
>
> It seems that regardless of whether or not there is
> CONFIG_IRQ_TIME_ACCOUNTING, tick errors can cause random variations in
> runtime between 3 and 4ms.
>
>
>
> < This means that the task didn't effectively get its slice because of
> < time spent in IRQ context. Would it be better to set a default slice
> < slightly lower than an integer number of tick
>
> We once considered subtracting a little from a slice when setting it,
> for example, if someone sets 3ms, we can subtract 0.1ms from it and
> make it 2.9ms. But this is not a good solution. If someone sets it to
> 3.1ms, should we use 2.9ms or 3ms? There doesn't seem to be a
> particularly good option, and it may lead to even greater system errors.
And we end up giving less than its slice to task which could have set
it to this value for a good reason.
>
> Changing the default value is a simple solution, in fact, we did it on
> the old kernel we used (we just set it 2.9ms. On our old kernel 6.6,
> tick error caused processes with the same weight have different run time,
> the new kernel did not have this problem, but we still submitted this
> patch because we thought unexpected behavior might occur in other
> scenarios). However, apart from the kernel's default value,
> different OS seemes to have different behaviors, and the default value is
> often an integer number of tick... so we still hope to solve this
> problem in kernel.
>
>
>
>
>
>
>
From: zhouzihan30 <zhouzihan30@jd.com>
Thank you for your reply!
> Having delta of rq_clock toggling above or below 1ms is normal because
> of the clockevent precision, if the previous delta was longer than 1ms
> then the next one will be shorter. But the average of several ticks
> remains 1ms like in your trace above
>
> > than 1ms
> >
> > In order to conduct a comparative experiment, I turned off those CONFIG
> > and re checked the changes in clock, It is found that the values of
> > rq clock and rq clock task become completely consistent, However,
> > according to the information from perf, there are still errors in tick
> > (slice=3ms) :
>
> Did you check that the whole tick was accounted for the task ?
> According to your trace of rq clock delta and rq clock task delta
> above, most of the sum of 3 consecutives tick is greater than 3ms for
> rq clock delta so I would assume that the sum of delta_exec would be
> greater than 3ms as well after 3 ticks
>
> >      time   cpu  task name     wait time  sch delay  run time
> >                  [tid/pid]       (msec)     (msec)    (msec)
> > ----------  ------  ------------  ---------  ---------  ---------
> > 110.436513  [0001]  perf[1414]       0.000      0.000     0.000
> > 110.440490  [0001]  bash[1341]       0.000      0.000     3.977
> > 110.441490  [0001]  bash[1344]       0.000      0.000     0.999
> > 110.441548  [0001]  perf[1414]       4.976      0.000     0.058
> > 110.445491  [0001]  bash[1344]       0.058      0.000     3.942
> > 110.449490  [0001]  bash[1341]       5.000      0.000     3.999
> > 110.452490  [0001]  bash[1344]       3.999      0.000     2.999
> > 110.456491  [0001]  bash[1341]       2.999      0.000     4.000
> > 110.460489  [0001]  bash[1344]       4.000      0.000     3.998
> > 110.463490  [0001]  bash[1341]       3.998      0.000     3.001
> > 110.467493  [0001]  bash[1344]       3.001      0.000     4.002
> > 110.471490  [0001]  bash[1341]       4.002      0.000     3.996
> > 110.474489  [0001]  bash[1344]       3.996      0.000     2.999
> > 110.477490  [0001]  bash[1341]       2.999      0.000     3.000
I use perf to record the impact of tick errors on runtime. When
slice=3ms, two busy tasks compete for one CPU. If one task runs for
4ms, it means that three ticks added up to less than 3ms, so the task
can only switch out after a fourth tick, which is 1ms more. Based on
my observation, about 50% of the time is like this (without
CONFIG_IRQ_TIME_ACCOUNTING; with it, tasks run for 4ms even more
often, even though slice=3ms).
> > We once considered subtracting a little from a slice when setting it,
> > for example, if someone sets 3ms, we can subtract 0.1ms from it and
> > make it 2.9ms. But this is not a good solution. If someone sets it to
> > 3.1ms, should we use 2.9ms or 3ms? There doesn't seem to be a
> > particularly good option, and it may lead to even greater system errors.
>
> And we end up giving less than its slice to task which could have set
> it to this value for a good reason.
Thank you, I think we have reached an agreement that the time given to
a task at once should be less than or equal to its slice. EEVDF never
guarantees that a task must run a whole slice at once, but the kernel
tries to ensure this. However, due to tick errors, a task could exceed
its allotted time. I will propose a v2 patch to try to solve this
problem.