Due to tick timing inaccuracy,
the EEVDF scheduler exhibits unexpected behavior.
For example, a machine with sysctl_sched_base_slice=3ms and CONFIG_HZ=1000
should trigger a tick every 1ms.
A se (sched_entity) with the default weight of 1024 should
theoretically reach its deadline on the third tick.
However, ticks often arrive slightly earlier than expected.
When that happens, the se must wait until the next tick before it is
considered to have reached its deadline, and so runs 1ms longer.
vruntime + sysctl_sched_base_slice = deadline
|-----------|-----------|-----------|-----------|
1ms 1ms 1ms 1ms
^ ^ ^ ^
tick1 tick2 tick3 tick4(nearly 4ms)
Here is a simple example of this scenario, where
sysctl_sched_base_slice=3ms, CONFIG_HZ=1000, the CPU is an Intel(R)
Xeon(R) Platinum 8338C CPU @ 2.60GHz, and "while :; do :; done &" is
run twice, with pids 72112 and 72113. By the design of EEVDF, each
task should run for 3ms at a time, but they often run for 4ms.
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
------------- ------ ------------- --------- --------- ---------
56696.846136 [0001] perf[72368] 0.000 0.000 0.000
56696.849378 [0001] bash[72112] 0.000 0.000 3.241
56696.852379 [0001] bash[72113] 0.000 0.000 3.000
56696.852964 [0001] sleep[72369] 0.000 6.261 0.584
56696.856378 [0001] bash[72112] 3.585 0.000 3.414
56696.860379 [0001] bash[72113] 3.999 0.000 4.000
56696.864379 [0001] bash[72112] 4.000 0.000 4.000
56696.868377 [0001] bash[72113] 4.000 0.000 3.997
56696.871378 [0001] bash[72112] 3.997 0.000 3.000
56696.874377 [0001] bash[72113] 3.000 0.000 2.999
56696.877377 [0001] bash[72112] 2.999 0.000 2.999
56696.881377 [0001] bash[72113] 2.999 0.000 3.999
The reason for this problem is that
the actual length of each tick is less than 1ms.
We believe there are two causes:
Reason 1:
Hardware error.
An ordinary computer's clock cannot guarantee perfectly accurate timing.
Reason 2:
Calculation error.
Many clocks derive time indirectly from a cycle count; this conversion
inevitably has error and rounds down, so the result is smaller than the
actual value. The relevant kernel code is:
	clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
	dev->set_next_event((unsigned long) clc, dev);
To solve this problem,
we add a sched feature, FORWARD_DEADLINE,
which moves the deadline forward slightly:
when vruntime is very close to the deadline,
the deadline is considered to have been reached.
The tolerance is set to vslice/128.
On our machine the tick error never exceeds this tolerance,
and an error of less than 1%
should not affect the scheduler's fairness.
With FORWARD_DEADLINE enabled, each task runs for 3ms at a time,
as EEVDF intends:
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
------------- ------ ---------------- --------- --------- ---------
57110.032369 [0001] perf[72752] 0.000 0.000 0.000
57110.035805 [0001] bash[72112] 0.000 0.000 3.436
57110.036400 [0001] sleep[72755] 0.000 3.456 0.594
57110.039806 [0001] bash[72113] 0.000 0.000 3.405
57110.042805 [0001] bash[72112] 4.000 0.000 2.999
57110.045811 [0001] bash[72113] 2.999 0.000 3.005
57110.045816 [0001] ksoftirqd/1[98] 0.000 0.001 0.005
57110.048804 [0001] bash[72112] 3.010 0.000 2.987
57110.051805 [0001] bash[72113] 2.993 0.000 3.001
57110.054804 [0001] bash[72112] 3.001 0.000 2.998
57110.057805 [0001] bash[72113] 2.998 0.000 3.000
57110.060804 [0001] bash[72112] 3.000 0.000 2.999
57110.063805 [0001] bash[72113] 2.999 0.000 3.001
57110.066804 [0001] bash[72112] 3.001 0.000 2.999
57110.069805 [0001] bash[72113] 2.999 0.000 3.000
57110.072804 [0001] bash[72112] 3.000 0.000 2.999
---
kernel/sched/fair.c | 21 ++++++++++++++++++---
kernel/sched/features.h | 7 +++++++
2 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d16c8545c71..ea342ef8db26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1006,8 +1006,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
*/
static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- if ((s64)(se->vruntime - se->deadline) < 0)
- return false;
+
+ u64 vslice;
+ u64 tolerance = 0;
/*
* For EEVDF the virtual time slope is determined by w_i (iow.
@@ -1016,11 +1017,25 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
if (!se->custom_slice)
se->slice = sysctl_sched_base_slice;
+ vslice = calc_delta_fair(se->slice, se);
+
+
+	/*
+	 * When vruntime is within vslice/128 of the deadline,
+	 * consider the deadline to have been reached.
+	 */
+ if (sched_feat(FORWARD_DEADLINE))
+ tolerance = vslice>>7;
+
+
+
+ if ((s64)(se->vruntime + tolerance - se->deadline) < 0)
+ return false;
/*
* EEVDF: vd_i = ve_i + r_i / w_i
*/
- se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+ se->deadline = se->vruntime + vslice;
/*
* The task has consumed its request, reschedule.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 290874079f60..fad8e2bbf4ed 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -24,6 +24,13 @@ SCHED_FEAT(RUN_TO_PARITY, true)
*/
SCHED_FEAT(PREEMPT_SHORT, true)
+/*
+ * For some cases where the tick is faster than expected,
+ * move the deadline forward.
+ */
+SCHED_FEAT(FORWARD_DEADLINE, true)
+
+
/*
* Prefer to schedule the task we woke last (assuming it failed
* wakeup-preemption), since its likely going to consume data we
--
2.39.3 (Apple Git-146)
Signed-off-by: zhouzihan30 <zhouzihan30@jd.com>
Signed-off-by: yaozhenguo <yaozhenguo@jd.com>
On Tue, Nov 26, 2024 at 02:23:50PM +0800, zhouzihan30 wrote:
> Due to the problem of tick time accuracy,
> the eevdf scheduler exhibits unexpected behavior.
> For example, a machine with sysctl_sched_base_slice=3ms, CONFIG_HZ=1000
> should trigger a tick every 1ms.
> A se (sched_entity) with default weight 1024 should
> theoretically reach its deadline on the third tick.
> However, the tick often arrives a little faster than expected.
> In this case, the se can only wait until the next tick to
> consider that it has reached its deadline, and will run 1ms longer.
>
> vruntime + sysctl_sched_base_slice = deadline
> |-----------|-----------|-----------|-----------|
> 1ms 1ms 1ms 1ms
> ^ ^ ^ ^
> tick1 tick2 tick3 tick4(nearly 4ms)
>
> Here is a simple example of this scenario,
> where sysctl_sched_base_slice=3ms, CONFIG_HZ=1000,
> the CPU is Intel(R) Xeon(R) Platinum 8338C CPU @ 2.60GHz,
> and "while :; do :; done &" is run twice with pids 72112 and 72113.
> According to the design of eevdf,
> both should run for 3ms each, but often they run for 4ms.
>
> The reason for this problem is that
> the actual time of each tick is less than 1ms.
> We believe there are two reasons:
> Reason 1:
> Hardware error.
> The clock of an ordinary computer cannot guarantee its own accurate time.
> Reason 2:
> Calculation error.
> Many clocks calculate time indirectly through the number of cycles,
> which will definitely have errors and be smaller than the actual value,
> the kernel code is:
>
> clc= ((unsignedlonglong) delta*dev->mult) >>dev->shift;
> dev->set_next_event((unsignedlong) clc, dev);
>
> To solve this problem,
> we add a sched feature FORWARD_DEADLINE,
> consider forwarding the deadline appropriately.
> When vruntime is very close to the deadline,
> we consider that the deadline has been reached.
> This tolerance is set to vslice/128.
> On our machine, the tick error will not be greater than this tolerance,
> and an error of less than 1%
> should not affect the fairness of the scheduler.
*sigh*
So yeah, discrete system, we get to quantize things. Timing has an
error proportional to the quantum chosen.
Current setup is the trivial whole integer or floor function. As such
the absolute error e is 0<e<q -- for a reason, see below.
You then introduce an additional error of: r/128 in order to minimize
the total error, except you made it worse. Consider case where r/128>q.
The only semi sane thing to do here is to replace floor with rounding,
then the absolute error changes to 0<e<q/2.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbdca89c677f..d01b73e3f5aa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1006,7 +1006,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
*/
static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- if ((s64)(se->vruntime - se->deadline) < 0)
+ s64 v_half_tick = calc_delta_fair(TICK_NSEC/2, se);
+
+ if ((s64)(se->vruntime - se->deadline) < -v_half_tick)
return false;
/*
Now the consequence of doing this are that you will end up pushing a
task's (virtual) deadline into the future before it will have had time
to consume its full request, meaning its effective priority drops before
completion.
While it does not harm fairness, it does harm to the guarantees provided
by EEVDF such as they are.
Additionally, since update_deadline() is not proper (see its comment),
you're making the cumulative error significantly worse. A better version
would be something like:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fbdca89c677f..377e61be2cf6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1000,13 +1000,12 @@ int sched_update_scaling(void)
static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
-/*
- * XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
- * this is probably good enough.
- */
static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- if ((s64)(se->vruntime - se->deadline) < 0)
+ s64 v_half_tick = calc_delta_fair(TICK_NSEC/2, se);
+ s64 v_slice;
+
+ if ((s64)(se->vruntime - se->deadline) < -v_half_tick)
return false;
/*
@@ -1017,10 +1016,14 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (!se->custom_slice)
se->slice = sysctl_sched_base_slice;
+ v_slice = calc_delta_fair(se->slice, se);
+
/*
- * EEVDF: vd_i = ve_i + r_i / w_i
+ * EEVDF: vd_i += r_i / w_i
*/
- se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+ do {
+ se->deadline += v_slice;
+ } while ((s64)(se->vruntime - se->deadline) < v_half_tick);
/*
* The task has consumed its request, reschedule.
Now, if all this is worth it I've no idea.
Since you clearly do not care about the request size, your simplest
solution is to simply decrease it.
Also, none of this has been near a compiler, and please double check the
math if you want to proceed with any of this, I've given that about as
much thought as compile time :-)
From: zhouzihan <zhouzihan30@jd.com>
Thank you very much for your reply, which has given me a lot to
think about.
I have reconsidered this issue and believe the root cause is that
implementing ideal EEVDF in the kernel is both difficult and
unnecessary on real hardware:
for example,
ideal EEVDF could cause frequent context switches, whose cost the
kernel cannot really afford.
So I see that the kernel has a very concise and clever implementation,
update_deadline, which lets each task use up its request size
in one go as much as possible.
Here the kernel already slightly violates EEVDF: we no longer care,
for the time being, whether a task is eligible.
The previous patch mentioned that due to tick errors, some tasks
run longer than requested. So we can do this: when a task's vruntime
approaches its deadline, check whether the task is eligible.
If it is eligible, we follow the previous logic and do not reschedule it.
If it is ineligible, we reschedule it, because EEVDF has the
responsibility not to run ineligible tasks.
In other words, the kernel has granted the task a "benefit" based on
the actual situation, and we retain the right to revoke that benefit.
This actually brings the kernel closer to ideal EEVDF,
and at the same time, your reply made me realize my mistake:
the deadline should be updated with the following formula,
which is more reasonable:
vd_i += r_i / w_i
With it, the scheduler is still fair,
and each task obtains time according to its weight.
For the tolerance, I think min(vslice/128, tick/2) is better:
as your reply notes, vslice may be too large, so cap it.
However, there is still a new issue here:
if se 1 has its deadline pushed prematurely because it is ineligible,
and then se 2 runs, se 1 may become eligible after se 2 finishes,
but its deadline has already been pushed back,
so it is no longer the earliest eligible entity
and cannot run. This does not strictly comply with EEVDF.
But I think this is acceptable. Previously, the logic let a task
run an extra tick (about 1ms), meaning other tasks had to wait an
extra tick. Now a task may run slightly less (microseconds), but it
will be compensated in the next scheduling round. In terms of overall
accuracy, I think this is an improvement.
By the way, we could try to solve this by delaying the deadline update:
schedule the task away without updating its deadline,
and update it the next time it runs.
I am not sure such complex logic is necessary here.
I think it is not,
because ideal EEVDF may not suit the kernel,
and it is not worth spending so much to eliminate such a small error.
A truly accurate system should have no tick error
and can simply disable the FORWARD_DEADLINE feature.
Of course, if you think it is necessary, I am willing to try to implement it.
Signed-off-by: zhouzihan30 <zhouzihan30@jd.com>
---
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
kernel/sched/features.h | 7 +++++++
2 files changed, 44 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d16c8545c71..e6e58c7d6d4c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1006,8 +1006,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
*/
static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- if ((s64)(se->vruntime - se->deadline) < 0)
- return false;
+
+ u64 vslice;
+ u64 tolerance = 0;
+ u64 next_deadline;
/*
* For EEVDF the virtual time slope is determined by w_i (iow.
@@ -1016,11 +1018,42 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
*/
if (!se->custom_slice)
se->slice = sysctl_sched_base_slice;
+ vslice = calc_delta_fair(se->slice, se);
+
+ next_deadline = se->vruntime + vslice;
+
+ if (sched_feat(FORWARD_DEADLINE))
+ tolerance = min(vslice>>7, TICK_NSEC/2);
+
+ if ((s64)(se->vruntime + tolerance - se->deadline) < 0)
+ return false;
/*
- * EEVDF: vd_i = ve_i + r_i / w_i
+	 * When se->vruntime + tolerance - se->deadline >= 0
+	 * but se->vruntime - se->deadline < 0,
+	 * there are two cases, depending on eligibility:
+	 * if the entity is not eligible, we need not wait for the
+	 * deadline, because EEVDF does not guarantee that an ineligible
+	 * entity can consume its whole request in one go;
+	 * but when the entity is eligible, just let it run, which is
+	 * the same processing logic as before.
*/
- se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
+
+ if (sched_feat(FORWARD_DEADLINE) && (s64)(se->vruntime - se->deadline) < 0) {
+ if (entity_eligible(cfs_rq, se)) {
+ return false;
+ } else {
+ /*
+			 * In this case the entity has not fully consumed
+			 * its request, but since it is not eligible we need
+			 * not run it; let vd_i += r_i / w_i for fairness.
+ */
+ next_deadline = se->deadline + vslice;
+ }
+ }
+
+
+ se->deadline = next_deadline;
/*
* The task has consumed its request, reschedule.
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 290874079f60..5c74deec7209 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -24,6 +24,13 @@ SCHED_FEAT(RUN_TO_PARITY, true)
*/
SCHED_FEAT(PREEMPT_SHORT, true)
+/*
+ * For some cases where the tick is faster than expected,
+ * move the deadline forward
+ */
+SCHED_FEAT(FORWARD_DEADLINE, true)
+
+
/*
* Prefer to schedule the task we woke last (assuming it failed
* wakeup-preemption), since its likely going to consume data we
--
2.39.3 (Apple Git-146)
On Wed, 27 Nov 2024 at 08:06, zhouzihan30 <15645113830zzh@gmail.com> wrote:
>
> From: zhouzihan <zhouzihan30@jd.com>
>
> Thank you very much for your reply, which has brought me lots
> of thoughts.
>
> I have reconsidered this issue and believe that the root cause is that
> the kernel is difficult and unnecessary to implement an ideal eevdf
> due to real hardware:
> for example,
> an ideal eevdf may bring frequent switching, its cost makes kernel can't
> really do it.
>
> So I see that the kernel has a very concise and clever implementation:
> update_deadline, which allows each task to use up the request size
> as much as possible in one go.
>
> Here, the kernel has actually slightly violated eevdf: we are no longer
> concerned about whether a task is eligible for the time being.
>
> In the prev patch, it was mentioned that due to tick errors, some tasks
> run longer than requested. So if we can do this: when a task vruntime
Could you give more details about the tick error ?
Could it be that you have irq accounting enabled ? In this case the
delta_exec will always be lower than tick because of the time spent in
the irq context. As an example, the delta of rq_clock_task is always
less than 1ms on my system but the delta of rq_clock is sometimes
above and sometime below 1ms
This means that the task didn't effectively get its slice because of
time spent in IRQ context. Would it be better to set a default slice
slightly lower than an integer number of tick
> approaches the deadline, we check if the task is eligible.
> If it is eligible, we follow the previous logic and do not schedule it.
> However, if it is ineligible, we schedule it because eevdf has the
> responsibility to not exec ineligible task.
>
> In other words, the kernel has given the task a "benefit" based on the
> actual situation, and we still have the right to revoke this benefit.
>
> In this way, it actually brings the kernel closer to an ideal eevdf,
> and at the same time, your reply made me realize my mistake:
> The deadline update should be updated using the following function,
> which is more reasonable:
> vd_i += r_i / w_i
> By using it, our scheduler is still fair,
> and each task can obtain its own time according to its weight.
>
> About tolerance, I think min(vslice/128, tick/2) is better,
> as your reply, vslice maybe too big, so use it.
>
> However, there is still a new issue here:
> If a se 1 terminates its deadline prematurely due to ineligible,
> and then a se 2 runs, after the end of the run,
> se 1 becomes eligible, but its deadline has already been pushed back,
> which is not earliest eligible,
> so se 1 can't run. This seems to not comply with eevdf specifications.
>
> But I think this is acceptable. In the past, the logic causes a task to
> run an extra tick (ms), which means other tasks have to wait for an
> extra tick. Now, a task maybe run less time (us), but it will be
> compensated for in the next scheduling. In terms of overall accuracy,
> I think it has improved.
>
> By the way, we may try to solve this by delaying the deadline update,
> which means we schedule a task but not update its deadline,
> util next exec it.
> I am not sure if it is necessary to implement such complex logic here.
> I think it is actually unnecessary,
> because the ideal eevdf may not be suitable for the kernel.
> It is no need to spend so much to solve the error of few time.
> If there is a truly completely accurate system, it should not have
> tick error, and just closes the FORWARD_DEADLINE feature.
> Of course, if you think it is necessary, I am willing try to implement it.
>
> Signed-off-by: zhouzihan30 <zhouzihan30@jd.com>
> ---
> kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++++++++++----
> kernel/sched/features.h | 7 +++++++
> 2 files changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2d16c8545c71..e6e58c7d6d4c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1006,8 +1006,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
> */
> static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> - if ((s64)(se->vruntime - se->deadline) < 0)
> - return false;
> +
> + u64 vslice;
> + u64 tolerance = 0;
> + u64 next_deadline;
>
> /*
> * For EEVDF the virtual time slope is determined by w_i (iow.
> @@ -1016,11 +1018,42 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> */
> if (!se->custom_slice)
> se->slice = sysctl_sched_base_slice;
> + vslice = calc_delta_fair(se->slice, se);
> +
> + next_deadline = se->vruntime + vslice;
> +
> + if (sched_feat(FORWARD_DEADLINE))
> + tolerance = min(vslice>>7, TICK_NSEC/2);
> +
> + if ((s64)(se->vruntime + tolerance - se->deadline) < 0)
> + return false;
>
> /*
> - * EEVDF: vd_i = ve_i + r_i / w_i
> + * when se->vruntime + tolerance - se->deadline >= 0
> + * but se->vruntime - se->deadline < 0,
> + * there is two case: if entity is eligible?
> + * if entity is not eligible, we don't need wait deadline, because
> + * eevdf don't guarantee
> + * an ineligible entity can exec its request time in one go.
> + * but when entity eligible, just let it run, which is the
> + * same processing logic as before.
> */
> - se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
> +
> + if (sched_feat(FORWARD_DEADLINE) && (s64)(se->vruntime - se->deadline) < 0) {
> + if (entity_eligible(cfs_rq, se)) {
> + return false;
> + } else {
> + /*
> + * in this case, entity's request size does not use light,
> + * but considering it is not eligible, we don't need exec it.
> + * and we let vd_i += r_i / w_i, make scheduler fairness.
> + */
> + next_deadline = se->deadline + vslice;
> + }
> + }
> +
> +
> + se->deadline = next_deadline;
>
> /*
> * The task has consumed its request, reschedule.
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 290874079f60..5c74deec7209 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -24,6 +24,13 @@ SCHED_FEAT(RUN_TO_PARITY, true)
> */
> SCHED_FEAT(PREEMPT_SHORT, true)
>
> +/*
> + * For some cases where the tick is faster than expected,
> + * move the deadline forward
> + */
> +SCHED_FEAT(FORWARD_DEADLINE, true)
> +
> +
> /*
> * Prefer to schedule the task we woke last (assuming it failed
> * wakeup-preemption), since its likely going to consume data we
> --
> 2.39.3 (Apple Git-146)
>
Thank you, Vincent Guittot, for resolving my confusion about the tick
error: why it is always less than 1ms on some machines.
It is normal for a tick not to equal exactly 1ms due to software or
hardware, but on some machines the tick is always less than 1ms, which
is a bit strange. I had not found a good explanation for it, but now
I know the reason: the root cause is CONFIG_IRQ_TIME_ACCOUNTING.
I used bpftrace to monitor changes in the rq clock (task) in the system:
kprobe:update_rq_clock_task /pid == 6388/
{
@rq = (struct rq *)arg0;
$delta = (int64)arg1;
@clock_pre = @rq->clock_task;
printf("rq clock delta is %llu\n", $delta);
}
kretprobe:update_rq_clock_task /pid == 6388/
{
$clock_post = @rq->clock_task;
printf("rq clock task delta: %llu\n", $clock_post - @clock_pre);
}
result:
rq clock delta is 999994
rq clock task delta: 996616
rq clock delta is 1000026
rq clock task delta: 996550
rq clock delta is 1000047
rq clock task delta: 996716
rq clock delta is 999995
rq clock task delta: 996454
rq clock delta is 1000058
rq clock task delta: 996621
rq clock delta is 999987
rq clock task delta: 996457
rq clock delta is 1000047
rq clock task delta: 996621
rq clock delta is 999966
rq clock task delta: 996594
rq clock delta is 1000071
rq clock task delta: 996470
rq clock delta is 1000073
rq clock task delta: 996586
rq clock delta is 999958
rq clock task delta: 996446
rq clock delta is 1000018
rq clock task delta: 996574
rq clock delta is 999993
rq clock task delta: 996908
rq clock delta is 1000037
rq clock task delta: 996547
As Vincent Guittot said:
> the delta of rq_clock_task is always
> less than 1ms on my system but the delta of rq_clock is sometimes
> above and sometime below 1ms
According to the kernel function update_rq_clock_task, both
CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_PARAVIRT_TIME_ACCOUNTING often
make the delta of rq_clock_task lower than 1ms. I counted 13016 delta
samples: in the end, 47% of the rq_clock deltas were less than 1ms,
but every rq_clock_task delta was less than 1ms.
To conduct a comparative experiment, I turned those configs off and
re-checked the clock deltas. The values of rq clock and rq clock task
became completely consistent. However, according to perf, there are
still tick errors (slice=3ms):
time cpu task name wait time sch delay run time
[tid/pid] (msec) (msec) (msec)
---------- ------ ------------ --------- --------- ---------
110.436513 [0001] perf[1414] 0.000 0.000 0.000
110.440490 [0001] bash[1341] 0.000 0.000 3.977
110.441490 [0001] bash[1344] 0.000 0.000 0.999
110.441548 [0001] perf[1414] 4.976 0.000 0.058
110.445491 [0001] bash[1344] 0.058 0.000 3.942
110.449490 [0001] bash[1341] 5.000 0.000 3.999
110.452490 [0001] bash[1344] 3.999 0.000 2.999
110.456491 [0001] bash[1341] 2.999 0.000 4.000
110.460489 [0001] bash[1344] 4.000 0.000 3.998
110.463490 [0001] bash[1341] 3.998 0.000 3.001
110.467493 [0001] bash[1344] 3.001 0.000 4.002
110.471490 [0001] bash[1341] 4.002 0.000 3.996
110.474489 [0001] bash[1344] 3.996 0.000 2.999
110.477490 [0001] bash[1341] 2.999 0.000 3.000
It seems that, with or without CONFIG_IRQ_TIME_ACCOUNTING, tick
errors can cause runtime to vary randomly between 3 and 4ms.
> This means that the task didn't effectively get its slice because of
> time spent in IRQ context. Would it be better to set a default slice
> slightly lower than an integer number of tick
We once considered subtracting a little from the slice when setting
it; for example, if someone sets 3ms, we could subtract 0.1ms and use
2.9ms. But this is not a good solution. If someone sets 3.1ms, should
we use 2.9ms or 3ms? There seems to be no particularly good option,
and it may lead to even greater system errors.
Changing the default value is a simple solution; in fact, we did this
on the old kernel we used (we simply set it to 2.9ms. On our old 6.6
kernel, the tick error caused processes with the same weight to get
different runtimes; the new kernel does not have this problem, but we
still submitted this patch because we thought unexpected behavior
might occur in other scenarios). However, beyond the kernel's default
value, different OSes seem to behave differently, and the default is
often an integer number of ticks, so we still hope to solve this
problem in the kernel.
On Tue, 17 Dec 2024 at 07:14, zhouzihan30 <15645113830zzh@gmail.com> wrote:
>
>
> Thank you Vincent Guittot for solving my confusion about tick error: why
> is it always less than 1ms on some machines.
>
> It is normal for tick not to be equal to 1ms due to software or hardware,
> but on some machines, tick is always less than 1ms, which is a bit strange.
> I have not provided a good explanation for it, but now I know the reason.
> The root cause is CONFIG_IRQ_TIME_ACCOUNTING.
>
> I used bpftrace to monitor changes in the rq clock (task) in the system:
>
> kprobe:update_rq_clock_task /pid == 6388/
> {
> @rq = (struct rq *)arg0;
> $delta = (int64)arg1;
> @clock_pre = @rq->clock_task;
> printf("rq clock delta is %llu\n", $delta);
> }
>
> kretprobe:update_rq_clock_task /pid == 6388/
> {
> $clock_post = @rq->clock_task;
> printf("rq clock task delta: %llu\n", $clock_post - @clock_pre);
> }
>
>
> result:
> rq clock delta is 999994
> rq clock task delta: 996616
> rq clock delta is 1000026
> rq clock task delta: 996550
> rq clock delta is 1000047
> rq clock task delta: 996716
> rq clock delta is 999995
> rq clock task delta: 996454
> rq clock delta is 1000058
> rq clock task delta: 996621
> rq clock delta is 999987
> rq clock task delta: 996457
> rq clock delta is 1000047
> rq clock task delta: 996621
> rq clock delta is 999966
> rq clock task delta: 996594
> rq clock delta is 1000071
> rq clock task delta: 996470
> rq clock delta is 1000073
> rq clock task delta: 996586
> rq clock delta is 999958
> rq clock task delta: 996446
> rq clock delta is 1000018
> rq clock task delta: 996574
> rq clock delta is 999993
> rq clock task delta: 996908
> rq clock delta is 1000037
> rq clock task delta: 996547
>
>
> As Vincent Guittot said:
>
> < the delta of rq_clock_task is always
> < less than 1ms on my system but the delta of rq_clock is sometimes
> < above and sometime below 1ms
>
> According to the kernel function: update_rq_clock_task, Both
> CONFIG_IRQ_TIME_ACCOUNTING and CONFIG_PARAVIRT_TIME_ACCOUNTING often
> result in the delta of rq_clock_task being lower than 1ms. I counted
> 13016 delta cases, and in the end, 47% of the delta of rq_clock was
> less than 1ms, but all of the delta of rq_clock_task is always less
Having delta of rq_clock toggling above or below 1ms is normal because
of the clockevent precision, if the previous delta was longer than 1ms
then the next one will be shorter. But the average of several ticks
remains 1ms like in your trace above
> than 1ms
>
> In order to conduct a comparative experiment, I turned off those CONFIG
> and re checked the changes in clock, It is found that the values of
> rq clock and rq clock task become completely consistent, However,
> according to the information from perf, there are still errors in tick
> (slice=3ms) :
Did you check that the whole tick was accounted for the task ?
According to your trace of rq clock delta and rq clock task delta
above, most of the sum of 3 consecutives tick is greater than 3ms for
rq clock delta so I would assume that the sum of delta_exec would be
greater than 3ms as well after 3 ticks
>
> time cpu task name wait time sch delay run time
> [tid/pid] (msec) (msec) (msec)
> ---------- ------ ------------ --------- --------- ---------
> 110.436513 [0001] perf[1414] 0.000 0.000 0.000
> 110.440490 [0001] bash[1341] 0.000 0.000 3.977
> 110.441490 [0001] bash[1344] 0.000 0.000 0.999
> 110.441548 [0001] perf[1414] 4.976 0.000 0.058
> 110.445491 [0001] bash[1344] 0.058 0.000 3.942
> 110.449490 [0001] bash[1341] 5.000 0.000 3.999
> 110.452490 [0001] bash[1344] 3.999 0.000 2.999
> 110.456491 [0001] bash[1341] 2.999 0.000 4.000
> 110.460489 [0001] bash[1344] 4.000 0.000 3.998
> 110.463490 [0001] bash[1341] 3.998 0.000 3.001
> 110.467493 [0001] bash[1344] 3.001 0.000 4.002
> 110.471490 [0001] bash[1341] 4.002 0.000 3.996
> 110.474489 [0001] bash[1344] 3.996 0.000 2.999
> 110.477490 [0001] bash[1341] 2.999 0.000 3.000
>
>
> It seems that regardless of whether or not there is
> CONFIG_IRQ_TIME_ACCOUNTING, tick errors can cause random variations in
> runtime between 3 and 4ms.
>
>
>
> < This means that the task didn't effectively get its slice because of
> < time spent in IRQ context. Would it be better to set a default slice
> < slightly lower than an integer number of tick
>
> We once considered subtracting a little from a slice when setting it,
> for example, if someone sets 3ms, we can subtract 0.1ms from it and
> make it 2.9ms. But this is not a good solution. If someone sets it to
> 3.1ms, should we use 2.9ms or 3ms? There doesn't seem to be a
> particularly good option, and it may lead to even greater system errors.
And we end up giving less than its slice to task which could have set
it to this value for a good reason.
>
> Changing the default value is a simple solution, in fact, we did it on
> the old kernel we used (we just set it 2.9ms. On our old kernel 6.6,
> tick error caused processes with the same weight have different run time,
> the new kernel did not have this problem, but we still submitted this
> patch because we thought unexpected behavior might occur in other
> scenarios). However, apart from the kernel's default value,
> different OS seemes to have different behaviors, and the default value is
> often an integer number of tick... so we still hope to solve this
> problem in kernel.
>
>
>
>
>
>
>
From: zhouzihan30 <zhouzihan30@jd.com>
Thank you for your reply!
> Having delta of rq_clock toggling above or below 1ms is normal because
> of the clockevent precision, if the previous delta was longer than 1ms
> then the next one will be shorter. But the average of several ticks
> remains 1ms like in your trace above
>
> > than 1ms
> >
> > In order to conduct a comparative experiment, I turned off those CONFIG
> > and re checked the changes in clock, It is found that the values of
> > rq clock and rq clock task become completely consistent, However,
> > according to the information from perf, there are still errors in tick
> > (slice=3ms) :
>
> Did you check that the whole tick was accounted for the task ?
> According to your trace of rq clock delta and rq clock task delta
> above, most of the sum of 3 consecutives tick is greater than 3ms for
> rq clock delta so I would assume that the sum of delta_exec would be
> greater than 3ms as well after 3 ticks
>
> >      time   cpu  task name     wait time  sch delay  run time
> >                  [tid/pid]       (msec)     (msec)    (msec)
> > ----------  ------  ------------  ---------  ---------  ---------
> > 110.436513  [0001]  perf[1414]       0.000      0.000     0.000
> > 110.440490  [0001]  bash[1341]       0.000      0.000     3.977
> > 110.441490  [0001]  bash[1344]       0.000      0.000     0.999
> > 110.441548  [0001]  perf[1414]       4.976      0.000     0.058
> > 110.445491  [0001]  bash[1344]       0.058      0.000     3.942
> > 110.449490  [0001]  bash[1341]       5.000      0.000     3.999
> > 110.452490  [0001]  bash[1344]       3.999      0.000     2.999
> > 110.456491  [0001]  bash[1341]       2.999      0.000     4.000
> > 110.460489  [0001]  bash[1344]       4.000      0.000     3.998
> > 110.463490  [0001]  bash[1341]       3.998      0.000     3.001
> > 110.467493  [0001]  bash[1344]       3.001      0.000     4.002
> > 110.471490  [0001]  bash[1341]       4.002      0.000     3.996
> > 110.474489  [0001]  bash[1344]       3.996      0.000     2.999
> > 110.477490  [0001]  bash[1341]       2.999      0.000     3.000
I use perf to record the impact of tick errors on runtime. When
slice=3ms, two busy tasks compete for one CPU. If one task runs for
4ms, it means that three ticks added up to less than 3ms, so the task
can only switch out after a fourth tick, which is 1ms more. Based on
my observation, about 50% of the time is like this (without
CONFIG_IRQ_TIME_ACCOUNTING; with it, tasks run for 4ms even more
often, even though slice=3ms).
> > We once considered subtracting a little from a slice when setting it,
> > for example, if someone sets 3ms, we can subtract 0.1ms from it and
> > make it 2.9ms. But this is not a good solution. If someone sets it to
> > 3.1ms, should we use 2.9ms or 3ms? There doesn't seem to be a
> > particularly good option, and it may lead to even greater system errors.
>
> And we end up giving less than its slice to task which could have set
> it to this value for a good reason.
Thank you, I think we have reached an agreement that the time given to
a task at once should be less than or equal to its slice. EEVDF never
guarantees that a task must run a whole slice at once, but the kernel
tries to ensure this. However, due to tick errors, a task could exceed
its allotted time. I will propose a v2 patch to try to solve this
problem.