Introduce a check for short-duration tasks; this attribute will be
leveraged for better task placement.

There are several candidate metrics that could indicate whether a
task is a short-duration task.

At first thought, (p->se.sum_exec_runtime / p->nvcsw) could be used
to measure the task duration. However, such a formula weighs
long-past history too heavily: ideally, old activity should decay
and not dominate the task's current status.
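As a worked example (illustrative numbers only): a task that ran 100
sections of 100 ms each early in its life, followed by 100 sections of
0.1 ms, still shows sum_exec_runtime / nvcsw =
(100 * 100 ms + 100 * 0.1 ms) / 200 ~= 50 ms, even though every recent
section was three orders of magnitude shorter. A decaying average would
instead have converged toward 0.1 ms.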
Something based on PELT could be used instead, but se.util_avg is not
a good fit for describing task duration:

1. If task p1 and task p2 are doing frequent ping-pong scheduling on
   one CPU, both p1 and p2 have a short duration, yet the util_avg of
   each can be up to 50%.

2. Suppose a task lasting less than 4 ms is regarded as a short task.
   If task p3 runs for 6 ms and then sleeps for 32 ms, p3 should not be
   a short-duration task. However, PELT would decay p3's accumulated
   running time from 6 ms to 3 ms, because 32 ms is the PELT half-life.
   As a result, p3 would be incorrectly treated as a short task (a
   decay sketch follows this list).
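For intuition, the decay in case 2 can be modeled as follows (a
deliberately simplified sketch, not the real PELT code, which
accumulates in 1024 us segments with a decay factor y where y^32 = 0.5;
the helper name is made up for illustration):

	/*
	 * Simplified model of PELT decay: a past contribution
	 * roughly halves for every 32 ms of sleep.
	 */
	static u64 pelt_decay_ms(u64 contrib_ms, u64 sleep_ms)
	{
		return contrib_ms >> (sleep_ms / 32);
	}

	/* p3: 6 ms of runtime, then 32 ms of sleep -> 6 >> 1 = 3 ms */

With a 4 ms cutoff, the decayed 3 ms makes p3 look short even though it
actually ran for 6 ms.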
A similar feature for tracking the duration of a task existed once, in
commit ad4b78bbcbab ("sched: Add new wakeup preemption mode:
WAKEUP_RUNNING"). Unfortunately, it was reverted because it was an
experiment. So pick that patch up again, this time recording the
average duration whenever a task voluntarily switches out.
The threshold for a short duration is sysctl_sched_min_granularity / 8,
so it can be tuned by the user. By default, the threshold is 375 us.
The reason for reusing sysctl_sched_min_granularity is that it reflects
how long the user would like a task to run, so the criterion for a
short task is naturally tied to it.
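As an arithmetic sketch (assuming the common 3 ms default of
sysctl_sched_min_granularity on SMP systems, i.e. the 0.75 ms base
value scaled up by CPU count): 3 ms / 8 = 375 us. An administrator who
raises the granularity to 8 ms would implicitly raise the short-task
threshold to 8 ms / 8 = 1 ms.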
Josh is not in favor of tying the threshold to
sysctl_sched_min_granularity; ideally there would be a dedicated
parameter for the threshold, but that would introduce complexity for
both maintenance and the user.

Introduce SIS_SHORT to enable the short-duration check.
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
include/linux/sched.h | 3 +++
kernel/sched/core.c | 2 ++
kernel/sched/debug.c | 1 +
kernel/sched/fair.c | 23 +++++++++++++++++++++++
kernel/sched/features.h | 1 +
5 files changed, 30 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffb6eb55cd13..26f4768e63f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -557,6 +557,9 @@ struct sched_entity {
u64 prev_sum_exec_runtime;
u64 nr_migrations;
+ u64 prev_sum_exec_runtime_vol;
+ /* average duration of a task */
+ u64 dur_avg;
#ifdef CONFIG_FAIR_GROUP_SCHED
int depth;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index daff72f00385..c5202f1be3f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4348,6 +4348,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
p->se.vruntime = 0;
+ p->se.dur_avg = 0;
+ p->se.prev_sum_exec_runtime_vol = 0;
INIT_LIST_HEAD(&p->se.group_node);
#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..8d64fba16cfe 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1024,6 +1024,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
__PS("nr_involuntary_switches", p->nivcsw);
P(se.load.weight);
+ P(se.dur_avg);
#ifdef CONFIG_SMP
P(se.avg.load_sum);
P(se.avg.runnable_sum);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..abdb7a442052 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4824,6 +4824,16 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
+/*
+ * If a task switches in and then voluntarily relinquishes the
+ * CPU quickly, it is regarded as a short duration task.
+ */
+static inline int is_short_task(struct task_struct *p)
+{
+ return sched_feat(SIS_SHORT) && p->se.dur_avg &&
+ ((p->se.dur_avg * 8) <= sysctl_sched_min_granularity);
+}
+
/*
* Pick the next process, keeping these things in mind, in this order:
* 1) keep things fair between processes/task groups
@@ -5995,6 +6005,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
static void set_next_buddy(struct sched_entity *se);
+static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
+{
+ u64 dur;
+
+ if (!task_sleep)
+ return;
+
+ dur = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime_vol;
+ p->se.prev_sum_exec_runtime_vol = p->se.sum_exec_runtime;
+ update_avg(&p->se.dur_avg, dur);
+}
+
/*
* The dequeue_task method is called before nr_running is
* decreased. We remove the task from the rbtree and
@@ -6067,6 +6089,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
dequeue_throttle:
util_est_update(&rq->cfs, p, task_sleep);
+ dur_avg_update(p, task_sleep);
hrtick_update(rq);
}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..efdc29c42161 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_SHORT, true)
/*
* Issue a WARN when we do multiple update_rq_clock() calls
--
2.25.1
On 16/12/2022 07:11, Chen Yu wrote:

[...]

> @@ -5995,6 +6005,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
>  static void set_next_buddy(struct sched_entity *se);
>
> +static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
> +{
> +	u64 dur;
> +
> +	if (!task_sleep)
> +		return;
> +
> +	dur = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime_vol;
> +	p->se.prev_sum_exec_runtime_vol = p->se.sum_exec_runtime;

Shouldn't se->prev_sum_exec_runtime_vol be set in enqueue_task_fair()
and not in dequeue_task_fair()->dur_avg_update()? Otherwise `dur` will
contain sleep time.

Like we do for se->prev_sum_exec_runtime in set_next_entity() but for
one `set_next_entity()-put_prev_entity()` run section.

AFAICS, you want to measure the exec_runtime sum over all run sections
between enqueue and dequeue.

> +	update_avg(&p->se.dur_avg, dur);
> +}
> +

[...]
On Thu, Jan 05, 2023 at 12:33:16PM +0100, Dietmar Eggemann wrote:
> On 16/12/2022 07:11, Chen Yu wrote:
>
> [...]
>
> > @@ -5995,6 +6005,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >
> >  static void set_next_buddy(struct sched_entity *se);
> >
> > +static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
> > +{
> > +	u64 dur;
> > +
> > +	if (!task_sleep)
> > +		return;
> > +
> > +	dur = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime_vol;
> > +	p->se.prev_sum_exec_runtime_vol = p->se.sum_exec_runtime;
>
> Shouldn't se->prev_sum_exec_runtime_vol be set in enqueue_task_fair()
> and not in dequeue_task_fair()->dur_avg_update()? Otherwise `dur` will
> contain sleep time.
>
> Like we do for se->prev_sum_exec_runtime in set_next_entity() but for
> one `set_next_entity()-put_prev_entity()` run section.
>
> AFAICS, you want to measure the exec_runtime sum over all run sections
> between enqueue and dequeue.

You were thinking of the dynamic PELT window size thread? (which is what
I had to think of when I looked at this).

I think we can still do that with this prev_sum_exec_runtime_vol (can't
say I love the name though). At any point (assuming we update
sum_exec_runtime) sum_exec_runtime - prev_sum_exec_runtime_vol is the
duration of the current activation.
Hi Peter,
On 2023-01-16 at 11:33:26 +0100, Peter Zijlstra wrote:
> On Thu, Jan 05, 2023 at 12:33:16PM +0100, Dietmar Eggemann wrote:
> > On 16/12/2022 07:11, Chen Yu wrote:
> >
> > [...]
> >
> > > @@ -5995,6 +6005,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >
> > >  static void set_next_buddy(struct sched_entity *se);
> > >
> > > +static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
> > > +{
> > > +	u64 dur;
> > > +
> > > +	if (!task_sleep)
> > > +		return;
> > > +
> > > +	dur = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime_vol;
> > > +	p->se.prev_sum_exec_runtime_vol = p->se.sum_exec_runtime;
> >
> > Shouldn't se->prev_sum_exec_runtime_vol be set in enqueue_task_fair()
> > and not in dequeue_task_fair()->dur_avg_update()? Otherwise `dur` will
> > contain sleep time.
> >
> > Like we do for se->prev_sum_exec_runtime in set_next_entity() but for
> > one `set_next_entity()-put_prev_entity()` run section.
> >
> > AFAICS, you want to measure the exec_runtime sum over all run sections
> > between enqueue and dequeue.
>
> You were thinking of the dynamic PELT window size thread? (which is what
> I had to think of when I looked at this).
>
> I think we can still do that with this prev_sum_exec_runtime_vol (can't
> say I love the name though).

I agree that this name is not accurate, maybe prev_sleep_sum_exec_runtime?
I'm open to any other name for this : )

Currently I'm checking Prateek's data on Zen3 and Yicong's data on Arm64,
and their data suggested that inhibiting the spreading of short wakees is
not always a good idea on a system with a small LLC. Meanwhile, according
to my test on a system with a large number of CPUs in one LLC, short
duration wakees become trouble makers if spread onto different CPUs,
which could trigger unexpected race conditions.

I'm thinking of taking nr_llc_cpu into consideration when defining a
short duration task, and doing some experiments on this.

thanks,
Chenyu
On 16/01/2023 10:33, Peter Zijlstra wrote:
> On Thu, Jan 05, 2023 at 12:33:16PM +0100, Dietmar Eggemann wrote:
>> On 16/12/2022 07:11, Chen Yu wrote:

[...]

> You were thinking of the dynamic PELT window size thread? (which is what
> I had to think of when I looked at this).

Yes, indeed.

> I think we can still do that with this prev_sum_exec_runtime_vol (can't
> say I love the name though). At any point (assuming we update
> sum_exec_runtime) sum_exec_runtime - prev_sum_exec_runtime_vol is the
> duration of the current activation.

I ran Jankbench with your UTIL_EST_FASTER patch and:

  runtime = curr->se.sum_exec_runtime - curr->se.prev_sum_exec_runtime_vol

plus:

  runtime >>= 10

before doing:

  util_est_fast = faster_est_approx(runtime * 2)
                                            ^^^ (boost)

on a Pixel6 and the results look promising:

Max frame duration (ms)

+-------------------+-----------+------------+
|      wa_path      | iteration |   value    |
+-------------------+-----------+------------+
| base              |        10 | 147.571352 |
| pelt-hl-m2        |        10 | 119.416351 |
| pelt-hl-m4        |        10 |  96.473412 |
| util_est_faster   |        10 |  84.834999 |
+-------------------+-----------+------------+

Mean frame duration (average case)

+---------------+-----------------+-------+-----------+
|   variable    |     kernel      | value | perc_diff |
+---------------+-----------------+-------+-----------+
| mean_duration | base            |  14.7 | 0.0%      |
| mean_duration | pelt-hl-m2      |  13.6 | -7.5%     |
| mean_duration | pelt-hl-m4      |  13.0 | -11.68%   |
| mean_duration | util_est_faster |  12.6 | -14.01%   |
+---------------+-----------------+-------+-----------+

Jank percentage

+-----------+-----------------+-------+-----------+
| variable  |     kernel      | value | perc_diff |
+-----------+-----------------+-------+-----------+
| jank_perc | base            |   1.8 | 0.0%      |
| jank_perc | pelt-hl-m2      |   1.8 | -4.91%    |
| jank_perc | pelt-hl-m4      |   1.2 | -36.61%   |
| jank_perc | util_est_faster |   0.8 | -57.8%    |
+-----------+-----------------+-------+-----------+

Power usage [mW]

+-------------+-----------------+-------+-----------+
|  chan_name  |     kernel      | value | perc_diff |
+-------------+-----------------+-------+-----------+
| total_power | base            | 144.4 | 0.0%      |
| total_power | pelt-hl-m2      | 141.6 | -1.97%    |
| total_power | pelt-hl-m4      | 163.2 | 12.99%    |
| total_power | util_est_faster | 150.9 | 4.45%     |
+-------------+-----------------+-------+-----------+

At first glance it looks promising! Have to do more testing to
understand the behaviour fully.
Hi Dietmar,
thanks for reviewing the patch!
On 2023-01-05 at 12:33:16 +0100, Dietmar Eggemann wrote:
> On 16/12/2022 07:11, Chen Yu wrote:
>
> [...]
>
> > @@ -5995,6 +6005,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >
> >  static void set_next_buddy(struct sched_entity *se);
> >
> > +static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
> > +{
> > +	u64 dur;
> > +
> > +	if (!task_sleep)
> > +		return;
> > +
> > +	dur = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime_vol;
> > +	p->se.prev_sum_exec_runtime_vol = p->se.sum_exec_runtime;
>
> Shouldn't se->prev_sum_exec_runtime_vol be set in enqueue_task_fair()
> and not in dequeue_task_fair()->dur_avg_update()? Otherwise `dur` will
> contain sleep time.
>
After the task p is dequeued, p's sum_exec_runtime will not be increased.
Unless task p is switched in again, p's sum_exec_runtime will continue to
increase. So dur should not include the sleep time, because we subtract
between the sum_exec_runtime rather than rq->clock_task. Not sure if I understand
this correctly?

My original thought was that, record the average run time of every section:
Only consider that task voluntarily relinquishes the CPU.
For example, suppose on CPU1, task p1 and p2 run alternatively:

--------------------> time

| p1 runs 1ms | p2 preempt p1 | p1 switch in, runs 0.5ms and blocks |
^             ^               ^
|_____________|               |_____________________________________|
                                                                    ^
                                                                    |
                                                                p1 dequeued

p1's duration in one section is (1 + 0.5)ms. Because if p2 does not
preempt p1, p1 can run 1.5ms. This reflects the nature of a task,
how long it wishes to run at most.

> Like we do for se->prev_sum_exec_runtime in set_next_entity() but for
> one `set_next_entity()-put_prev_entity()` run section.
>
> AFAICS, you want to measure the exec_runtime sum over all run sections
> between enqueue and dequeue.
Yes, we tried to record the 'decayed' average exec_runtime for each section.
Say, task p runs for a ms, then p is dequeued and blocks for b ms, and then
runs for c ms, its average duration is 0.875 * a + 0.125 * c, which is
what update_avg() does.

thanks,
Chenyu
On 06/01/2023 09:34, Chen Yu wrote:
> Hi Dietmar,
> thanks for reviewing the patch!
> On 2023-01-05 at 12:33:16 +0100, Dietmar Eggemann wrote:
>> On 16/12/2022 07:11, Chen Yu wrote:
>>
>> [...]
>>
>>> @@ -5995,6 +6005,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>
>>>  static void set_next_buddy(struct sched_entity *se);
>>>
>>> +static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
>>> +{
>>> +	u64 dur;
>>> +
>>> +	if (!task_sleep)
>>> +		return;
>>> +
>>> +	dur = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime_vol;
>>> +	p->se.prev_sum_exec_runtime_vol = p->se.sum_exec_runtime;
>>
>> Shouldn't se->prev_sum_exec_runtime_vol be set in enqueue_task_fair()
>> and not in dequeue_task_fair()->dur_avg_update()? Otherwise `dur` will
>> contain sleep time.
>>
> After the task p is dequeued, p's sum_exec_runtime will not be increased.

True.

> Unless task p is switched in again, p's sum_exec_runtime will continue to
> increase. So dur should not include the sleep time, because we subtract

Not sure I get this sentence? p's se->sum_exec_runtime will only
increase if p is current, so running?

> between the sum_exec_runtime rather than rq->clock_task. Not sure if I understand
> this correctly?

No, you're right. We're not dealing with time snapshots but rather with
sum_exec_runtime snapshots. So the value will not change between dequeue
and the next enqueue.

e ... enqueue_task_fair()
d ... dequeue_task_fair()
s ... set_next_entity()
p ... put_prev_entity()
u ... update_curr_fair()->update_curr()

p1:

---|---||--|--|---|--|--||---
   d   es  u  p   s  u  pd

       ^                 ^
       |                 |
      (A)               (B)

Same se->prev_sum_exec_runtime_vol value in (A) and (B).

> My original thought was that, record the average run time of every section:
> Only consider that task voluntarily relinquishes the CPU.
> For example, suppose on CPU1, task p1 and p2 run alternatively:
>
> --------------------> time
>
> | p1 runs 1ms | p2 preempt p1 | p1 switch in, runs 0.5ms and blocks |
> ^             ^               ^
> |_____________|               |_____________________________________|
>                                                                     ^
>                                                                     |
>                                                                 p1 dequeued
>
> p1's duration in one section is (1 + 0.5)ms. Because if p2 does not
> preempt p1, p1 can run 1.5ms. This reflects the nature of a task,
> how long it wishes to run at most.
>
>> Like we do for se->prev_sum_exec_runtime in set_next_entity() but for
>> one `set_next_entity()-put_prev_entity()` run section.
>>
>> AFAICS, you want to measure the exec_runtime sum over all run sections
>> between enqueue and dequeue.
> Yes, we tried to record the 'decayed' average exec_runtime for each section.
> Say, task p runs for a ms, then p is dequeued and blocks for b ms, and then
> runs for c ms, its average duration is 0.875 * a + 0.125 * c, which is
> what update_avg() does.

OK.
On 2023-01-06 at 12:28:26 +0100, Dietmar Eggemann wrote:
> On 06/01/2023 09:34, Chen Yu wrote:
> > Hi Dietmar,
> > thanks for reviewing the patch!
> > On 2023-01-05 at 12:33:16 +0100, Dietmar Eggemann wrote:
> >> On 16/12/2022 07:11, Chen Yu wrote:
> >>
> >> [...]
> >>
> >>> @@ -5995,6 +6005,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>>
> >>>  static void set_next_buddy(struct sched_entity *se);
> >>>
> >>> +static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
> >>> +{
> >>> +	u64 dur;
> >>> +
> >>> +	if (!task_sleep)
> >>> +		return;
> >>> +
> >>> +	dur = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime_vol;
> >>> +	p->se.prev_sum_exec_runtime_vol = p->se.sum_exec_runtime;
> >>
> >> Shouldn't se->prev_sum_exec_runtime_vol be set in enqueue_task_fair()
> >> and not in dequeue_task_fair()->dur_avg_update()? Otherwise `dur` will
> >> contain sleep time.
> >>
> > After the task p is dequeued, p's sum_exec_runtime will not be increased.
>
> True.
>
> > Unless task p is switched in again, p's sum_exec_runtime will continue to
> > increase. So dur should not include the sleep time, because we subtract
>
> Not sure I get this sentence? p's se->sum_exec_runtime will only
> increase if p is current, so running?
>
Yes, it was a typo, should be "will not continue to increase".
> > between the sum_exec_runtime rather than rq->clock_task. Not sure if I understand
> > this correctly?
>
> No, you're right. We're not dealing with time snapshots but rather with
> sum_exec_runtime snapshots. So the value will not change between dequeue
> and the next enqueue.
>
> e ... enqueue_task_fair()
> d ... dequeue_task_fair()
> s ... set_next_entity()
> p ... put_prev_entity()
> u ... update_curr_fair()->update_curr()
>
> p1:
>
> ---|---||--|--|---|--|--||---
>    d   es  u  p   s  u  pd
>
>        ^                 ^
>        |                 |
>       (A)               (B)
>
> Same se->prev_sum_exec_runtime_vol value in (A) and (B).
>
Yes.
> > My original thought was that, record the average run time of every section:
> > Only consider that task voluntarily relinquishes the CPU.
> > For example, suppose on CPU1, task p1 and p2 run alternatively:
> >
> > --------------------> time
> >
> > | p1 runs 1ms | p2 preempt p1 | p1 switch in, runs 0.5ms and blocks |
> > ^             ^               ^
> > |_____________|               |_____________________________________|
> >                                                                     ^
> >                                                                     |
> >                                                                 p1 dequeued
> >
> > p1's duration in one section is (1 + 0.5)ms. Because if p2 does not
> > preempt p1, p1 can run 1.5ms. This reflects the nature of a task,
> > how long it wishes to run at most.
> >
> >> Like we do for se->prev_sum_exec_runtime in set_next_entity() but for
> >> one `set_next_entity()-put_prev_entity()` run section.
> >>
> >> AFAICS, you want to measure the exec_runtime sum over all run sections
> >> between enqueue and dequeue.
> > Yes, we tried to record the 'decayed' average exec_runtime for each section.
> > Say, task p runs for a ms, then p is dequeued and blocks for b ms, and then
> > runs for c ms, its average duration is 0.875 * a + 0.125 * c, which is
> > what update_avg() does.
>
> OK.
>
I'll add more descriptions in next version to avoid confusion.

thanks,
Chenyu
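For reference, the update_avg() semantics discussed in the thread above
amount to an exponentially weighted moving average with a 1/8 weight; a
minimal sketch matching the helper in kernel/sched/fair.c:

	static inline void update_avg(u64 *avg, u64 sample)
	{
		s64 diff = sample - *avg;	/* signed: sample can be below the average */

		*avg += diff / 8;		/* new = 7/8 * old + 1/8 * sample */
	}

Starting from dur_avg = a, one more section of c ms yields
0.875 * a + 0.125 * c, matching the 'decayed' average described above.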