[v4] sched/fair: Choose the CPU where short task is running during wake up

[RFC PATCH v4 1/2] sched/fair: Introduce short duration task check

Posted by Chen Yu 2 years, 9 months ago

Introduce short-duration task checks, as there is a requirement
to leverage this attribute for better task placement.

There are several choices of metrics that could be used to
indicate if a task is a short-duration task.

At first thought the (p->se.sum_exec_runtime / p->nvcsw)
can be used to measure the task duration. However, the
history long past was factored too heavily in such a formula.
Ideally, the old activity should decay and not affect
the current status too much.

Although something based on PELT can be used, se.util_avg might
not be appropriate to describe the task duration:
1. Task p1 and task p2 are doing frequent ping-pong scheduling on
   one CPU, both p1 and p2 have a short duration, but the util_avg
   can be up to 50%.
2. Suppose a task lasting less than 4ms is regarded as a short task.
   If task p3 runs for 6 ms and sleeps for 32 ms, p3 should not be a
   short-duration task. However, PELT would decay p3's accumulated
   running time from 6 ms to 3 ms, because 32 ms is the half-life
   in PELT. As a result, p3 would be incorrectly treated as a short
   task.

It was found that there was once a similar feature to track the
duration of a task, which is in Commit ad4b78bbcbab ("sched: Add
new wakeup preemption mode: WAKEUP_RUNNING"). Unfortunately, it
was reverted because it was an experiment. So pick the patch up
again, by recording the average duration when a task voluntarily
switches out.

The threshold of short duration is sysctl_sched_min_granularity / 8,
so it can be tuned by the user. By default, the threshold is 375 us.
The reason to reuse sysctl_sched_min_granularity is that it reflects
how long the user would like the task to run. So the criteria of a
short task have a connection to it.

Josh is not in favor of tying the threshold to
sysctl_sched_min_granularity, ideally there should be a dedicated
parameter for the threshold, but that introduces complexity for
maintenance and the user.

Introduce SIS_SHORT to enable the short duration check.

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched.h   |  3 +++
 kernel/sched/core.c     |  2 ++
 kernel/sched/debug.c    |  1 +
 kernel/sched/fair.c     | 23 +++++++++++++++++++++++
 kernel/sched/features.h |  1 +
 5 files changed, 30 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffb6eb55cd13..26f4768e63f3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -557,6 +557,9 @@ struct sched_entity {
 	u64				prev_sum_exec_runtime;
 
 	u64				nr_migrations;
+	u64				prev_sum_exec_runtime_vol;
+	/* average duration of a task */
+	u64				dur_avg;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int				depth;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index daff72f00385..c5202f1be3f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4348,6 +4348,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.dur_avg			= 0;
+	p->se.prev_sum_exec_runtime_vol	= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..8d64fba16cfe 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1024,6 +1024,7 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
 	__PS("nr_involuntary_switches", p->nivcsw);
 
 	P(se.load.weight);
+	P(se.dur_avg);
 #ifdef CONFIG_SMP
 	P(se.avg.load_sum);
 	P(se.avg.runnable_sum);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..abdb7a442052 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4824,6 +4824,16 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static int
 wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
 
+/*
+ * If a task switches in and then voluntarily relinquishes the
+ * CPU quickly, it is regarded as a short duration task.
+ */
+static inline int is_short_task(struct task_struct *p)
+{
+	return sched_feat(SIS_SHORT) && p->se.dur_avg &&
+		((p->se.dur_avg * 8) <= sysctl_sched_min_granularity);
+}
+
 /*
  * Pick the next process, keeping these things in mind, in this order:
  * 1) keep things fair between processes/task groups
@@ -5995,6 +6005,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 static void set_next_buddy(struct sched_entity *se);
 
+static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
+{
+	u64 dur;
+
+	if (!task_sleep)
+		return;
+
+	dur = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime_vol;
+	p->se.prev_sum_exec_runtime_vol = p->se.sum_exec_runtime;
+	update_avg(&p->se.dur_avg, dur);
+}
+
 /*
  * The dequeue_task method is called before nr_running is
  * decreased. We remove the task from the rbtree and
@@ -6067,6 +6089,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
+	dur_avg_update(p, task_sleep);
 	hrtick_update(rq);
 }
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..efdc29c42161 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_SHORT, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.25.1

Re: [RFC PATCH v4 1/2] sched/fair: Introduce short duration task check

Posted by Dietmar Eggemann 2 years, 8 months ago

On 16/12/2022 07:11, Chen Yu wrote:

[...]

> @@ -5995,6 +6005,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  
>  static void set_next_buddy(struct sched_entity *se);
>  
> +static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
> +{
> +	u64 dur;
> +
> +	if (!task_sleep)
> +		return;
> +
> +	dur = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime_vol;
> +	p->se.prev_sum_exec_runtime_vol = p->se.sum_exec_runtime;

Shouldn't se->prev_sum_exec_runtime_vol be set in enqueue_task_fair()
and not in dequeue_task_fair()->dur_avg_update()? Otherwise `dur` will
contain sleep time.

Like we do for se->prev_sum_exec_runtime in set_next_entity() but for
one `set_next_entity()-put_prev_entity()` run section.

AFAICS, you want to measure the exec_runtime sum over all run sections
between enqueue and dequeue.

> +	update_avg(&p->se.dur_avg, dur);
> +}
> +

[...]

Re: [RFC PATCH v4 1/2] sched/fair: Introduce short duration task check

Posted by Peter Zijlstra 2 years, 8 months ago

On Thu, Jan 05, 2023 at 12:33:16PM +0100, Dietmar Eggemann wrote:
> On 16/12/2022 07:11, Chen Yu wrote:
> 
> [...]
> 
> > @@ -5995,6 +6005,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  
> >  static void set_next_buddy(struct sched_entity *se);
> >  
> > +static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
> > +{
> > +	u64 dur;
> > +
> > +	if (!task_sleep)
> > +		return;
> > +
> > +	dur = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime_vol;
> > +	p->se.prev_sum_exec_runtime_vol = p->se.sum_exec_runtime;
> 
> Shouldn't se->prev_sum_exec_runtime_vol be set in enqueue_task_fair()
> and not in dequeue_task_fair()->dur_avg_update()? Otherwise `dur` will
> contain sleep time.
> 
> Like we do for se->prev_sum_exec_runtime in set_next_entity() but for
> one `set_next_entity()-put_prev_entity()` run section.
> 
> AFAICS, you want to measure the exec_runtime sum over all run sections
> between enqueue and dequeue.

You were thinking of the dynamic PELT window size thread? (which is what
I had to think of when I looked at this).

I think we can still do that with this prev_sum_exec_runtime_vol (can't
say I love the name though). At any point (assuming we update
sum_exec_runtime) sum_exec_runtime - prev_sum_exec_runtime_vol is the
duration of the current activation.

Re: [RFC PATCH v4 1/2] sched/fair: Introduce short duration task check

Posted by Chen Yu 2 years, 8 months ago

Hi Peter,
On 2023-01-16 at 11:33:26 +0100, Peter Zijlstra wrote:
> On Thu, Jan 05, 2023 at 12:33:16PM +0100, Dietmar Eggemann wrote:
> > On 16/12/2022 07:11, Chen Yu wrote:
> > 
> > [...]
> > 
> > > @@ -5995,6 +6005,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >  
> > >  static void set_next_buddy(struct sched_entity *se);
> > >  
> > > +static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
> > > +{
> > > +	u64 dur;
> > > +
> > > +	if (!task_sleep)
> > > +		return;
> > > +
> > > +	dur = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime_vol;
> > > +	p->se.prev_sum_exec_runtime_vol = p->se.sum_exec_runtime;
> > 
> > Shouldn't se->prev_sum_exec_runtime_vol be set in enqueue_task_fair()
> > and not in dequeue_task_fair()->dur_avg_update()? Otherwise `dur` will
> > contain sleep time.
> > 
> > Like we do for se->prev_sum_exec_runtime in set_next_entity() but for
> > one `set_next_entity()-put_prev_entity()` run section.
> > 
> > AFAICS, you want to measure the exec_runtime sum over all run sections
> > between enqueue and dequeue.
> 
> You were thinking of the dynamic PELT window size thread? (which is what
> I had to think of when I looked at this).
> 
> I think we can still do that with this prev_sum_exec_runtime_vol (can't
> say I love the name though).
I agree that this name is not accurate, maybe prev_sleep_sum_exec_runtime?
I'm open to any other name for this : )

Currently I'm checking Prateek's data on Zen3 and Yicong's data on Arm64,
and their data suggested that: inhibiting the spreading of short wakee is not
always a good idea on a system with small LLC. Meanwhile, according to my
test on a system with large number of CPUs in 1 LLC, short duration wakee become
a trouble maker if spreading them on different CPUs, which could trigger unexpected
race condition. I'm thinking of taking nr_llc_cpu into consideration when defining
a short duration task, and do some experiment on this.

thanks,
Chenyu

Re: [RFC PATCH v4 1/2] sched/fair: Introduce short duration task check

Posted by Dietmar Eggemann 2 years, 8 months ago

On 16/01/2023 10:33, Peter Zijlstra wrote:
> On Thu, Jan 05, 2023 at 12:33:16PM +0100, Dietmar Eggemann wrote:
>> On 16/12/2022 07:11, Chen Yu wrote:

[...]

> You were thinking of the dynamic PELT window size thread? (which is what
> I had to think of when I looked at this).

Yes, indeed.

> I think we can still do that with this prev_sum_exec_runtime_vol (can't
> say I love the name though). At any point (assuming we update
> sum_exec_runtime) sum_exec_runtime - prev_sum_exec_runtime_vol is the
> duration of the current activation.

I ran Jankbench with your UTIL_EST_FASTER patch and:

    runtime = curr->se.sum_exec_runtime -
              curr->se.prev_sum_exec_runtime_vol

plus:

    runtime >>= 10

before doing:

    util_est_fast = faster_est_approx(runtime * 2)
                                              ^^^ (boost)

on a Pixel6 and the results look promising:

Max frame duration (ms)

+-------------------+-----------+------------+
|      wa_path      | iteration |   value    |
+-------------------+-----------+------------+
|        base       |    10     | 147.571352 |
|    pelt-hl-m2     |    10     | 119.416351 |
|    pelt-hl-m4     |    10     | 96.473412  |
|  util_est_faster  |    10     | 84.834999  |
+-------------------+-----------+------------+

Mean frame duration (average case)

+---------------+-------------------+-------+-----------+
|   variable    |      kernel       | value | perc_diff |
+---------------+-------------------+-------+-----------+
| mean_duration |        base       | 14.7  |   0.0%    |
| mean_duration |    pelt-hl-m2     | 13.6  |   -7.5%   |
| mean_duration |    pelt-hl-m4     | 13.0  |  -11.68%  |
| mean_duration |  util_est_faster  | 12.6  |  -14.01%  |
+---------------+-------------------+-------+-----------+

Jank percentage

+-----------+-------------------+-------+-----------+
| variable  |      kernel       | value | perc_diff |
+-----------+-------------------+-------+-----------+
| jank_perc |        base       |  1.8  |   0.0%    |
| jank_perc |    pelt-hl-m2     |  1.8  |  -4.91%   |
| jank_perc |    pelt-hl-m4     |  1.2  |  -36.61%  |
| jank_perc |  util_est_faster  |  0.8  |  -57.8%   |
+-----------+-------------------+-------+-----------+

Power usage [mW]

+--------------+-------------------+-------+-----------+
|  chan_name   |      kernel       | value | perc_diff |
+--------------+-------------------+-------+-----------+
| total_power  |        base       | 144.4 |   0.0%    |
| total_power  |    pelt-hl-m2     | 141.6 |  -1.97%   |
| total_power  |    pelt-hl-m4     | 163.2 |  12.99%   |
| total_power  |  util_est_faster  | 150.9 |   4.45%   |
+--------------+-------------------+-------+-----------+

At first glance it looks promising! Have to do more testing to
understand the behaviour fully.

Re: [RFC PATCH v4 1/2] sched/fair: Introduce short duration task check

Posted by Chen Yu 2 years, 8 months ago

Hi Dietmar,
thanks for reviewing the patch!
On 2023-01-05 at 12:33:16 +0100, Dietmar Eggemann wrote:
> On 16/12/2022 07:11, Chen Yu wrote:
> 
> [...]
> 
> > @@ -5995,6 +6005,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  
> >  static void set_next_buddy(struct sched_entity *se);
> >  
> > +static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
> > +{
> > +	u64 dur;
> > +
> > +	if (!task_sleep)
> > +		return;
> > +
> > +	dur = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime_vol;
> > +	p->se.prev_sum_exec_runtime_vol = p->se.sum_exec_runtime;
> 
> Shouldn't se->prev_sum_exec_runtime_vol be set in enqueue_task_fair()
> and not in dequeue_task_fair()->dur_avg_update()? Otherwise `dur` will
> contain sleep time.
> 
After the task p is dequeued, p's sum_exec_runtime will not be increased.
Unless task p is switched in again, p's sum_exec_runtime will continue to
increase. So dur should not include the sleep time, because we substract
between the sum_exec_runtime rather than rq->clock_task. Not sure if I understand
this correctly?

My original thought was that, record the average run time of every section:
Only consider that task voluntarily relinquishes the CPU.
For example, suppose on CPU1, task p1 and p2 run alternatively:

 --------------------> time

 | p1 runs 1ms | p2 preempt p1 | p1 switch in, runs 0.5ms and blocks |
               ^               ^                                     ^
 |_____________|               |_____________________________________|
                                                                     ^
                                                                     |
                                                                  p1 dequeued

p1's duration in one section is (1 + 0.5)ms. Because if p2 does not
preempt p1, p1 can run 1.5ms. This reflects the nature of a task,
how long it wishes to run at most.

> Like we do for se->prev_sum_exec_runtime in set_next_entity() but for
> one `set_next_entity()-put_prev_entity()` run section.
> 
> AFAICS, you want to measure the exec_runtime sum over all run sections
> between enqueue and dequeue.
Yes, we tried to record the 'decayed' average exec_runtime for each section.
Say, task p runs for a ms , then p is dequeued and blocks for b ms, and then
runs for c ms, its average duration is 0.875 * a + 0.125 * c , which is
what update_avg() does.

thanks,
Chenyu

Re: [RFC PATCH v4 1/2] sched/fair: Introduce short duration task check

Posted by Dietmar Eggemann 2 years, 8 months ago

On 06/01/2023 09:34, Chen Yu wrote:
> Hi Dietmar,
> thanks for reviewing the patch!
> On 2023-01-05 at 12:33:16 +0100, Dietmar Eggemann wrote:
>> On 16/12/2022 07:11, Chen Yu wrote:
>>
>> [...]
>>
>>> @@ -5995,6 +6005,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>  
>>>  static void set_next_buddy(struct sched_entity *se);
>>>  
>>> +static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
>>> +{
>>> +	u64 dur;
>>> +
>>> +	if (!task_sleep)
>>> +		return;
>>> +
>>> +	dur = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime_vol;
>>> +	p->se.prev_sum_exec_runtime_vol = p->se.sum_exec_runtime;
>>
>> Shouldn't se->prev_sum_exec_runtime_vol be set in enqueue_task_fair()
>> and not in dequeue_task_fair()->dur_avg_update()? Otherwise `dur` will
>> contain sleep time.
>>
> After the task p is dequeued, p's sum_exec_runtime will not be increased.

True.

> Unless task p is switched in again, p's sum_exec_runtime will continue to
> increase. So dur should not include the sleep time, because we substract

Not sure I get this sentence? p's se->sum_exec_runtime will only
increase if p is current, so running?

> between the sum_exec_runtime rather than rq->clock_task. Not sure if I understand
> this correctly?

No, you're right. We're not dealing with time snapshots but rather with
sum_exec_runtime snapshots. So the value will not change between dequeue
and the next enqueue.

e ... enqueue_task_fair()
d ... dequeue_task_fair()
s ... set_next_entity()
p ... put_prev_entity()
u ... update_curr_fair()->update_curr()

p1:

---|---||--|--|---|--|--||---
   d   es  u  p   s  u  pd

   ^   ^
   |   |
  (A) (B)

Same se->prev_sum_exec_runtime_vol value in (A) and (B).

> My original thought was that, record the average run time of every section:
> Only consider that task voluntarily relinquishes the CPU.
> For example, suppose on CPU1, task p1 and p2 run alternatively:
> 
>  --------------------> time
> 
>  | p1 runs 1ms | p2 preempt p1 | p1 switch in, runs 0.5ms and blocks |
>                ^               ^                                     ^
>  |_____________|               |_____________________________________|
>                                                                      ^
>                                                                      |
>                                                                   p1 dequeued
> 
> p1's duration in one section is (1 + 0.5)ms. Because if p2 does not
> preempt p1, p1 can run 1.5ms. This reflects the nature of a task,
> how long it wishes to run at most.
> 
>> Like we do for se->prev_sum_exec_runtime in set_next_entity() but for
>> one `set_next_entity()-put_prev_entity()` run section.
>>
>> AFAICS, you want to measure the exec_runtime sum over all run sections
>> between enqueue and dequeue.
> Yes, we tried to record the 'decayed' average exec_runtime for each section.
> Say, task p runs for a ms , then p is dequeued and blocks for b ms, and then
> runs for c ms, its average duration is 0.875 * a + 0.125 * c , which is
> what update_avg() does.

OK.

Re: [RFC PATCH v4 1/2] sched/fair: Introduce short duration task check

Posted by Chen Yu 2 years, 8 months ago

On 2023-01-06 at 12:28:26 +0100, Dietmar Eggemann wrote:
> On 06/01/2023 09:34, Chen Yu wrote:
> > Hi Dietmar,
> > thanks for reviewing the patch!
> > On 2023-01-05 at 12:33:16 +0100, Dietmar Eggemann wrote:
> >> On 16/12/2022 07:11, Chen Yu wrote:
> >>
> >> [...]
> >>
> >>> @@ -5995,6 +6005,18 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>>  
> >>>  static void set_next_buddy(struct sched_entity *se);
> >>>  
> >>> +static inline void dur_avg_update(struct task_struct *p, bool task_sleep)
> >>> +{
> >>> +	u64 dur;
> >>> +
> >>> +	if (!task_sleep)
> >>> +		return;
> >>> +
> >>> +	dur = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime_vol;
> >>> +	p->se.prev_sum_exec_runtime_vol = p->se.sum_exec_runtime;
> >>
> >> Shouldn't se->prev_sum_exec_runtime_vol be set in enqueue_task_fair()
> >> and not in dequeue_task_fair()->dur_avg_update()? Otherwise `dur` will
> >> contain sleep time.
> >>
> > After the task p is dequeued, p's sum_exec_runtime will not be increased.
> 
> True.
> 
> > Unless task p is switched in again, p's sum_exec_runtime will continue to
> > increase. So dur should not include the sleep time, because we substract
> 
> Not sure I get this sentence? p's se->sum_exec_runtime will only
> increase if p is current, so running?
>
Yes, it was a typo, should be "will not continue to increase".
> > between the sum_exec_runtime rather than rq->clock_task. Not sure if I understand
> > this correctly?
> 
> No, you're right. We're not dealing with time snapshots but rather with
> sum_exec_runtime snapshots. So the value will not change between dequeue
> and the next enqueue.
> 
> e ... enqueue_task_fair()
> d ... dequeue_task_fair()
> s ... set_next_entity()
> p ... put_prev_entity()
> u ... update_curr_fair()->update_curr()
> 
> p1:
> 
> ---|---||--|--|---|--|--||---
>    d   es  u  p   s  u  pd
> 
>    ^   ^
>    |   |
>   (A) (B)
> 
> Same se->prev_sum_exec_runtime_vol value in (A) and (B).
> 
Yes.
> > My original thought was that, record the average run time of every section:
> > Only consider that task voluntarily relinquishes the CPU.
> > For example, suppose on CPU1, task p1 and p2 run alternatively:
> > 
> >  --------------------> time
> > 
> >  | p1 runs 1ms | p2 preempt p1 | p1 switch in, runs 0.5ms and blocks |
> >                ^               ^                                     ^
> >  |_____________|               |_____________________________________|
> >                                                                      ^
> >                                                                      |
> >                                                                   p1 dequeued
> > 
> > p1's duration in one section is (1 + 0.5)ms. Because if p2 does not
> > preempt p1, p1 can run 1.5ms. This reflects the nature of a task,
> > how long it wishes to run at most.
> > 
> >> Like we do for se->prev_sum_exec_runtime in set_next_entity() but for
> >> one `set_next_entity()-put_prev_entity()` run section.
> >>
> >> AFAICS, you want to measure the exec_runtime sum over all run sections
> >> between enqueue and dequeue.
> > Yes, we tried to record the 'decayed' average exec_runtime for each section.
> > Say, task p runs for a ms , then p is dequeued and blocks for b ms, and then
> > runs for c ms, its average duration is 0.875 * a + 0.125 * c , which is
> > what update_avg() does.
> 
> OK.
> 
I'll add more descriptions in next version to avoid confusing.

thanks,
Chenyu