[PATCH v3 1/2] sched/fair: Introduce short duration task check

Chen Yu posted 2 patches 2 years, 9 months ago
There is a newer version of this series
Posted by Chen Yu 2 years, 9 months ago
Introduce a short-duration task check, as there is a requirement
to leverage this attribute for better task placement.

There are several choices of metrics that could be used to
indicate if a task is a short-duration task.

At first glance, (p->se.sum_exec_runtime / p->nvcsw) could be
used to measure the task duration. However, such a formula weighs
long-past history too heavily. Ideally, old activity should decay
and not affect the current status too much.

Although something based on PELT could be used, se.util_avg might
not be appropriate to describe the task duration:
1. Task p1 and task p2 are doing frequent ping-pong scheduling on
   one CPU, both p1 and p2 have a short duration, but the util_avg
   can be up to 50%.
2. Suppose a task lasting less than 4ms is regarded as a short task.
   If task p3 runs for 6ms and sleeps for 32ms, p3 should not be a
   short-duration task. However, PELT would decay p3's accumulated
   running time from 6ms to 3ms, because 32ms is the half-life in PELT.
   As a result, p3 would be incorrectly treated as a short task.

There was once a similar feature to track the duration of a task,
in commit ad4b78bbcbab ("sched: Add new wakeup preemption mode:
WAKEUP_RUNNING"). Unfortunately, it was reverted because it was
experimental. So pick that patch up again, by recording the average
duration when a task voluntarily switches out. Introduce SIS_SHORT
to control this strategy.

The threshold of short duration reuses sysctl_sched_min_granularity,
so it can be tuned by the user. Ideally there should be a dedicated
parameter for the threshold, but that might introduce complexity.

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched.h   |  4 ++++
 kernel/sched/core.c     |  2 ++
 kernel/sched/fair.c     | 17 +++++++++++++++++
 kernel/sched/features.h |  1 +
 4 files changed, 24 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffb6eb55cd13..64b7acb77a11 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -558,6 +558,10 @@ struct sched_entity {
 
 	u64				nr_migrations;
 
+	u64				prev_sum_exec_runtime_vol;
+	/* average duration of a task */
+	u64				dur_avg;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	int				depth;
 	struct sched_entity		*parent;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index daff72f00385..c5202f1be3f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4348,6 +4348,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 	p->se.prev_sum_exec_runtime	= 0;
 	p->se.nr_migrations		= 0;
 	p->se.vruntime			= 0;
+	p->se.dur_avg			= 0;
+	p->se.prev_sum_exec_runtime_vol	= 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..a4b314b664f8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6200,6 +6200,16 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
+/*
+ * If a task switches in and then voluntarily relinquishes the
+ * CPU quickly, it is regarded as a short duration task.
+ */
+static inline int is_short_task(struct task_struct *p)
+{
+	return sched_feat(SIS_SHORT) &&
+		(p->se.dur_avg <= sysctl_sched_min_granularity);
+}
+
 /*
  * The purpose of wake_affine() is to quickly determine on which CPU we can run
  * soonest. For the purpose of speed we only consider the waking and previous
@@ -7680,6 +7690,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;
 
+	if (sched_feat(SIS_SHORT) && !prev->on_rq) {
+		u64 this_dur = se->sum_exec_runtime - se->prev_sum_exec_runtime_vol;
+
+		se->prev_sum_exec_runtime_vol = se->sum_exec_runtime;
+		update_avg(&se->dur_avg, this_dur);
+	}
+
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..efdc29c42161 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_SHORT, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
-- 
2.25.1
Re: [PATCH v3 1/2] sched/fair: Introduce short duration task check
Posted by Honglei Wang 2 years, 9 months ago

On 2022/12/1 16:44, Chen Yu wrote:
> [...]
> +/*
> + * If a task switches in and then voluntarily relinquishes the
> + * CPU quickly, it is regarded as a short duration task.
> + */
> +static inline int is_short_task(struct task_struct *p)
> +{
> +	return sched_feat(SIS_SHORT) &&
> +		(p->se.dur_avg <= sysctl_sched_min_granularity);
> +}
> +

Hi Yu,

I still have a bit of concern about the sysctl_sched_min_granularity
stuff. This knob can be set to different values, which will affect the
behavior of this patch and make things not entirely under control.

Not sure if we can add a new knob for this. The test results show good
improvement for short tasks, and with a dedicated knob, admins would be
able to tune the system based on their own view of a 'short task'.

Thanks,
Honglei

Re: [PATCH v3 1/2] sched/fair: Introduce short duration task check
Posted by Chen Yu 2 years, 9 months ago
Hi Honglei,
On 2022-12-02 at 15:44:18 +0800, Honglei Wang wrote:
> 
> 
> On 2022/12/1 16:44, Chen Yu wrote:
> > [...]
> 
> Hi Yu,
> 
> I still have a bit of concern about the sysctl_sched_min_granularity
> stuff. This knob can be set to different values, which will affect the
> behavior of this patch and make things not entirely under control.
> 
> Not sure if we can add a new knob for this. The test results show good
> improvement for short tasks, and with a dedicated knob, admins would be
> able to tune the system based on their own view of a 'short task'.
>
It would be ideal to have a dedicated parameter to tweak this. For example,
something under /sys/kernel/debug/sched/, initialized to
sysctl_sched_min_granularity by default.
Hi Peter, Vincent,
may I have your opinion on whether this is applicable?

thanks,
Chenyu
Re: [PATCH v3 1/2] sched/fair: Introduce short duration task check
Posted by Joel Fernandes 2 years, 9 months ago

> On Dec 3, 2022, at 2:50 AM, Chen Yu <yu.c.chen@intel.com> wrote:
> 
> Hi Honglei,
>> On 2022-12-02 at 15:44:18 +0800, Honglei Wang wrote:
>> 
>> 
>>> On 2022/12/1 16:44, Chen Yu wrote:
>>> [...]
>> 
>> Hi Yu,
>> 
>> I still have a bit of concern about the sysctl_sched_min_granularity
>> stuff. This knob can be set to different values, which will affect the
>> behavior of this patch and make things not entirely under control.

There are already ways to misconfigure sched sysctl to make bad/weird things happen.

>> Not sure if we can add a new knob for this. The test results show good
>> improvement for short tasks, and with a dedicated knob, admins would be
>> able to tune the system based on their own view of a 'short task'.
>> 
> It would be ideal to have a dedicated parameter to tweak this. For example,
> something under /sys/kernel/debug/sched/, initialized to
> sysctl_sched_min_granularity by default.

It would be nice to not have to introduce a new knob for this. IMO, min_granularity is reasonable.

Thanks.



> 
> thanks,
> Chenyu
Re: [PATCH v3 1/2] sched/fair: Introduce short duration task check
Posted by Chen Yu 2 years, 9 months ago
Hi Joel,
On 2022-12-03 at 10:35:46 -0500, Joel Fernandes wrote:
> 
> 
> > On Dec 3, 2022, at 2:50 AM, Chen Yu <yu.c.chen@intel.com> wrote:
> > 
> > Hi Honglei,
> >> On 2022-12-02 at 15:44:18 +0800, Honglei Wang wrote:
> >> 
> >> 
> >>> On 2022/12/1 16:44, Chen Yu wrote:
> >>> [...]
> >> 
> >> Hi Yu,
> >> 
> >> I still have a bit of concern about the sysctl_sched_min_granularity
> >> stuff. This knob can be set to different values, which will affect the
> >> behavior of this patch and make things not entirely under control.
> 
> There are already ways to misconfigure sched sysctl to make bad/weird things happen.
> 
> >> Not sure if we can add a new knob for this. The test results show good
> >> improvement for short tasks, and with a dedicated knob, admins would be
> >> able to tune the system based on their own view of a 'short task'.
> >> 
> > It would be ideal to have a dedicated parameter to tweak this. For example,
> > something under /sys/kernel/debug/sched/, initialized to
> > sysctl_sched_min_granularity by default.
> 
> It would be nice to not have to introduce a new knob for this. IMO, min_granularity is reasonable.
>
OK, got it, thanks for the suggestion.

thanks,
Chenyu 
Re: [PATCH v3 1/2] sched/fair: Introduce short duration task check
Posted by Vincent Guittot 2 years, 9 months ago
On Mon, 5 Dec 2022 at 09:39, Chen Yu <yu.c.chen@intel.com> wrote:
>
> Hi Joel,
> On 2022-12-03 at 10:35:46 -0500, Joel Fernandes wrote:
> >
> >
> > > On Dec 3, 2022, at 2:50 AM, Chen Yu <yu.c.chen@intel.com> wrote:
> > >
> > > Hi Honglei,
> > >> On 2022-12-02 at 15:44:18 +0800, Honglei Wang wrote:
> > >>
> > >>
> > >>> On 2022/12/1 16:44, Chen Yu wrote:
> > >>> [...]
> > >>
> > >> Hi Yu,
> > >>
> > >> I still have a bit of concern about the sysctl_sched_min_granularity
> > >> stuff. This knob can be set to different values, which will affect the
> > >> behavior of this patch and make things not entirely under control.
> >
> > There are already ways to misconfigure sched sysctl to make bad/weird things happen.
> >
> > >> Not sure if we can add a new knob for this. The test results show good
> > >> improvement for short tasks, and with a dedicated knob, admins would be
> > >> able to tune the system based on their own view of a 'short task'.
> > >>
> > > It would be ideal to have a dedicated parameter to tweak this. For example,
> > > something under /sys/kernel/debug/sched/, initialized to
> > > sysctl_sched_min_granularity by default.
> >
> > It would be nice to not have to introduce a new knob for this. IMO, min_granularity is reasonable.
> >
> OK, got it, thanks for the suggestion.

Sorry for the late answer.
We don't want to add more dedicated knobs, so using
sysctl_sched_min_granularity as you are doing in this patch looks OK.

>
> thanks,
> Chenyu
Re: [PATCH v3 1/2] sched/fair: Introduce short duration task check
Posted by Josh Don 2 years, 9 months ago
> We don't want to add more dedicated knobs. So using
> sysctl_sched_min_granularity as you are doing in this patch looks ok

If not a knob, then maybe at least a smaller hardcoded value? Several
milliseconds seems like too long for this optimization; the
opportunity cost of not doing the idle search likely doesn't make
sense if the waker is going to be active for several additional
milliseconds. I would have guessed something on the order of 100us.

Chen, what is the run duration of tasks in your ping pong example?
Re: [PATCH v3 1/2] sched/fair: Introduce short duration task check
Posted by Chen Yu 2 years, 9 months ago
Hi Josh,
On 2022-12-06 at 18:23:47 -0800, Josh Don wrote:
> > We don't want to add more dedicated knobs. So using
> > sysctl_sched_min_granularity as you are doing in this patch looks ok
> 
> If not a knob, then maybe at least a smaller hardcoded value? Several
> milliseconds seems like too long for this optimization; the
> opportunity cost of not doing the idle search likely doesn't make
> sense if the waker is going to be active for several additional
> milliseconds. I would have guessed something on the order of 100us.
> 
> Chen, what is the run duration of tasks in your ping pong example?
OK, I'll measure the duration of all the tests and summarize the data later.

thanks,
Chenyu
Re: [PATCH v3 1/2] sched/fair: Introduce short duration task check
Posted by Yicong Yang 2 years, 9 months ago
On 2022/12/7 22:24, Chen Yu wrote:
> Hi Josh,
> On 2022-12-06 at 18:23:47 -0800, Josh Don wrote:
>>> We don't want to add more dedicated knobs. So using
>>> sysctl_sched_min_granularity as you are doing in this patch looks ok
>>
>> If not a knob, then maybe at least a smaller hardcoded value? Several
>> milliseconds seems like too long for this optimization; the
>> opportunity cost of not doing the idle search likely doesn't make
>> sense if the waker is going to be active for several additional
>> milliseconds. I would have guessed something on the order of 100us.
>>
>> Chen, what is the run duration of tasks in your ping pong example?
> OK, I'll measure the duration of all the tests and summarize the data later.
> 

If we decide to use sysctl_sched_min_granularity as the threshold of a short
task, can we also export p->se.dur_avg (somewhere like /proc/[pid]/sched)?
It may be helpful to choose a proper value for sysctl_sched_min_granularity.
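If the field were exported, tooling could read it back easily. Here is a hypothetical userspace sketch; the "se.dur_avg" field name assumes the export proposed above, and nothing in the current kernel emits it yet:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one "key : value" field out of /proc/<pid>/sched-style text.
 * The "se.dur_avg" name used below is hypothetical: it assumes the
 * export proposed in this thread.
 */
static int parse_sched_field(const char *buf, const char *key, double *val)
{
	const char *p = strstr(buf, key);

	if (!p)
		return -1;
	return sscanf(p + strlen(key), " : %lf", val) == 1 ? 0 : -1;
}
```

The `" : %lf"` format tolerates the variable-width padding /proc/[pid]/sched uses between the field name and the colon.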

Thanks.
Re: [PATCH v3 1/2] sched/fair: Introduce short duration task check
Posted by Chen Yu 2 years, 9 months ago
On 2022-12-12 at 19:22:48 +0800, Yicong Yang wrote:
> On 2022/12/7 22:24, Chen Yu wrote:
> > Hi Josh,
> > On 2022-12-06 at 18:23:47 -0800, Josh Don wrote:
> >>> We don't want to add more dedicated knobs. So using
> >>> sysctl_sched_min_granularity as you are doing in this patch looks ok
> >>
> >> If not a knob, then maybe at least a smaller hardcoded value? Several
> >> milliseconds seems like too long for this optimization; the
> >> opportunity cost of not doing the idle search likely doesn't make
> >> sense if the waker is going to be active for several additional
> >> milliseconds. I would have guessed something on the order of 100us.
> >>
> >> Chen, what is the run duration of tasks in your ping pong example?
> > OK, I'll measure the duration of all the tests and summarize the data later.
> > 
> 
> If we decide to use sysctl_sched_min_granularity as the threshold of a short
> task, can we also export p->se.dur_avg (somewhere like /proc/[pid]/sched)?
> It may be helpful to choose a proper value for sysctl_sched_min_granularity.
This looks reasonable; I can add one in the next version.
BTW, I've changed the threshold to (sysctl_sched_min_granularity / 8) in my next
version, as this is the value that fits my previous test case and also doesn't break
the case Josh mentioned. Also, I've changed the place where dur_avg is updated from
put_prev_task() to dequeue_task(), which fixes an issue in v3. I'll send v4
out after the tests finish.

thanks,
Chenyu
> 
> Thanks.
Re: [PATCH v3 1/2] sched/fair: Introduce short duration task check
Posted by Josh Don 2 years, 9 months ago
> BTW, I've changed the threshold to (sysctl_sched_min_granularity / 8) in my next
> version, as this is the value that fits my previous test case and also doesn't break
> the case Josh mentioned.

Do you mean a hardcoded value of some number of micros, or literally
sched_min_granularity / 8? I don't think the latter is necessary, and
indeed can lead to weirdness if min_gran is too small or too large. I
don't think the concept of what a short duration task is should
expand/contract with min_gran.

Best,
Josh
Re: [PATCH v3 1/2] sched/fair: Introduce short duration task check
Posted by Chen Yu 2 years, 9 months ago
On 2022-12-12 at 10:17:35 -0800, Josh Don wrote:
> > BTW, I've changed the threshold to (sysctl_sched_min_granularity / 8) in my next
> > version, as this is the value that fits my previous test case and also doesn't break
> > the case Josh mentioned.
> 
> Do you mean a hardcoded value of some number of micros, or literally
> sched_min_granularity / 8?
The latter. According to the test, the average task duration when system
is under heavy load:
6 ~ 9 us for netperf
7 ~ 70 us for hackbench
7 ~ 8 us for tbench
13 ~ 20 ms for schbench
Overall the durations of the microbenchmarks are quite small (except for schbench).
The default sysctl_sched_min_granularity is 750 us in kernel if no user
has changed it. Then 750 / 8 = 93 us, which is close to what you suggested (100 us).
On the other hand, if someone changes sysctl_sched_min_granularity,
then '8' can be viewed as log2(256). That is, if there are 256 CPUs online,
and the sysctl_sched_min_granularity is changed to 750 us * log2(256) by
the user, we can divide the sysctl_sched_min_granularity by 8 in case the
sysctl_sched_min_granularity is too large.
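To make the arithmetic above concrete, here is a standalone userspace sketch (not kernel code; the helper name is made up for illustration):

```c
#include <assert.h>

/*
 * Sketch of the proposed short-task threshold: one eighth of
 * sysctl_sched_min_granularity.  With the stock default of 750 us
 * (750000 ns), the threshold is 93.75 us, close to the ~100 us
 * suggested earlier in this thread.
 */
static unsigned long long short_task_threshold_ns(unsigned long long min_gran_ns)
{
	return min_gran_ns / 8;
}
```

With the log-scaled default of 3000 us on machines with 8 or more CPUs, the same formula yields 375 us.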

My concern with using a hardcoded value is that this value depends on how fast
the CPU runs (CPU frequency). The values I measured above are from a
CPU running at 1.9 GHz. If a CPU runs faster, a hardcoded value might not
be appropriate and could not be tuned.
> I don't think the latter is necessary, and
> indeed can lead to weirdness if min_gran is too small or too large. I
> don't think the concept of what a short duration task is should
> expand/contract with min_gran.
The value of sysctl_sched_min_granularity might indicate how long the
user would like a task to run at least. If the user enlarges this value,
does it mean the user wants every task in the system to run longer?
From this point I see a connection between the definition of short task
duration and this value. I'm open to changing this value to a fixed one; may
I have more insight into how this value would be set in production environments?

thanks,
Chenyu
> 
> Best,
> Josh
Re: [PATCH v3 1/2] sched/fair: Introduce short duration task check
Posted by Honglei Wang 2 years, 9 months ago

On 2022/12/13 13:46, Chen Yu wrote:
> On 2022-12-12 at 10:17:35 -0800, Josh Don wrote:
>>> BTW, I've changed the threshold to (sysctl_sched_min_granularity / 8) in my next
>>> version, as this is the value that fits my previous test case and also doesn't break
>>> the case Josh mentioned.
>>
>> Do you mean a hardcoded value of some number of micros, or literally
>> sched_min_granularity / 8?
> The latter. According to the test, the average task duration when system
> is under heavy load:
> 6 ~ 9 us for netperf
> 7 ~ 70 us for hackbench
> 7 ~ 8 us for tbench
> 13 ~ 20 ms for schbench
> Overall the durations of the microbenchmarks are quite small (except for schbench).
> The default sysctl_sched_min_granularity is 750 us in kernel if no user
> has changed it. Then 750 / 8 = 93 us, which is close to what you suggested (100 us).
> On the other hand, if someone changes sysctl_sched_min_granularity,
> then '8' can be viewed as log2(256). That is, if there are 256 CPUs online,
> and the sysctl_sched_min_granularity is changed to 750 us * log2(256) by
> the user, we can divide the sysctl_sched_min_granularity by 8 in case the
> sysctl_sched_min_granularity is too large.
> 

Hi Yu,

It seems there is a min_t() call in get_update_sysctl_factor(). In most
cases, we'll get 750 us * (1 + log2(8)) = 3000 us by default, because
sysctl_sched_tunable_scaling is set to '1' by default. (Correct me if I
misunderstand.)
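For readers unfamiliar with that scaling, here is a rough userspace approximation of what get_update_sysctl_factor() does in the logarithmic case (sysctl_sched_tunable_scaling == 1); treat it as a sketch of the logic, not a copy of the kernel source:

```c
#include <assert.h>

/* Integer log2, as the kernel's ilog2() would compute for these inputs. */
static unsigned int ilog2_u(unsigned int v)
{
	unsigned int l = 0;

	while (v >>= 1)
		l++;
	return l;
}

/*
 * Logarithmic scaling of sysctl_sched_min_granularity: the base value is
 * multiplied by 1 + log2(ncpus), with ncpus capped at 8 by the min_t()
 * mentioned above.  So on any machine with >= 8 CPUs the default
 * 750 us becomes 750 * (1 + log2(8)) = 3000 us.
 */
static unsigned int scaled_min_granularity_us(unsigned int base_us,
					      unsigned int ncpus)
{
	unsigned int cpus = ncpus < 8 ? ncpus : 8;

	return base_us * (1 + ilog2_u(cpus));
}
```

Because of the cap, the scaled value saturates at 3000 us no matter how many CPUs are online.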

For the value in production environments, I've seen 10 ms and 3 ms in
different places, FYI. Hope this helps.

Thanks,
Honglei

> My concern with using a hardcoded value is that this value depends on how fast
> the CPU runs (CPU frequency). The values I measured above are from a
> CPU running at 1.9 GHz. If a CPU runs faster, a hardcoded value might not
> be appropriate and could not be tuned.
>> I don't think the latter is necessary, and
>> indeed can lead to weirdness if min_gran is too small or too large. I
>> don't think the concept of what a short duration task is should
>> expand/contract with min_gran.
> The value of sysctl_sched_min_granularity might indicate how long the
> user would like a task to run at least. If the user enlarges this value,
> does it mean the user wants every task in the system to run longer?
> From this point I see a connection between the definition of short task
> duration and this value. I'm open to changing this value to a fixed one; may
> I have more insight into how this value would be set in production environments?
> 
> thanks,
> Chenyu
>>
>> Best,
>> Josh
Re: [PATCH v3 1/2] sched/fair: Introduce short duration task check
Posted by Chen Yu 2 years, 9 months ago
On 2022-12-13 at 18:06:50 +0800, Honglei Wang wrote:
> 
> 
> On 2022/12/13 13:46, Chen Yu wrote:
> > On 2022-12-12 at 10:17:35 -0800, Josh Don wrote:
> > > > BTW, I've changed the threshold to (sysctl_sched_min_granularity / 8) in my next
> > > > version, as this is the value that fits my previous test case and also doesn't break
> > > > the case Josh mentioned.
> > > 
> > > Do you mean a hardcoded value of some number of micros, or literally
> > > sched_min_granularity / 8?
> > The latter. According to the test, the average task duration when system
> > is under heavy load:
> > 6 ~ 9 us for netperf
> > 7 ~ 70 us for hackbench
> > 7 ~ 8 us for tbench
> > 13 ~ 20 ms for schbench
> > Overall the durations of the microbenchmarks are quite small (except for schbench).
> > The default sysctl_sched_min_granularity is 750 us in kernel if no user
> > has changed it. Then 750 / 8 = 93 us, which is close to what you suggested (100 us).
> > On the other hand, if someone changes sysctl_sched_min_granularity,
> > then '8' can be viewed as log2(256). That is, if there are 256 CPUs online,
> > and the sysctl_sched_min_granularity is changed to 750 us * log2(256) by
> > the user, we can divide the sysctl_sched_min_granularity by 8 in case the
> > sysctl_sched_min_granularity is too large.
> > 
> 
> Hi Yu,
> 
> It seems there is a min_t() call in get_update_sysctl_factor(). In most cases,
> we'll get 750 us * (1 + log2(8)) = 3000 us by default, because
> sysctl_sched_tunable_scaling is set to '1' by default. (Correct me if I
> misunderstand.)
>
Thanks for pointing this out! I overlooked this part previously.
So (sysctl_sched_min_granularity / 8) is 375 us by default.
> For the value in production environments, I've seen 10 ms and 3 ms in
> different places, FYI. Hope this helps.
I see. 10 ms would give a short-duration threshold of 1.25 ms. My
understanding is that if the user increases sysctl_sched_min_granularity,
tasks in the system are expected to have longer average durations,
so the short-duration threshold should be raised as well.

thanks,
Chenyu
Re: [PATCH v3 1/2] sched/fair: Introduce short duration task check
Posted by Chen Yu 2 years, 9 months ago
On 2022-12-05 at 10:59:06 +0100, Vincent Guittot wrote:
> On Mon, 5 Dec 2022 at 09:39, Chen Yu <yu.c.chen@intel.com> wrote:
> >
> > Hi Joel,
> > On 2022-12-03 at 10:35:46 -0500, Joel Fernandes wrote:
> > >
> > >
> > > > On Dec 3, 2022, at 2:50 AM, Chen Yu <yu.c.chen@intel.com> wrote:
> > > >
> > > > Hi Honglei,
> > > >> On 2022-12-02 at 15:44:18 +0800, Honglei Wang wrote:
> > > >>
> > > >>
> > > >>> On 2022/12/1 16:44, Chen Yu wrote:
> > > >>> Introduce short-duration task checks, as there is requirement
> > > >>> to leverage this attribute for better task placement.
> > > >>>
> > > >>> There are several choices of metrics that could be used to
> > > >>> indicate if a task is a short-duration task.
> > > >>>
> > > >>> At first thought the (p->se.sum_exec_runtime / p->nvcsw)
> > > >>> could be used to measure the task duration. However, the
> > > >>> history long past was factored too heavily in such a formula.
> > > >>> Ideally, the old activity should decay and not affect
> > > >>> the current status too much.
> > > >>>
> > > >>> Although something based on PELT could be used, se.util_avg might
> > > >>> not be appropriate to describe the task duration:
> > > >>> 1. Task p1 and task p2 are doing frequent ping-pong scheduling on
> > > >>>    one CPU, both p1 and p2 have a short duration, but the util_avg
> > > >>>    can be up to 50%.
> > > >>> 2. Suppose a task lasting less than 4ms is regarded as a short task.
> > > >>>    If task p3 runs for 6ms and sleeps for 32ms, p3 should not be a
> > > >>>    short-duration task. However, PELT would decay p3's accumulated
> > > >>>    running time from 6ms to 3ms, because 32ms is the half-life in PELT.
> > > >>>    As a result, p3 would be incorrectly treated as a short task.
> > > >>>
> > > >>> It was found that there was once a similar feature to track the
> > > >>> duration of a task, which is in Commit ad4b78bbcbab ("sched: Add
> > > >>> new wakeup preemption mode: WAKEUP_RUNNING"). Unfortunately, it
> > > >>> was reverted because it was an experiment. So pick the patch up
> > > >>> again, by recording the average duration when a task voluntarily
> > > >>> switches out. Introduce SIS_SHORT to control this strategy.
> > > >>>
> > > >>> The threshold of short duration reuses sysctl_sched_min_granularity,
> > > >>> so it can be tuned by the user. Ideally there should be a dedicated
> > > >>> parameter for the threshold, but that might introduce complexity.
> > > >>>
> > > >>> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> > > >>> Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > >>> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > > >>> ---
> > > >>>  include/linux/sched.h   |  4 ++++
> > > >>>  kernel/sched/core.c     |  2 ++
> > > >>>  kernel/sched/fair.c     | 17 +++++++++++++++++
> > > >>>  kernel/sched/features.h |  1 +
> > > >>>  4 files changed, 24 insertions(+)
> > > >>>
> > > >>> diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > >>> index ffb6eb55cd13..64b7acb77a11 100644
> > > >>> --- a/include/linux/sched.h
> > > >>> +++ b/include/linux/sched.h
> > > >>> @@ -558,6 +558,10 @@ struct sched_entity {
> > > >>>      u64                nr_migrations;
> > > >>> +    u64                prev_sum_exec_runtime_vol;
> > > >>> +    /* average duration of a task */
> > > >>> +    u64                dur_avg;
> > > >>> +
> > > >>>  #ifdef CONFIG_FAIR_GROUP_SCHED
> > > >>>      int                depth;
> > > >>>      struct sched_entity        *parent;
> > > >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > >>> index daff72f00385..c5202f1be3f7 100644
> > > >>> --- a/kernel/sched/core.c
> > > >>> +++ b/kernel/sched/core.c
> > > >>> @@ -4348,6 +4348,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> > > >>>      p->se.prev_sum_exec_runtime    = 0;
> > > >>>      p->se.nr_migrations        = 0;
> > > >>>      p->se.vruntime            = 0;
> > > >>> +    p->se.dur_avg            = 0;
> > > >>> +    p->se.prev_sum_exec_runtime_vol    = 0;
> > > >>>      INIT_LIST_HEAD(&p->se.group_node);
> > > >>>  #ifdef CONFIG_FAIR_GROUP_SCHED
> > > >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > >>> index e4a0b8bd941c..a4b314b664f8 100644
> > > >>> --- a/kernel/sched/fair.c
> > > >>> +++ b/kernel/sched/fair.c
> > > >>> @@ -6200,6 +6200,16 @@ static int wake_wide(struct task_struct *p)
> > > >>>      return 1;
> > > >>>  }
> > > >>> +/*
> > > >>> + * If a task switches in and then voluntarily relinquishes the
> > > >>> + * CPU quickly, it is regarded as a short duration task.
> > > >>> + */
> > > >>> +static inline int is_short_task(struct task_struct *p)
> > > >>> +{
> > > >>> +    return sched_feat(SIS_SHORT) &&
> > > >>> +        (p->se.dur_avg <= sysctl_sched_min_granularity);
> > > >>> +}
> > > >>> +
> > > >>
> > > >> Hi Yu,
> > > >>
> > > >> I still have a bit of concern about the sysctl_sched_min_granularity stuff..
> > > >> This knob can be set to different values, which will impact the behavior of this
> > > >> patch and make things not totally under control.
> > >
> > > There are already ways to misconfigure sched sysctl to make bad/weird things happen.
> > >
> > > >> Not sure if we can add a new knob for this.. The test results show good
> > > >> improvement for short tasks, and with this knob, admins will be able to
> > > >> customize the system based on their own 'short task' view.
> > > >>
> > > > It would be ideal to have a dedicated parameter to tweak this. For example,
> > > > something under /sys/kernel/debug/sched/, and initialized to sysctl_sched_min_granularity
> > > > by default.
> > >
> > > It would be nice to not have to introduce a new knob for this. IMO, min_granularity is reasonable.
> > >
> > OK, got it, thanks for the suggestion.
> 
> Sorry for the late answer.
> We don't want to add more dedicated knobs. So using
> sysctl_sched_min_granularity as you are doing in this patch looks ok
>
I see, thanks Vincent.

thanks,
Chenyu 
> >
> > thanks,
> > Chenyu