Introduce short-duration task checks, as there is a requirement to
leverage this attribute for better task placement.

There are several choices of metrics that could be used to indicate
whether a task is a short-duration task.

At first thought, (p->se.sum_exec_runtime / p->nvcsw) could be used to
measure the task duration. However, such a formula weighs the
long-past history too heavily. Ideally, old activity should decay and
not affect the current status too much.
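
As a hypothetical illustration: suppose a task has accumulated 1000 ms
of runtime over 250 voluntary switches (a 4 ms average), and then
changes behavior to run only 0.1 ms per switch. After 50 more switches
the formula still reports (1000 + 50 * 0.1) / 300 ~= 3.35 ms, nowhere
near the recent 0.1 ms behavior.
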
Although something based on PELT could be used, se.util_avg might not
be appropriate to describe the task duration:

1. Task p1 and task p2 are doing frequent ping-pong scheduling on one
   CPU; both p1 and p2 have a short duration, but the util_avg can be
   up to 50%.

2. Suppose a task lasting less than 4 ms is regarded as a short task.
   If task p3 runs for 6 ms and sleeps for 32 ms, p3 should not be a
   short-duration task. However, PELT would decay p3's accumulated
   running time from 6 ms to 3 ms, because 32 ms is the half-life in
   PELT. As a result, p3 would be incorrectly treated as a short task.
There was once a similar feature to track the duration of a task, in
commit ad4b78bbcbab ("sched: Add new wakeup preemption mode:
WAKEUP_RUNNING"). Unfortunately, it was reverted because it was only an
experiment. So pick that approach up again, by recording the average
duration when a task voluntarily switches out, and introduce SIS_SHORT
to control this strategy.
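
For reference, update_avg() used below is the existing helper in
kernel/sched/core.c; from memory it is a simple exponentially weighted
moving average, roughly:

/*
 * EWMA with a 1/8 weight for each new sample: the contribution of an
 * old sample decays geometrically, so recent voluntary switch-out
 * durations dominate dur_avg.
 */
static void update_avg(u64 *avg, u64 sample)
{
	s64 diff = sample - *avg;

	*avg += diff / 8;
}

Each new duration sample gets a 1/8 weight, so an old sample loses half
of its influence roughly every five voluntary switches; recent behavior
dominates, which is exactly the decay property argued for above.
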
The threshold for a short duration reuses sysctl_sched_min_granularity,
so it can be tuned by the user. Ideally there should be a dedicated
parameter for the threshold, but that might introduce complexity.
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
include/linux/sched.h | 4 ++++
kernel/sched/core.c | 2 ++
kernel/sched/fair.c | 17 +++++++++++++++++
kernel/sched/features.h | 1 +
4 files changed, 24 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffb6eb55cd13..64b7acb77a11 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -558,6 +558,10 @@ struct sched_entity {
u64 nr_migrations;
+ u64 prev_sum_exec_runtime_vol;
+ /* average duration of a task */
+ u64 dur_avg;
+
#ifdef CONFIG_FAIR_GROUP_SCHED
int depth;
struct sched_entity *parent;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index daff72f00385..c5202f1be3f7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4348,6 +4348,8 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->se.prev_sum_exec_runtime = 0;
p->se.nr_migrations = 0;
p->se.vruntime = 0;
+ p->se.dur_avg = 0;
+ p->se.prev_sum_exec_runtime_vol = 0;
INIT_LIST_HEAD(&p->se.group_node);
#ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e4a0b8bd941c..a4b314b664f8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6200,6 +6200,16 @@ static int wake_wide(struct task_struct *p)
return 1;
}
+/*
+ * If a task switches in and then voluntarily relinquishes the
+ * CPU quickly, it is regarded as a short duration task.
+ */
+static inline int is_short_task(struct task_struct *p)
+{
+ return sched_feat(SIS_SHORT) &&
+ (p->se.dur_avg <= sysctl_sched_min_granularity);
+}
+
/*
* The purpose of wake_affine() is to quickly determine on which CPU we can run
* soonest. For the purpose of speed we only consider the waking and previous
@@ -7680,6 +7690,13 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
struct sched_entity *se = &prev->se;
struct cfs_rq *cfs_rq;
+ if (sched_feat(SIS_SHORT) && !prev->on_rq) {
+ u64 this_dur = se->sum_exec_runtime - se->prev_sum_exec_runtime_vol;
+
+ se->prev_sum_exec_runtime_vol = se->sum_exec_runtime;
+ update_avg(&se->dur_avg, this_dur);
+ }
+
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
put_prev_entity(cfs_rq, se);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index ee7f23c76bd3..efdc29c42161 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -62,6 +62,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
*/
SCHED_FEAT(SIS_PROP, false)
SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_SHORT, true)
/*
* Issue a WARN when we do multiple update_rq_clock() calls
--
2.25.1
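
(For testing: with CONFIG_SCHED_DEBUG, features declared via
SCHED_FEAT() can typically be flipped at runtime, e.g.
"echo NO_SIS_SHORT > /sys/kernel/debug/sched/features" to disable the
new check. The path is recalled from memory; on older kernels it was
/sys/kernel/debug/sched_features.)
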
On 2022/12/1 16:44, Chen Yu wrote:
> [snip]
> +/*
> + * If a task switches in and then voluntarily relinquishes the
> + * CPU quickly, it is regarded as a short duration task.
> + */
> +static inline int is_short_task(struct task_struct *p)
> +{
> +	return sched_feat(SIS_SHORT) &&
> +	       (p->se.dur_avg <= sysctl_sched_min_granularity);
> +}
> +

Hi Yu,

I still have a bit of concern about the sysctl_sched_min_granularity
stuff. This knob can be set to different values, which will impact the
behavior of this patch and make things not totally under control.

Not sure if we can add a new knob for this. The test results show good
improvement for short tasks, and with such a knob, admins would be able
to customize the system based on their own view of a 'short task'.

Thanks,
Honglei

Hi Honglei,

On 2022-12-02 at 15:44:18 +0800, Honglei Wang wrote:
> On 2022/12/1 16:44, Chen Yu wrote:
> [snip]
>
> Hi Yu,
>
> I still have a bit of concern about the sysctl_sched_min_granularity
> stuff. This knob can be set to different values, which will impact the
> behavior of this patch and make things not totally under control.
>
> Not sure if we can add a new knob for this. The test results show good
> improvement for short tasks, and with such a knob, admins would be able
> to customize the system based on their own view of a 'short task'.
>
It would be ideal to have a dedicated parameter to tweak this. For
example, something under /sys/kernel/debug/sched/, initialized to
sysctl_sched_min_granularity by default.

Hi Peter, Vincent, may I have your opinion on whether this is
applicable?

thanks,
Chenyu

> On Dec 3, 2022, at 2:50 AM, Chen Yu <yu.c.chen@intel.com> wrote:
>
> Hi Honglei,
>> On 2022-12-02 at 15:44:18 +0800, Honglei Wang wrote:
>> [snip]
>> I still have a bit of concern about the sysctl_sched_min_granularity
>> stuff. This knob can be set to different values, which will impact the
>> behavior of this patch and make things not totally under control.

There are already ways to misconfigure sched sysctls to make bad/weird
things happen.

>> Not sure if we can add a new knob for this. The test results show good
>> improvement for short tasks, and with such a knob, admins would be able
>> to customize the system based on their own view of a 'short task'.
>>
> It would be ideal to have a dedicated parameter to tweak this. For
> example, something under /sys/kernel/debug/sched/, initialized to
> sysctl_sched_min_granularity by default.

It would be nice to not have to introduce a new knob for this. IMO,
min_granularity is reasonable.

Thanks.

> thanks,
> Chenyu

Hi Joel,

On 2022-12-03 at 10:35:46 -0500, Joel Fernandes wrote:
> [snip]
>
> It would be nice to not have to introduce a new knob for this. IMO,
> min_granularity is reasonable.
>
OK, got it, thanks for the suggestion.

thanks,
Chenyu

On Mon, 5 Dec 2022 at 09:39, Chen Yu <yu.c.chen@intel.com> wrote:
> [snip]
> > > It would be ideal to have a dedicated parameter to tweak this. For
> > > example, something under /sys/kernel/debug/sched/, initialized to
> > > sysctl_sched_min_granularity by default.
> >
> > It would be nice to not have to introduce a new knob for this. IMO,
> > min_granularity is reasonable.
> >
> OK, got it, thanks for the suggestion.

Sorry for the late answer.
We don't want to add more dedicated knobs. So using
sysctl_sched_min_granularity as you are doing in this patch looks ok.

> thanks,
> Chenyu

> We don't want to add more dedicated knobs. So using
> sysctl_sched_min_granularity as you are doing in this patch looks ok

If not a knob, then maybe at least a smaller hardcoded value? Several
milliseconds seems like too long for this optimization; the opportunity
cost of not doing the idle search likely doesn't make sense if the waker
is going to be active for several additional milliseconds. I would have
guessed something on the order of 100us.

Chen, what is the run duration of tasks in your ping-pong example?

Hi Josh,

On 2022-12-06 at 18:23:47 -0800, Josh Don wrote:
> If not a knob, then maybe at least a smaller hardcoded value?
> [snip]
>
> Chen, what is the run duration of tasks in your ping-pong example?

OK, I'll measure the duration of all the tests and summarize the data
later.

thanks,
Chenyu

On 2022/12/7 22:24, Chen Yu wrote:
> Hi Josh,
> On 2022-12-06 at 18:23:47 -0800, Josh Don wrote:
> [snip]
> OK, I'll measure the duration of all the tests and summarize the data
> later.

If we decide to use sysctl_sched_min_granularity as the threshold of a
short task, can we also export p->se.dur_avg (somewhere like
/proc/[pid]/sched)? It may be helpful to choose a proper value for
sysctl_sched_min_granularity.

Thanks.

On 2022-12-12 at 19:22:48 +0800, Yicong Yang wrote:
> [snip]
>
> If we decide to use sysctl_sched_min_granularity as the threshold of a
> short task, can we also export p->se.dur_avg (somewhere like
> /proc/[pid]/sched)? It may be helpful to choose a proper value for
> sysctl_sched_min_granularity.

In my next version I've changed the threshold to
(sysctl_sched_min_granularity / 8), as this is the value that fits my
previous test case and also does not break the case Josh mentioned.
Also, I've moved the place where dur_avg is updated from
put_prev_task() to dequeue_task(), which fixes an issue in v3. Will
send v4 out after the tests finish.

As for the export: this looks reasonable, I can add one in the next
version. A minimal, untested sketch of what I have in mind, with the
surrounding context lines in proc_sched_show_task() recalled from
memory:
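
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ proc_sched_show_task() @@
 	PN(se.exec_start);
 	PN(se.vruntime);
 	PN(se.sum_exec_runtime);
+	PN(se.dur_avg);

PN() prints nanosecond-resolution values, which matches dur_avg being
accumulated from sum_exec_runtime.

thanks,
Chenyu
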
> In my next version I've changed the threshold to
> (sysctl_sched_min_granularity / 8), as this is the value that fits my
> previous test case and also does not break the case Josh mentioned.

Do you mean a hardcoded value of some number of micros, or literally
sched_min_granularity / 8? I don't think the latter is necessary, and
indeed it can lead to weirdness if min_gran is too small or too large.
I don't think the concept of what a short duration task is should
expand/contract with min_gran.

Best,
Josh

On 2022-12-12 at 10:17:35 -0800, Josh Don wrote:
> > In my next version I've changed the threshold to
> > (sysctl_sched_min_granularity / 8), as this is the value that fits
> > my previous test case and also does not break the case Josh
> > mentioned.
>
> Do you mean a hardcoded value of some number of micros, or literally
> sched_min_granularity / 8?

The latter. According to the tests, the average task duration when the
system is under heavy load is:

   6 ~  9 us for netperf
   7 ~ 70 us for hackbench
   7 ~  8 us for tbench
  13 ~ 20 ms for schbench

Overall the durations in these microbenchmarks are quite small (except
for schbench). The default sysctl_sched_min_granularity is 750 us in
the kernel if no user has changed it, and 750 / 8 = 93 us, which is
close to what you suggested (100 us). On the other hand, if someone
changes sysctl_sched_min_granularity, then '8' can be viewed as
log2(256). That is, if there are 256 CPUs online and the user raises
sysctl_sched_min_granularity to 750 us * log2(256), we can divide
sysctl_sched_min_granularity by 8 in case it is too large.

My concern with a hardcoded value is that it depends on how fast the
CPU runs (CPU frequency). The values measured above are with the CPU
running at 1.9 GHz. If a CPU runs faster, a hardcoded value might not
be appropriate and could not be tuned.

> I don't think the latter is necessary, and indeed it can lead to
> weirdness if min_gran is too small or too large. I don't think the
> concept of what a short duration task is should expand/contract with
> min_gran.

The value of sysctl_sched_min_granularity might indicate how long the
user would like a task to run at least. If the user enlarges this
value, does it mean the user wants every task in the system to run
longer? From this point of view I see a connection between the
definition of a short task duration and this value. I'm open to
changing it to a fixed value; may I have more insight into how this
value would be set in a production environment?

thanks,
Chenyu

On 2022/12/13 13:46, Chen Yu wrote:
> [snip]
>
> The default sysctl_sched_min_granularity is 750 us in the kernel if no
> user has changed it, and 750 / 8 = 93 us, which is close to what you
> suggested (100 us).

Hi Yu,

Seems there is a min_t() call in get_update_sysctl_factor(). In most
cases we'll get 750 us * (1 + log2(8)) = 3000 us by default, because
sysctl_sched_tunable_scaling defaults to '1', i.e. logarithmic scaling.
(Correct me if I misunderstand.)

For the value in production environments, I've seen 10 ms and 3 ms in
different places, FYI. Hope this helps.

From my reading of kernel/sched/fair.c, the factor is computed roughly
like this (quoted from memory, so treat it as a sketch):
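
static unsigned int get_update_sysctl_factor(void)
{
	/* the online CPU count is clamped at 8 before scaling */
	unsigned int cpus = min_t(unsigned int, num_online_cpus(), 8);
	unsigned int factor;

	switch (sysctl_sched_tunable_scaling) {
	case SCHED_TUNABLESCALING_NONE:
		factor = 1;
		break;
	case SCHED_TUNABLESCALING_LINEAR:
		factor = cpus;
		break;
	case SCHED_TUNABLESCALING_LOG:
	default:
		factor = 1 + ilog2(cpus);
		break;
	}

	return factor;
}

So on any machine with 8 or more CPUs the logarithmic factor saturates
at 1 + ilog2(8) = 4, which is where 750 us * 4 = 3000 us comes from.

Thanks,
Honglei
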
On 2022-12-13 at 18:06:50 +0800, Honglei Wang wrote:
> [snip]
>
> Seems there is a min_t() call in get_update_sysctl_factor(). In most
> cases we'll get 750 us * (1 + log2(8)) = 3000 us by default, because
> sysctl_sched_tunable_scaling defaults to '1', i.e. logarithmic
> scaling.
>
Thanks for pointing this out! I overlooked this part previously. So
(sysctl_sched_min_granularity / 8) is 375 us by default.

> For the value in production environments, I've seen 10 ms and 3 ms in
> different places, FYI. Hope this helps.

I see. A 10 ms min_granularity would generate a short-duration
threshold of 1.25 ms. From my understanding, if the user increases
sysctl_sched_min_granularity, then tasks with longer average durations
are expected in the system, and the short-duration threshold should
also be raised.

thanks,
Chenyu

On 2022-12-05 at 10:59:06 +0100, Vincent Guittot wrote:
> On Mon, 5 Dec 2022 at 09:39, Chen Yu <yu.c.chen@intel.com> wrote:
> [snip]
>
> Sorry for the late answer.
> We don't want to add more dedicated knobs. So using
> sysctl_sched_min_granularity as you are doing in this patch looks ok.

I see, thanks Vincent.

thanks,
Chenyu