[v2] sched/fair: sanitize vruntime of entity being migrated

[PATCH v2] sched/fair: sanitize vruntime of entity being migrated

Posted by Vincent Guittot 2 years, 11 months ago

Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
fixes an overflowing bug, but ignore a case that se->exec_start is reset
after a migration.

For fixing this case, we delay the reset of se->exec_start after
placing the entity which se->exec_start to detect long sleeping task.

In order to take into account a possible divergence between the clock_task
of 2 rqs, we increase the threshold to around 104 days.


Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---

My last proposal was not yet correct as the exec_start was not always
reset after migrating a task. I finally found this solution which keeps
the long sleep detection to one place as well as the reset of se->exec_start.

 kernel/sched/fair.c | 57 +++++++++++++++++++++++++++++++++++----------
 1 file changed, 45 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f499e9a74b5..421173fde351 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4648,11 +4648,33 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #endif
 }
 
+static inline bool entity_is_long_sleeper(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq;
+	u64 sleep_time;
+
+	if (se->exec_start == 0)
+		return false;
+
+	cfs_rq = cfs_rq_of(se);
+
+	sleep_time = rq_clock_task(rq_of(cfs_rq));
+
+	/* Happen while migrating because of clock task divergence */
+	if (sleep_time <= se->exec_start)
+		return false;
+
+	sleep_time -= se->exec_start;
+	if (sleep_time > ((1ULL << 63) / scale_load_down(NICE_0_LOAD)))
+		return true;
+
+	return false;
+}
+
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
 	u64 vruntime = cfs_rq->min_vruntime;
-	u64 sleep_time;
 
 	/*
 	 * The 'current' period is already promised to the current tasks,
@@ -4684,13 +4706,24 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 
 	/*
 	 * Pull vruntime of the entity being placed to the base level of
-	 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
-	 * slept for a long time, don't even try to compare its vruntime with
-	 * the base as it may be too far off and the comparison may get
-	 * inversed due to s64 overflow.
-	 */
-	sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
-	if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
+	 * cfs_rq, to prevent boosting it if placed backwards.
+	 * However, min_vruntime can advance much faster than real time, with
+	 * the extreme being when an entity with the minimal weight always runs
+	 * on the cfs_rq. If the waking entity slept for a long time, its
+	 * vruntime difference from min_vruntime may overflow s64 and their
+	 * comparison may get inversed, so ignore the entity's original
+	 * vruntime in that case.
+	 * The maximal vruntime speedup is given by the ratio of normal to
+	 * minimal weight: scale_load_down(NICE_0_LOAD) / MIN_SHARES.
+	 * When placing a migrated waking entity, its exec_start has been set
+	 * from a different rq. In order to take into account a possible
+	 * divergence between new and prev rq's clocks task because of irq and
+	 * stolen time, we take an additional margin.
+	 * So, cutting off on the sleep time of
+	 *     2^63 / scale_load_down(NICE_0_LOAD) ~ 104 days
+	 * should be safe.
+	 */
+	if (entity_is_long_sleeper(se))
 		se->vruntime = vruntime;
 	else
 		se->vruntime = max_vruntime(se->vruntime, vruntime);
@@ -4770,6 +4803,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	if (flags & ENQUEUE_WAKEUP)
 		place_entity(cfs_rq, se, 0);
+	/* Entity has migrated, no longer consider this task hot */
+	if (flags & ENQUEUE_MIGRATED)
+		se->exec_start = 0;
 
 	check_schedstat_required();
 	update_stats_enqueue_fair(cfs_rq, se, flags);
@@ -7665,9 +7701,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 	/* Tell new CPU we are migrated */
 	se->avg.last_update_time = 0;
 
-	/* We have migrated, no longer consider this task hot */
-	se->exec_start = 0;
-
 	update_scan_period(p, new_cpu);
 }
 
@@ -8701,7 +8734,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
 	lockdep_assert_rq_held(rq);
 
 	WARN_ON_ONCE(task_rq(p) != rq);
-	activate_task(rq, p, ENQUEUE_NOCLOCK);
+	activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
 	check_preempt_curr(rq, p, 0);
 }
 
-- 
2.34.1

Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

Posted by Chen Yu 2 years, 10 months ago

On 2023-03-17 at 17:08:10 +0100, Vincent Guittot wrote:
> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> after a migration.
> 
> For fixing this case, we delay the reset of se->exec_start after
> placing the entity which se->exec_start to detect long sleeping task.
> 
> In order to take into account a possible divergence between the clock_task
> of 2 rqs, we increase the threshold to around 104 days.
> 
> 
> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
This patch has been confirmed by 0day to restore the performance which was decreased
by 829c1651e9c4[1], thanks for the fix.

commit:     a2e90611b9f425ad            829c1651e9c4a6f78398d3e6765          a53ce18cacb477dd0513c607f18

hackbench
throughput: 173811           -18.4%     141887                     +1.4%     176324

[1] https://lore.kernel.org/lkml/202303091155.672f546a-yujie.liu@intel.com/

thanks,
Chenyu

Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

Posted by Peter Zijlstra 2 years, 10 months ago

On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> after a migration.
> 
> For fixing this case, we delay the reset of se->exec_start after
> placing the entity which se->exec_start to detect long sleeping task.
> 
> In order to take into account a possible divergence between the clock_task
> of 2 rqs, we increase the threshold to around 104 days.
> 
> 
> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Just noticed, that SoB chain isn't valid, should Zhang be something
like: Originally-by or Suggested-by or whatever?

Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

Posted by Vincent Guittot 2 years, 10 months ago

On Tue, 21 Mar 2023 at 13:28, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
> > Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> > fixes an overflowing bug, but ignore a case that se->exec_start is reset
> > after a migration.
> >
> > For fixing this case, we delay the reset of se->exec_start after
> > placing the entity which se->exec_start to detect long sleeping task.
> >
> > In order to take into account a possible divergence between the clock_task
> > of 2 rqs, we increase the threshold to around 104 days.
> >
> >
> > Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> > Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> Just noticed, that SoB chain isn't valid, should Zhang be something
> like: Originally-by or Suggested-by or whatever?

Originally-by would be the right tag

Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

Posted by Peter Zijlstra 2 years, 10 months ago

On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> after a migration.
> 
> For fixing this case, we delay the reset of se->exec_start after
> placing the entity which se->exec_start to detect long sleeping task.
> 
> In order to take into account a possible divergence between the clock_task
> of 2 rqs, we increase the threshold to around 104 days.
> 
> 
> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---

Blergh, this just isn't going to be nice. I'll go queue this for
sched/urgent and then we can forget about this for a little while.

Thanks!

Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

Posted by Dietmar Eggemann 2 years, 10 months ago

On 21/03/2023 11:02, Peter Zijlstra wrote:
> On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
>> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
>> fixes an overflowing bug, but ignore a case that se->exec_start is reset
>> after a migration.
>>
>> For fixing this case, we delay the reset of se->exec_start after
>> placing the entity which se->exec_start to detect long sleeping task.
>>
>> In order to take into account a possible divergence between the clock_task
>> of 2 rqs, we increase the threshold to around 104 days.
>>
>>
>> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
>> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
> 
> Blergh, this just isn't going to be nice. I'll go queue this for
> sched/urgent and then we can forget about this for a little while.
> 
> Thanks!

Don't we miss setting `se->exec_start = 0` for fair task in
move_queued_task()? ( ... and __migrate_swap_task())

https://lkml.kernel.org/r/df2cccda-1550-b06b-aa74-e0f054e9fb9d@arm.com

Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

Posted by Peter Zijlstra 2 years, 10 months ago

On Tue, Mar 21, 2023 at 11:29:13AM +0100, Dietmar Eggemann wrote:
> On 21/03/2023 11:02, Peter Zijlstra wrote:
> > On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
> >> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> >> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> >> after a migration.
> >>
> >> For fixing this case, we delay the reset of se->exec_start after
> >> placing the entity which se->exec_start to detect long sleeping task.
> >>
> >> In order to take into account a possible divergence between the clock_task
> >> of 2 rqs, we increase the threshold to around 104 days.
> >>
> >>
> >> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> >> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >> ---
> > 
> > Blergh, this just isn't going to be nice. I'll go queue this for
> > sched/urgent and then we can forget about this for a little while.
> > 
> > Thanks!
> 
> Don't we miss setting `se->exec_start = 0` for fair task in
> move_queued_task()? ( ... and __migrate_swap_task())
> 
> https://lkml.kernel.org/r/df2cccda-1550-b06b-aa74-e0f054e9fb9d@arm.com

Ah, I see what you mean now... When I read your and Vincent's replies
earlier today I though you mean to avoid the extra ENQUEUE_MIGRATED use,
but your actual goal was to capure more sites.

Hmm, we could of course go add more ENQUEUE_MIGRATED, but you're right
in that TASK_ON_RQ_MIGRATING already captures that.

An alternative is something like the below, that matches
deactivate_task(), but still uses ENQUEUE_MIGRATED to pass it down into
the class methods.

Hmm?


--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2084,6 +2084,9 @@ static inline void dequeue_task(struct r
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (task_on_rq_migrating(p))
+		flags |= ENQUEUE_MIGRATED;
+
 	enqueue_task(rq, p, flags);
 
 	p->on_rq = TASK_ON_RQ_QUEUED;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8726,7 +8726,7 @@ static void attach_task(struct rq *rq, s
 	lockdep_assert_rq_held(rq);
 
 	WARN_ON_ONCE(task_rq(p) != rq);
-	activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
+	activate_task(rq, p, ENQUEUE_NOCLOCK);
 	check_preempt_curr(rq, p, 0);
 }

Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

Posted by Dietmar Eggemann 2 years, 10 months ago

On 21/03/2023 11:49, Peter Zijlstra wrote:
> On Tue, Mar 21, 2023 at 11:29:13AM +0100, Dietmar Eggemann wrote:
>> On 21/03/2023 11:02, Peter Zijlstra wrote:
>>> On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
>>>> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
>>>> fixes an overflowing bug, but ignore a case that se->exec_start is reset
>>>> after a migration.
>>>>
>>>> For fixing this case, we delay the reset of se->exec_start after
>>>> placing the entity which se->exec_start to detect long sleeping task.
>>>>
>>>> In order to take into account a possible divergence between the clock_task
>>>> of 2 rqs, we increase the threshold to around 104 days.
>>>>
>>>>
>>>> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
>>>> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
>>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>>> ---
>>>
>>> Blergh, this just isn't going to be nice. I'll go queue this for
>>> sched/urgent and then we can forget about this for a little while.
>>>
>>> Thanks!
>>
>> Don't we miss setting `se->exec_start = 0` for fair task in
>> move_queued_task()? ( ... and __migrate_swap_task())
>>
>> https://lkml.kernel.org/r/df2cccda-1550-b06b-aa74-e0f054e9fb9d@arm.com
> 
> Ah, I see what you mean now... When I read your and Vincent's replies
> earlier today I though you mean to avoid the extra ENQUEUE_MIGRATED use,
> but your actual goal was to capure more sites.
> 
> Hmm, we could of course go add more ENQUEUE_MIGRATED, but you're right
> in that TASK_ON_RQ_MIGRATING already captures that.

And in case of move_queued_task() this would have to be conditioned on
SCHED_NORMAL.

> An alternative is something like the below, that matches
> deactivate_task(), but still uses ENQUEUE_MIGRATED to pass it down into
> the class methods.
> 
> Hmm?
> 
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2084,6 +2084,9 @@ static inline void dequeue_task(struct r
>  
>  void activate_task(struct rq *rq, struct task_struct *p, int flags)
>  {
> +	if (task_on_rq_migrating(p))
> +		flags |= ENQUEUE_MIGRATED;
> +
>  	enqueue_task(rq, p, flags);
>  
>  	p->on_rq = TASK_ON_RQ_QUEUED;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8726,7 +8726,7 @@ static void attach_task(struct rq *rq, s
>  	lockdep_assert_rq_held(rq);
>  
>  	WARN_ON_ONCE(task_rq(p) != rq);
> -	activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
> +	activate_task(rq, p, ENQUEUE_NOCLOCK);
>  	check_preempt_curr(rq, p, 0);
>  }

Would work too.

IMHO, setting `se->exec_start = 0` for task_on_rq_migrating(p) already
in migrate_task_rq_fair() would have the charm that
entity_is_long_sleeper() would bail out early for these tasks.

Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

Posted by Peter Zijlstra 2 years, 10 months ago

On Tue, Mar 21, 2023 at 12:13:06PM +0100, Dietmar Eggemann wrote:
> On 21/03/2023 11:49, Peter Zijlstra wrote:
> > On Tue, Mar 21, 2023 at 11:29:13AM +0100, Dietmar Eggemann wrote:
> >> On 21/03/2023 11:02, Peter Zijlstra wrote:
> >>> On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
> >>>> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> >>>> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> >>>> after a migration.
> >>>>
> >>>> For fixing this case, we delay the reset of se->exec_start after
> >>>> placing the entity which se->exec_start to detect long sleeping task.
> >>>>
> >>>> In order to take into account a possible divergence between the clock_task
> >>>> of 2 rqs, we increase the threshold to around 104 days.
> >>>>
> >>>>
> >>>> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> >>>> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> >>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >>>> ---
> >>>
> >>> Blergh, this just isn't going to be nice. I'll go queue this for
> >>> sched/urgent and then we can forget about this for a little while.
> >>>
> >>> Thanks!
> >>
> >> Don't we miss setting `se->exec_start = 0` for fair task in
> >> move_queued_task()? ( ... and __migrate_swap_task())
> >>
> >> https://lkml.kernel.org/r/df2cccda-1550-b06b-aa74-e0f054e9fb9d@arm.com
> > 
> > Ah, I see what you mean now... When I read your and Vincent's replies
> > earlier today I though you mean to avoid the extra ENQUEUE_MIGRATED use,
> > but your actual goal was to capure more sites.
> > 
> > Hmm, we could of course go add more ENQUEUE_MIGRATED, but you're right
> > in that TASK_ON_RQ_MIGRATING already captures that.
> 
> And in case of move_queued_task() this would have to be conditioned on
> SCHED_NORMAL.

I would prefer to not do that -- keep uniform rules. AFAICT the only
other user of ENQUEUE_MIGRATED is deadline and that needs
ENQUEUE_WAKEUP|ENQUEUE_MIGRATED combination to be effective and I don't
think we've added any of those.

> > An alternative is something like the below, that matches
> > deactivate_task(), but still uses ENQUEUE_MIGRATED to pass it down into
> > the class methods.
> > 
> > Hmm?
> > 
> > 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2084,6 +2084,9 @@ static inline void dequeue_task(struct r
> >  
> >  void activate_task(struct rq *rq, struct task_struct *p, int flags)
> >  {
> > +	if (task_on_rq_migrating(p))
> > +		flags |= ENQUEUE_MIGRATED;
> > +
> >  	enqueue_task(rq, p, flags);
> >  
> >  	p->on_rq = TASK_ON_RQ_QUEUED;
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8726,7 +8726,7 @@ static void attach_task(struct rq *rq, s
> >  	lockdep_assert_rq_held(rq);
> >  
> >  	WARN_ON_ONCE(task_rq(p) != rq);
> > -	activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
> > +	activate_task(rq, p, ENQUEUE_NOCLOCK);
> >  	check_preempt_curr(rq, p, 0);
> >  }
> 
> Would work too.

OK, let me fold this in and then we can always tinker with it later if
we're so motivated :-)

Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

Posted by Vincent Guittot 2 years, 10 months ago

On Tue, 21 Mar 2023 at 11:50, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Mar 21, 2023 at 11:29:13AM +0100, Dietmar Eggemann wrote:
> > On 21/03/2023 11:02, Peter Zijlstra wrote:
> > > On Fri, Mar 17, 2023 at 05:08:10PM +0100, Vincent Guittot wrote:
> > >> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> > >> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> > >> after a migration.
> > >>
> > >> For fixing this case, we delay the reset of se->exec_start after
> > >> placing the entity which se->exec_start to detect long sleeping task.
> > >>
> > >> In order to take into account a possible divergence between the clock_task
> > >> of 2 rqs, we increase the threshold to around 104 days.
> > >>
> > >>
> > >> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> > >> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> > >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > >> ---
> > >
> > > Blergh, this just isn't going to be nice. I'll go queue this for
> > > sched/urgent and then we can forget about this for a little while.
> > >
> > > Thanks!
> >
> > Don't we miss setting `se->exec_start = 0` for fair task in
> > move_queued_task()? ( ... and __migrate_swap_task())
> >
> > https://lkml.kernel.org/r/df2cccda-1550-b06b-aa74-e0f054e9fb9d@arm.com
>
> Ah, I see what you mean now... When I read your and Vincent's replies
> earlier today I though you mean to avoid the extra ENQUEUE_MIGRATED use,
> but your actual goal was to capure more sites.
>
> Hmm, we could of course go add more ENQUEUE_MIGRATED, but you're right
> in that TASK_ON_RQ_MIGRATING already captures that.
>
> An alternative is something like the below, that matches
> deactivate_task(), but still uses ENQUEUE_MIGRATED to pass it down into
> the class methods.
>
> Hmm?

Yes, this seems to be the right way to set ENQUEUE_MIGRATED flags

>
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2084,6 +2084,9 @@ static inline void dequeue_task(struct r
>
>  void activate_task(struct rq *rq, struct task_struct *p, int flags)
>  {
> +       if (task_on_rq_migrating(p))
> +               flags |= ENQUEUE_MIGRATED;
> +
>         enqueue_task(rq, p, flags);
>
>         p->on_rq = TASK_ON_RQ_QUEUED;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8726,7 +8726,7 @@ static void attach_task(struct rq *rq, s
>         lockdep_assert_rq_held(rq);
>
>         WARN_ON_ONCE(task_rq(p) != rq);
> -       activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
> +       activate_task(rq, p, ENQUEUE_NOCLOCK);
>         check_preempt_curr(rq, p, 0);
>  }
>

Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

Posted by Zhang Qiao 2 years, 10 months ago


在 2023/3/18 0:08, Vincent Guittot 写道:
> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> after a migration.
> 
> For fixing this case, we delay the reset of se->exec_start after
> placing the entity which se->exec_start to detect long sleeping task.
> 
> In order to take into account a possible divergence between the clock_task
> of 2 rqs, we increase the threshold to around 104 days.
> 
> 
> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> 
> My last proposal was not yet correct as the exec_start was not always
> reset after migrating a task. I finally found this solution which keeps
> the long sleep detection to one place as well as the reset of se->exec_start.
> 

Tested-by: Zhang Qiao <zhangqiao22@huawei.com>

I have retested it with this version, and the result is fine.

-------

 Performance counter stats for 'hackbench -g 44 -f 20 --process --pipe -l 60000 -s 100' (10 runs):

             80.10 +- 1.22 seconds time elapsed  ( +-  1.53% )


Thanks,
Zhang Qiao.

>  kernel/sched/fair.c | 57 +++++++++++++++++++++++++++++++++++----------
>  1 file changed, 45 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0f499e9a74b5..421173fde351 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4648,11 +4648,33 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  #endif
>  }
>  
> +static inline bool entity_is_long_sleeper(struct sched_entity *se)
> +{
> +	struct cfs_rq *cfs_rq;
> +	u64 sleep_time;
> +
> +	if (se->exec_start == 0)
> +		return false;
> +
> +	cfs_rq = cfs_rq_of(se);
> +
> +	sleep_time = rq_clock_task(rq_of(cfs_rq));
> +
> +	/* Happen while migrating because of clock task divergence */
> +	if (sleep_time <= se->exec_start)
> +		return false;
> +
> +	sleep_time -= se->exec_start;
> +	if (sleep_time > ((1ULL << 63) / scale_load_down(NICE_0_LOAD)))
> +		return true;
> +
> +	return false;
> +}
> +
>  static void
>  place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>  {
>  	u64 vruntime = cfs_rq->min_vruntime;
> -	u64 sleep_time;
>  
>  	/*
>  	 * The 'current' period is already promised to the current tasks,
> @@ -4684,13 +4706,24 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
>  
>  	/*
>  	 * Pull vruntime of the entity being placed to the base level of
> -	 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
> -	 * slept for a long time, don't even try to compare its vruntime with
> -	 * the base as it may be too far off and the comparison may get
> -	 * inversed due to s64 overflow.
> -	 */
> -	sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
> -	if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
> +	 * cfs_rq, to prevent boosting it if placed backwards.
> +	 * However, min_vruntime can advance much faster than real time, with
> +	 * the extreme being when an entity with the minimal weight always runs
> +	 * on the cfs_rq. If the waking entity slept for a long time, its
> +	 * vruntime difference from min_vruntime may overflow s64 and their
> +	 * comparison may get inversed, so ignore the entity's original
> +	 * vruntime in that case.
> +	 * The maximal vruntime speedup is given by the ratio of normal to
> +	 * minimal weight: scale_load_down(NICE_0_LOAD) / MIN_SHARES.
> +	 * When placing a migrated waking entity, its exec_start has been set
> +	 * from a different rq. In order to take into account a possible
> +	 * divergence between new and prev rq's clocks task because of irq and
> +	 * stolen time, we take an additional margin.
> +	 * So, cutting off on the sleep time of
> +	 *     2^63 / scale_load_down(NICE_0_LOAD) ~ 104 days
> +	 * should be safe.
> +	 */
> +	if (entity_is_long_sleeper(se))
>  		se->vruntime = vruntime;
>  	else
>  		se->vruntime = max_vruntime(se->vruntime, vruntime);
> @@ -4770,6 +4803,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  
>  	if (flags & ENQUEUE_WAKEUP)
>  		place_entity(cfs_rq, se, 0);
> +	/* Entity has migrated, no longer consider this task hot */
> +	if (flags & ENQUEUE_MIGRATED)
> +		se->exec_start = 0;
>  
>  	check_schedstat_required();
>  	update_stats_enqueue_fair(cfs_rq, se, flags);
> @@ -7665,9 +7701,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>  	/* Tell new CPU we are migrated */
>  	se->avg.last_update_time = 0;
>  
> -	/* We have migrated, no longer consider this task hot */
> -	se->exec_start = 0;
> -
>  	update_scan_period(p, new_cpu);
>  }
>  
> @@ -8701,7 +8734,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
>  	lockdep_assert_rq_held(rq);
>  
>  	WARN_ON_ONCE(task_rq(p) != rq);
> -	activate_task(rq, p, ENQUEUE_NOCLOCK);
> +	activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
>  	check_preempt_curr(rq, p, 0);
>  }
>  
>

Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

Posted by Dietmar Eggemann 2 years, 10 months ago

On 18/03/2023 08:45, Zhang Qiao wrote:
> 
> 
> 在 2023/3/18 0:08, Vincent Guittot 写道:
>> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
>> fixes an overflowing bug, but ignore a case that se->exec_start is reset
>> after a migration.
>>
>> For fixing this case, we delay the reset of se->exec_start after
>> placing the entity which se->exec_start to detect long sleeping task.
>>
>> In order to take into account a possible divergence between the clock_task
>> of 2 rqs, we increase the threshold to around 104 days.
>>
>>
>> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
>> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>
>> My last proposal was not yet correct as the exec_start was not always
>> reset after migrating a task. I finally found this solution which keeps
>> the long sleep detection to one place as well as the reset of se->exec_start.
>>
> 
> Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
> 
> I have retested it with this version, and the result is fine.
> 
> -------
> 
>  Performance counter stats for 'hackbench -g 44 -f 20 --process --pipe -l 60000 -s 100' (10 runs):
> 
>              80.10 +- 1.22 seconds time elapsed  ( +-  1.53% )

[...]

>> @@ -8701,7 +8734,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
>>  	lockdep_assert_rq_held(rq);
>>  
>>  	WARN_ON_ONCE(task_rq(p) != rq);
>> -	activate_task(rq, p, ENQUEUE_NOCLOCK);
>> +	activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
>>  	check_preempt_curr(rq, p, 0);

Why not:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9bc1ab67aaa..96dd3a62e683 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7674,7 +7674,10 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 		se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
 	}
 
-	if (!task_on_rq_migrating(p)) {
+	if (task_on_rq_migrating(p)) {
+		/* We have migrated, no longer consider this task hot */
+		se->exec_start = 0;
+	} else {
 		remove_entity_load_avg(se);
 
 		/*
@@ -8726,7 +8729,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
 	lockdep_assert_rq_held(rq);
 
 	WARN_ON_ONCE(task_rq(p) != rq);
-	activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
+	activate_task(rq, p, ENQUEUE_NOCLOCK);
 	check_preempt_curr(rq, p, 0);
 }


entity_is_long_sleeper() will bail early for these rq-migrating tasks
for which a long-sleep test would make little sense anyway.

Plus move_queued_task() (e.g. from sched_exex()) would be covered as well.

Re: [PATCH v2] sched/fair: sanitize vruntime of entity being migrated

Posted by Vincent Guittot 2 years, 10 months ago

On Mon, 20 Mar 2023 at 13:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 18/03/2023 08:45, Zhang Qiao wrote:
> >
> >
> > 在 2023/3/18 0:08, Vincent Guittot 写道:
> >> Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> >> fixes an overflowing bug, but ignore a case that se->exec_start is reset
> >> after a migration.
> >>
> >> For fixing this case, we delay the reset of se->exec_start after
> >> placing the entity which se->exec_start to detect long sleeping task.
> >>
> >> In order to take into account a possible divergence between the clock_task
> >> of 2 rqs, we increase the threshold to around 104 days.
> >>
> >>
> >> Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
> >> Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >> ---
> >>
> >> My last proposal was not yet correct as the exec_start was not always
> >> reset after migrating a task. I finally found this solution which keeps
> >> the long sleep detection to one place as well as the reset of se->exec_start.
> >>
> >
> > Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
> >
> > I have retested it with this version, and the result is fine.
> >
> > -------
> >
> >  Performance counter stats for 'hackbench -g 44 -f 20 --process --pipe -l 60000 -s 100' (10 runs):
> >
> >              80.10 +- 1.22 seconds time elapsed  ( +-  1.53% )
>
> [...]
>
> >> @@ -8701,7 +8734,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
> >>      lockdep_assert_rq_held(rq);
> >>
> >>      WARN_ON_ONCE(task_rq(p) != rq);
> >> -    activate_task(rq, p, ENQUEUE_NOCLOCK);
> >> +    activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
> >>      check_preempt_curr(rq, p, 0);
>
> Why not:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b9bc1ab67aaa..96dd3a62e683 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7674,7 +7674,10 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>                 se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
>         }
>
> -       if (!task_on_rq_migrating(p)) {
> +       if (task_on_rq_migrating(p)) {
> +               /* We have migrated, no longer consider this task hot */
> +               se->exec_start = 0;

mainly to keep the clear of se->exec_start = 0 in one place to ease
the maintenance


> +       } else {
>                 remove_entity_load_avg(se);
>
>                 /*
> @@ -8726,7 +8729,7 @@ static void attach_task(struct rq *rq, struct task_struct *p)
>         lockdep_assert_rq_held(rq);
>
>         WARN_ON_ONCE(task_rq(p) != rq);
> -       activate_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_MIGRATED);
> +       activate_task(rq, p, ENQUEUE_NOCLOCK);
>         check_preempt_curr(rq, p, 0);
>  }
>
>
> entity_is_long_sleeper() will bail early for these rq-migrating tasks
> for which a long-sleep test would make little sense anyway.
>
> Plus move_queued_task() (e.g. from sched_exex()) would be covered as well.

[tip: sched/urgent] sched/fair: Sanitize vruntime of entity being migrated

Posted by tip-bot2 for Vincent Guittot 2 years, 10 months ago

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     a53ce18cacb477dd0513c607f187d16f0fa96f71
Gitweb:        https://git.kernel.org/tip/a53ce18cacb477dd0513c607f187d16f0fa96f71
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Fri, 17 Mar 2023 17:08:10 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 21 Mar 2023 14:43:04 +01:00

sched/fair: Sanitize vruntime of entity being migrated

Commit 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
fixes an overflowing bug, but ignore a case that se->exec_start is reset
after a migration.

For fixing this case, we delay the reset of se->exec_start after
placing the entity which se->exec_start to detect long sleeping task.

In order to take into account a possible divergence between the clock_task
of 2 rqs, we increase the threshold to around 104 days.

Fixes: 829c1651e9c4 ("sched/fair: sanitize vruntime of entity being placed")
Originally-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Zhang Qiao <zhangqiao22@huawei.com>
Link: https://lore.kernel.org/r/20230317160810.107988-1-vincent.guittot@linaro.org
---
 kernel/sched/core.c |  3 ++-
 kernel/sched/fair.c | 55 +++++++++++++++++++++++++++++++++++---------
 2 files changed, 47 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 488655f..0d18c39 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2084,6 +2084,9 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (task_on_rq_migrating(p))
+		flags |= ENQUEUE_MIGRATED;
+
 	enqueue_task(rq, p, flags);
 
 	p->on_rq = TASK_ON_RQ_QUEUED;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a1b1f8..6986ea3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4648,11 +4648,33 @@ static void check_spread(struct cfs_rq *cfs_rq, struct sched_entity *se)
 #endif
 }
 
+static inline bool entity_is_long_sleeper(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq;
+	u64 sleep_time;
+
+	if (se->exec_start == 0)
+		return false;
+
+	cfs_rq = cfs_rq_of(se);
+
+	sleep_time = rq_clock_task(rq_of(cfs_rq));
+
+	/* Happen while migrating because of clock task divergence */
+	if (sleep_time <= se->exec_start)
+		return false;
+
+	sleep_time -= se->exec_start;
+	if (sleep_time > ((1ULL << 63) / scale_load_down(NICE_0_LOAD)))
+		return true;
+
+	return false;
+}
+
 static void
 place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 {
 	u64 vruntime = cfs_rq->min_vruntime;
-	u64 sleep_time;
 
 	/*
 	 * The 'current' period is already promised to the current tasks,
@@ -4684,13 +4706,24 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
 
 	/*
 	 * Pull vruntime of the entity being placed to the base level of
-	 * cfs_rq, to prevent boosting it if placed backwards.  If the entity
-	 * slept for a long time, don't even try to compare its vruntime with
-	 * the base as it may be too far off and the comparison may get
-	 * inversed due to s64 overflow.
-	 */
-	sleep_time = rq_clock_task(rq_of(cfs_rq)) - se->exec_start;
-	if ((s64)sleep_time > 60LL * NSEC_PER_SEC)
+	 * cfs_rq, to prevent boosting it if placed backwards.
+	 * However, min_vruntime can advance much faster than real time, with
+	 * the extreme being when an entity with the minimal weight always runs
+	 * on the cfs_rq. If the waking entity slept for a long time, its
+	 * vruntime difference from min_vruntime may overflow s64 and their
+	 * comparison may get inversed, so ignore the entity's original
+	 * vruntime in that case.
+	 * The maximal vruntime speedup is given by the ratio of normal to
+	 * minimal weight: scale_load_down(NICE_0_LOAD) / MIN_SHARES.
+	 * When placing a migrated waking entity, its exec_start has been set
+	 * from a different rq. In order to take into account a possible
+	 * divergence between new and prev rq's clocks task because of irq and
+	 * stolen time, we take an additional margin.
+	 * So, cutting off on the sleep time of
+	 *     2^63 / scale_load_down(NICE_0_LOAD) ~ 104 days
+	 * should be safe.
+	 */
+	if (entity_is_long_sleeper(se))
 		se->vruntime = vruntime;
 	else
 		se->vruntime = max_vruntime(se->vruntime, vruntime);
@@ -4770,6 +4803,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	if (flags & ENQUEUE_WAKEUP)
 		place_entity(cfs_rq, se, 0);
+	/* Entity has migrated, no longer consider this task hot */
+	if (flags & ENQUEUE_MIGRATED)
+		se->exec_start = 0;
 
 	check_schedstat_required();
 	update_stats_enqueue_fair(cfs_rq, se, flags);
@@ -7657,9 +7693,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
 	/* Tell new CPU we are migrated */
 	se->avg.last_update_time = 0;
 
-	/* We have migrated, no longer consider this task hot */
-	se->exec_start = 0;
-
 	update_scan_period(p, new_cpu);
 }