During load balance, we try at most env->loop_max times to move a task.
But it can happen that the loop_max LRU tasks (i.e. the tail of the
cfs_tasks list) can't be moved to dst_cpu because of affinity.
In this case, loop in the list until we find at least one.
The maximum number of detached tasks remains the same as before.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da388657d5ac..02b7b808e186 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8052,8 +8052,12 @@ static int detach_tasks(struct lb_env *env)
p = list_last_entry(tasks, struct task_struct, se.group_node);
env->loop++;
- /* We've more or less seen every task there is, call it quits */
- if (env->loop > env->loop_max)
+ /*
+ * We've more or less seen every task there is, call it quits
+ * unless we haven't found any movable task yet.
+ */
+ if (env->loop > env->loop_max &&
+ !(env->flags & LBF_ALL_PINNED))
break;
/* take a breather every nr_migrate tasks */
@@ -10182,7 +10186,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
if (env.flags & LBF_NEED_BREAK) {
env.flags &= ~LBF_NEED_BREAK;
- goto more_balance;
+ /* Stop if we tried all running tasks */
+ if (env.loop < busiest->nr_running)
+ goto more_balance;
}
/*
--
2.17.1
Hi Vincent,
On Thu, Aug 25, 2022 at 5:27 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> During load balance, we try at most env->loop_max time to move a task.
> But it can happen that the loop_max LRU tasks (ie tail of
> the cfs_tasks list) can't be moved to dst_cpu because of affinity.
> In this case, loop in the list until we found at least one.
We had a user recently trigger a hard lockup which we believe is due
to this patch. The user in question had O(10k) threads affinitized to
a cpu; seems like the process had an out of control thread spawning
issue, and was in the middle of getting killed. However, that was
being slowed down due to the fact that load balance was iterating all
these threads and bouncing the rq lock (and making no progress due to
ALL_PINNED). Before this patch, load balance would quit after hitting
loop_max.
Even ignoring that specific instance, it seems pretty easy for this
patch to cause a softlockup due to a buggy or malicious process.
For the tradeoff you were trying to make in this patch (spend more
time searching in the hopes that there's something migratable further
in the list), perhaps it would be better to adjust
sysctl.sched_nr_migrate instead of baking this into the kernel?
Best,
Josh
>
> The maximum of detached tasks remained the same as before.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da388657d5ac..02b7b808e186 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8052,8 +8052,12 @@ static int detach_tasks(struct lb_env *env)
> p = list_last_entry(tasks, struct task_struct, se.group_node);
>
> env->loop++;
> - /* We've more or less seen every task there is, call it quits */
> - if (env->loop > env->loop_max)
> + /*
> + * We've more or less seen every task there is, call it quits
> + * unless we haven't found any movable task yet.
> + */
> + if (env->loop > env->loop_max &&
> + !(env->flags & LBF_ALL_PINNED))
> break;
>
> /* take a breather every nr_migrate tasks */
> @@ -10182,7 +10186,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>
> if (env.flags & LBF_NEED_BREAK) {
> env.flags &= ~LBF_NEED_BREAK;
> - goto more_balance;
> + /* Stop if we tried all running tasks */
> + if (env.loop < busiest->nr_running)
> + goto more_balance;
> }
>
> /*
> --
> 2.17.1
>
Hi Josh,
Sorry for the late reply.
On Mon, 12 Feb 2024 at 21:29, Josh Don <joshdon@google.com> wrote:
>
> Hi Vincent,
>
> On Thu, Aug 25, 2022 at 5:27 AM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > During load balance, we try at most env->loop_max time to move a task.
> > But it can happen that the loop_max LRU tasks (ie tail of
> > the cfs_tasks list) can't be moved to dst_cpu because of affinity.
> > In this case, loop in the list until we found at least one.
>
> We had a user recently trigger a hard lockup which we believe is due
> to this patch. The user in question had O(10k) threads affinitized to
> a cpu; seems like the process had an out of control thread spawning
> issue, and was in the middle of getting killed. However, that was
> being slowed down due to the fact that load balance was iterating all
Does it mean that it was progressing but not as fast as you would like?
> these threads and bouncing the rq lock (and making no progress due to
> ALL_PINNED). Before this patch, load balance would quit after hitting
> loop_max.
>
> Even ignoring that specific instance, it seems pretty easy for this
> patch to cause a softlockup due to a buggy or malicious process.
The fact that the rq lock is released regularly should prevent a
softlockup. And we could even speed up can_migrate_task(), which does a
lot of useless work for tasks affined to a single cpu:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8270e2e15cb..15bc1067c69d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8920,6 +8920,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
lockdep_assert_rq_held(env->src_rq);
+ if (p->nr_cpus_allowed == 1)
+ return 0;
/*
* We do not migrate tasks that are:
* 1) throttled_lb_pair, or
>
> For the tradeoff you were trying to make in this patch (spend more
> time searching in the hopes that there's something migratable further
> in the list), perhaps it would be better to adjust
> sysctl.sched_nr_migrate instead of baking this into the kernel?
That could be a solution, but it increases the number of iterations
for all cases, including those which are more time consuming to sort
out, and it also increases the number of tasks that you will migrate
in one load balance. The latter is what consumes time.
Vincent
>
> Best,
> Josh
>
> >
> > The maximum of detached tasks remained the same as before.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > kernel/sched/fair.c | 12 +++++++++---
> > 1 file changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index da388657d5ac..02b7b808e186 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8052,8 +8052,12 @@ static int detach_tasks(struct lb_env *env)
> > p = list_last_entry(tasks, struct task_struct, se.group_node);
> >
> > env->loop++;
> > - /* We've more or less seen every task there is, call it quits */
> > - if (env->loop > env->loop_max)
> > + /*
> > + * We've more or less seen every task there is, call it quits
> > + * unless we haven't found any movable task yet.
> > + */
> > + if (env->loop > env->loop_max &&
> > + !(env->flags & LBF_ALL_PINNED))
> > break;
> >
> > /* take a breather every nr_migrate tasks */
> > @@ -10182,7 +10186,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> >
> > if (env.flags & LBF_NEED_BREAK) {
> > env.flags &= ~LBF_NEED_BREAK;
> > - goto more_balance;
> > + /* Stop if we tried all running tasks */
> > + if (env.loop < busiest->nr_running)
> > + goto more_balance;
> > }
> >
> > /*
> > --
> > 2.17.1
> >
On Wed, Mar 20, 2024 at 9:58 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> Hi Josh,
>
> Sorry for the late reply.

No worries, responses to your comments inline below.

> > We had a user recently trigger a hard lockup which we believe is due
> > to this patch. The user in question had O(10k) threads affinitized to
> > a cpu; seems like the process had an out of control thread spawning
> > issue, and was in the middle of getting killed. However, that was
> > being slowed down due to the fact that load balance was iterating all
>
> Does it mean that it was progressing but not as fast as you would like?

It was a hard lockup, so it's more than just "not fast enough".
Indeed it was progressing, but not at a rate sufficient to avoid
lockup.

> > these threads and bouncing the rq lock (and making no progress due to
> > ALL_PINNED). Before this patch, load balance would quit after hitting
> > loop_max.
> >
> > Even ignoring that specific instance, it seems pretty easy for this
> > patch to cause a softlockup due to a buggy or malicious process.
>
> The fact that the rq lock is released regularly should prevent a
> softlockup.

That doesn't prevent a softlockup; the kernel is stuck looping over a
long list of tasks for too long, regardless of whether it is releasing
and re-acquiring the rq locks.

Note also that load balance can come from softirq in a context where
we have IRQ disabled, which can lead to hard lockup as well.

> And we could even speed up can_migrate_task(), which does a lot of
> useless work for tasks affined to a single cpu.

That seems like a useful optimization, but not really relevant? It
doesn't matter how small we make the constant factor, we still have an
O(n) operation in kernel mode here.

> > For the tradeoff you were trying to make in this patch (spend more
> > time searching in the hopes that there's something migratable further
> > in the list), perhaps it would be better to adjust
> > sysctl.sched_nr_migrate instead of baking this into the kernel?
>
> That could be a solution, but it increases the number of iterations
> for all cases, including those which are more time consuming to sort
> out, and it also increases the number of tasks that you will migrate
> in one load balance. The latter is what consumes time.

Is it really that bad? loop_max will be unchanged for most cases since
it gets min'd with nr_running anyway. And, even if loop_max ends up
larger in some other instances, we still terminate the iteration after
fixing up env->imbalance (granted, we'll migrate more tasks to achieve
a better balance with a larger loop_max, which I think is your point).

Another idea then: what about separating the number of tasks we can
move from the number of tasks we can search? You effectively want to
keep the number of tasks that can be migrated small (nr_migrate), but
be able to search deeper in the list for things to pull (a larger
search_depth).

- Josh
On Thu, 21 Mar 2024 at 21:25, Josh Don <joshdon@google.com> wrote:
>
> On Wed, Mar 20, 2024 at 9:58 AM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > Hi Josh,
> >
> > Sorry for the late reply.
>
> No worries, responses to your comments inline below.
>
> > > We had a user recently trigger a hard lockup which we believe is due
> > > to this patch. The user in question had O(10k) threads affinitized to
> > > a cpu; seems like the process had an out of control thread spawning
> > > issue, and was in the middle of getting killed. However, that was
> > > being slowed down due to the fact that load balance was iterating all
> >
> > Does it mean that it was progressing but not as fast as you would like?
>
> It was a hard lockup, so it's more than just "not fast enough".
> Indeed it was progressing, but not at a rate sufficient to avoid
> lockup.
>
> > > these threads and bouncing the rq lock (and making no progress due to
> > > ALL_PINNED). Before this patch, load balance would quit after hitting
> > > loop_max.
> > >
> > > Even ignoring that specific instance, it seems pretty easy for this
> > > patch to cause a softlockup due to a buggy or malicious process.
> >
> > The fact that the rq lock is released regularly should prevent a
> > softlockup.
>
> That doesn't prevent a softlockup; the kernel is stuck looping over a
> long list of tasks for too long, regardless of whether it is releasing
> and re-acquiring the rq locks.
>
> Note also that load balance can come from softirq in a context where
> we have IRQ disabled, which can lead to hard lockup as well.

Fair enough.

> > And we could even speed up can_migrate_task(), which does a lot of
> > useless work for tasks affined to a single cpu.
>
> That seems like a useful optimization, but not really relevant? It
> doesn't matter how small we make the constant factor, we still have an
> O(n) operation in kernel mode here.

> > > For the tradeoff you were trying to make in this patch (spend more
> > > time searching in the hopes that there's something migratable further
> > > in the list), perhaps it would be better to adjust
> > > sysctl.sched_nr_migrate instead of baking this into the kernel?
> >
> > That could be a solution, but it increases the number of iterations
> > for all cases, including those which are more time consuming to sort
> > out, and it also increases the number of tasks that you will migrate
> > in one load balance. The latter is what consumes time.
>
> Is it really that bad? loop_max will be unchanged for most cases since
> it gets min'd with nr_running anyway. And, even if loop_max ends up
> larger in some other instances, we still terminate the iteration after
> fixing up env->imbalance (granted, we'll migrate more tasks to achieve
> a better balance with a larger loop_max, which I think is your point).

Yes, my point is that the load of a task can be quite small, especially
with cgroups, so we can end up detaching/attaching a very large number
of tasks, which is far more time consuming than checking whether we can
migrate each one.

> Another idea then: what about separating the number of tasks we can
> move from the number of tasks we can search? You effectively want to
> keep the number of tasks that can be migrated small (nr_migrate), but
> be able to search deeper in the list for things to pull (a larger
> search_depth).

That could be a solution indeed.

> - Josh
> > Another idea then: what about separating the number of tasks we can
> > move from the number of tasks we can search? You effectively want to
> > keep the number of tasks that can be migrated small (nr_migrate), but
> > be able to search deeper in the list for things to pull (a larger
> > search_depth).
>
> That could be a solution indeed

Cool, I'll try spinning up a patch for that then.

Best,
Josh
On 25/08/2022 14:27, Vincent Guittot wrote:
s/sched/fair: make/sched/fair: Make
> During load balance, we try at most env->loop_max time to move a task.
> But it can happen that the loop_max LRU tasks (ie tail of
> the cfs_tasks list) can't be moved to dst_cpu because of affinity.
> In this case, loop in the list until we found at least one.
>
> The maximum of detached tasks remained the same as before.
Not sure how this relates to the patch? Isn't this given by the
`env->imbalance <= 0` check at the end of detach_tasks()?
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> kernel/sched/fair.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da388657d5ac..02b7b808e186 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8052,8 +8052,12 @@ static int detach_tasks(struct lb_env *env)
> p = list_last_entry(tasks, struct task_struct, se.group_node);
>
> env->loop++;
> - /* We've more or less seen every task there is, call it quits */
> - if (env->loop > env->loop_max)
> + /*
> + * We've more or less seen every task there is, call it quits
I never understood this `more or less`. Either we have seen all tasks or
not?
> + * unless we haven't found any movable task yet.
> + */
> + if (env->loop > env->loop_max &&
> + !(env->flags & LBF_ALL_PINNED))
> break;
>
> /* take a breather every nr_migrate tasks */
> @@ -10182,7 +10186,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>
> if (env.flags & LBF_NEED_BREAK) {
> env.flags &= ~LBF_NEED_BREAK;
> - goto more_balance;
> + /* Stop if we tried all running tasks */
Would say s/running/runnable but I see that we do use running/runnable
interchangeably.
> + if (env.loop < busiest->nr_running)
> + goto more_balance;
> }
>
> /*
IMHO, there will be some interaction with the `All tasks on this
runqueue were pinned by CPU affinity` check at the end of load_balance()?
On Mon, 12 Sept 2022 at 10:44, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
>
> On 25/08/2022 14:27, Vincent Guittot wrote:
>
> s/sched/fair: make/sched/fair: Make
>
> > During load balance, we try at most env->loop_max time to move a task.
> > But it can happen that the loop_max LRU tasks (ie tail of
> > the cfs_tasks list) can't be moved to dst_cpu because of affinity.
> > In this case, loop in the list until we found at least one.
> >
> > The maximum of detached tasks remained the same as before.
>
> Not sure how this relates to the patch? Isn't this given by the
> `env->imbalance <= 0` check at the end of detach_tasks()?
The number of detached tasks can't be higher than loop_max in
detach_tasks(), and it remains the same with this patch as we will
continue to loop only if we haven't found a task that can be moved to
the cpu.
>
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> > kernel/sched/fair.c | 12 +++++++++---
> > 1 file changed, 9 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index da388657d5ac..02b7b808e186 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8052,8 +8052,12 @@ static int detach_tasks(struct lb_env *env)
> > p = list_last_entry(tasks, struct task_struct, se.group_node);
> >
> > env->loop++;
> > - /* We've more or less seen every task there is, call it quits */
> > - if (env->loop > env->loop_max)
> > + /*
> > + * We've more or less seen every task there is, call it quits
>
> I never understood this `more or less`. Either we have seen all tasks or
> not?
>
> > + * unless we haven't found any movable task yet.
> > + */
> > + if (env->loop > env->loop_max &&
> > + !(env->flags & LBF_ALL_PINNED))
> > break;
> >
> > /* take a breather every nr_migrate tasks */
> > @@ -10182,7 +10186,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> >
> > if (env.flags & LBF_NEED_BREAK) {
> > env.flags &= ~LBF_NEED_BREAK;
> > - goto more_balance;
> > + /* Stop if we tried all running tasks */
>
> Would say s/running/runnable but I see that we do use running/runnable
> interchangeably.
>
> > + if (env.loop < busiest->nr_running)
> > + goto more_balance;
> > }
> >
> > /*
>
> IMHO, there will be some interaction with the `All tasks on this
> runqueue were pinned by CPU affinity` check at the end of load_balance()?
The following commit has been merged into the sched/core branch of tip:
Commit-ID: b0defa7ae03ecf91b8bfd10ede430cff12fcbd06
Gitweb: https://git.kernel.org/tip/b0defa7ae03ecf91b8bfd10ede430cff12fcbd06
Author: Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Thu, 25 Aug 2022 14:27:23 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 15 Sep 2022 16:13:51 +02:00
sched/fair: Make sure to try to detach at least one movable task
During load balance, we try at most env->loop_max times to move a task.
But it can happen that the loop_max LRU tasks (i.e. the tail of the
cfs_tasks list) can't be moved to dst_cpu because of affinity.
In this case, loop in the list until we find at least one.
The maximum number of detached tasks remains the same as before.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220825122726.20819-2-vincent.guittot@linaro.org
---
kernel/sched/fair.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e5b171..dae3bfa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8049,8 +8049,12 @@ static int detach_tasks(struct lb_env *env)
p = list_last_entry(tasks, struct task_struct, se.group_node);
env->loop++;
- /* We've more or less seen every task there is, call it quits */
- if (env->loop > env->loop_max)
+ /*
+ * We've more or less seen every task there is, call it quits
+ * unless we haven't found any movable task yet.
+ */
+ if (env->loop > env->loop_max &&
+ !(env->flags & LBF_ALL_PINNED))
break;
/* take a breather every nr_migrate tasks */
@@ -10179,7 +10183,9 @@ more_balance:
if (env.flags & LBF_NEED_BREAK) {
env.flags &= ~LBF_NEED_BREAK;
- goto more_balance;
+ /* Stop if we tried all running tasks */
+ if (env.loop < busiest->nr_running)
+ goto more_balance;
}
/*