[PATCH] sched/fair: Do not scan non-movable tasks several times
Posted by Konstantin Khorenko 2 years ago
If the busiest rq is small, nr_running < SCHED_NR_MIGRATE_BREAK, and all
tasks are not movable, detach_tasks() should not iterate more times than
there are tasks available in the busiest rq.

Previously the (env->loop > env->loop_max) condition prevented us from
scanning non-movable tasks more than rq size times, but after we started
checking the LBF_ALL_PINNED flag, the "all tasks are not movable" case
is no longer covered.
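
For context, the relevant part of detach_tasks() after that change looks
roughly like this (simplified excerpt, most of the loop body omitted):

	struct task_struct *p;

	while (!list_empty(tasks)) {
		env->loop++;
		/*
		 * We've more or less seen every task there is, call it
		 * quits - unless no movable task has been found yet,
		 * i.e. LBF_ALL_PINNED is still set.
		 */
		if (env->loop > env->loop_max &&
		    !(env->flags & LBF_ALL_PINNED))
			break;

		/* take a breather every nr_migrate tasks */
		if (env->loop > env->loop_break) {
			env->loop_break += SCHED_NR_MIGRATE_BREAK;
			env->flags |= LBF_NEED_BREAK;
			break;
		}

		p = list_last_entry(tasks, struct task_struct, se.group_node);
		if (!can_migrate_task(p, env))
			goto next;

		/* ... detach the task ... */
next:
		/* a pinned task is rotated back, so the list never empties */
		list_move(&p->se.group_node, tasks);
	}

So with e.g. nr_running = 3 and all three tasks pinned, LBF_ALL_PINNED
stays set, the loop_max check never fires and the same three tasks keep
being rescanned until env->loop exceeds loop_break =
SCHED_NR_MIGRATE_BREAK (32 by default), i.e. about ten times each
instead of once.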

Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
---
 kernel/sched/fair.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..faa2a765e899 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11219,7 +11219,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.dst_rq		= this_rq,
 		.dst_grpmask    = group_balance_mask(sd->groups),
 		.idle		= idle,
-		.loop_break	= SCHED_NR_MIGRATE_BREAK,
 		.cpus		= cpus,
 		.fbq_type	= all,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
@@ -11265,6 +11264,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		 * correctly treated as an imbalance.
 		 */
 		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
+		/*
+		 * If busiest rq is small, nr_running < SCHED_NR_MIGRATE_BREAK
+		 * and all tasks are not movable, detach_tasks() should not
+		 * iterate more times than there are tasks in the rq.
+		 */
+		env.loop_break = min(SCHED_NR_MIGRATE_BREAK, busiest->nr_running);
 
 more_balance:
 		rq_lock_irqsave(busiest, &rf);
-- 
2.39.3
Re: [PATCH] sched/fair: Do not scan non-movable tasks several times
Posted by Vincent Guittot 2 years ago
On Thu, 14 Dec 2023 at 15:42, Konstantin Khorenko
<khorenko@virtuozzo.com> wrote:
>
> If the busiest rq is small, nr_running < SCHED_NR_MIGRATE_BREAK, and all
> tasks are not movable, detach_tasks() should not iterate more times than
> there are tasks available in the busiest rq.
>
> Previously the (env->loop > env->loop_max) condition prevented us from

It's usually better to name the commit directly when we know it:
Before commit b0defa7ae03e ("sched/fair: Make sure to try to detach
at least one movable task"),
the (env->loop > env->loop_max) condition prevented us from ...

> scanning non-movable tasks more than rq size times, but after we started
> checking the LBF_ALL_PINNED flag, the "all tasks are not movable" case
> is no longer covered.
>

Fixes: b0defa7ae03e ("sched/fair: Make sure to try to detach at least
one movable task")

> Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
> ---
>  kernel/sched/fair.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d7a3c63a2171..faa2a765e899 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11219,7 +11219,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>                 .dst_rq         = this_rq,
>                 .dst_grpmask    = group_balance_mask(sd->groups),
>                 .idle           = idle,
> -               .loop_break     = SCHED_NR_MIGRATE_BREAK,
>                 .cpus           = cpus,
>                 .fbq_type       = all,
>                 .tasks          = LIST_HEAD_INIT(env.tasks),
> @@ -11265,6 +11264,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>                  * correctly treated as an imbalance.
>                  */
>                 env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
> +               /*
> +               /*
> +                * If busiest rq is small, nr_running < SCHED_NR_MIGRATE_BREAK
> +                * and all tasks are not movable, detach_tasks() should not
> +                * iterate more times than there are tasks in the rq.
> +                */
> +               env.loop_break = min(SCHED_NR_MIGRATE_BREAK, busiest->nr_running);

Should it be after more_balance:?
In case we go back to "more_balance:" with a new_dst_cpu and it turns
out that there is no movable task left, because we released the lock of
the busiest rq in the meantime?

Also you can remove one more superfluous init of loop_break:

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11361,7 +11361,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
                         */
                        if (!cpumask_subset(cpus, env.dst_grpmask)) {
                                env.loop = 0;
-                               env.loop_break = SCHED_NR_MIGRATE_BREAK;
                                goto redo;
                        }
                        goto out_all_pinned;

>
>  more_balance:
>                 rq_lock_irqsave(busiest, &rf);
> --
> 2.39.3
>
Re: [PATCH] sched/fair: Do not scan non-movable tasks several times
Posted by Konstantin Khorenko 2 years ago
On 18.12.2023 17:48, Vincent Guittot wrote:
> On Thu, 14 Dec 2023 at 15:42, Konstantin Khorenko
> <khorenko@virtuozzo.com> wrote:
>>
>> If the busiest rq is small, nr_running < SCHED_NR_MIGRATE_BREAK, and all
>> tasks are not movable, detach_tasks() should not iterate more times than
>> there are tasks available in the busiest rq.
>>
>> Previously the (env->loop > env->loop_max) condition prevented us from
> 
> It's usually better to name the commit directly when we know it:
> Before commit b0defa7ae03e ("sched/fair: Make sure to try to detach
> at least one movable task"),
> the (env->loop > env->loop_max) condition prevented us from ...

You are definitely right, added.

>> scanning non-movable tasks more than rq size times, but after we started
>> checking the LBF_ALL_PINNED flag, the "all tasks are not movable" case
>> is no longer covered.
>>
> 
> Fixes: b0defa7ae03e ("sched/fair: Make sure to try to detach at least
> one movable task")

Added this too, thank you.

>> Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
>> ---
>>   kernel/sched/fair.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d7a3c63a2171..faa2a765e899 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11219,7 +11219,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>                  .dst_rq         = this_rq,
>>                  .dst_grpmask    = group_balance_mask(sd->groups),
>>                  .idle           = idle,
>> -               .loop_break     = SCHED_NR_MIGRATE_BREAK,
>>                  .cpus           = cpus,
>>                  .fbq_type       = all,
>>                  .tasks          = LIST_HEAD_INIT(env.tasks),
>> @@ -11265,6 +11264,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>                   * correctly treated as an imbalance.
>>                   */
>>                  env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
>> +               /*
>> +               /*
>> +                * If busiest rq is small, nr_running < SCHED_NR_MIGRATE_BREAK
>> +                * and all tasks are not movable, detach_tasks() should not
>> +                * iterate more times than there are tasks in the rq.
>> +                */
>> +               env.loop_break = min(SCHED_NR_MIGRATE_BREAK, busiest->nr_running);
> 
> Should it be after more_balance:?
> In case we go back to "more_balance:" with a new_dst_cpu and it turns
> out that there is no movable task left, because we released the lock of
> the busiest rq in the meantime?

Well, both yes and no here.
If we simply move the env.loop_break assignment after the "more_balance"
label, we will break the handling of the LBF_NEED_BREAK case.

But you are right, the new_dst_cpu case should not just reset loop_break
to SCHED_NR_MIGRATE_BREAK either, but do the same min() calculation.

So I've added one more label before the env.loop_break = min(...)
assignment and used it in the new_dst_cpu case.
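
The resulting flow is roughly the following (simplified sketch, matching
the v2 patch below):

	env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);

more_balance_reset_break:
	/*
	 * Recompute loop_break here: busiest->nr_running may have
	 * changed while the rq lock was dropped.
	 */
	env.loop_break = min(SCHED_NR_MIGRATE_BREAK, busiest->nr_running);

more_balance:
	rq_lock_irqsave(busiest, &rf);
	/* ... detach_tasks() / attach_tasks() ... */

	if (env.flags & LBF_NEED_BREAK) {
		env.flags &= ~LBF_NEED_BREAK;
		/*
		 * Resume the interrupted scan and keep the loop_break
		 * value detach_tasks() has already advanced, so do not
		 * go through the reset point.
		 */
		goto more_balance;
	}

	if ((env.flags & LBF_DST_PINNED) && env.imbalance > 0) {
		env.dst_cpu	 = env.new_dst_cpu;
		env.flags	&= ~LBF_DST_PINNED;
		env.loop	 = 0;
		/* retrying with a new dst_cpu: reset loop_break as well */
		goto more_balance_reset_break;
	}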

> Also you can remove one more superfluous init of loop_break:

Yep, dropped it, thank you.

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11361,7 +11361,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>                           */
>                          if (!cpumask_subset(cpus, env.dst_grpmask)) {
>                                  env.loop = 0;
> -                               env.loop_break = SCHED_NR_MIGRATE_BREAK;
>                                  goto redo;
>                          }
>                          goto out_all_pinned;
> 
>>
>>   more_balance:
>>                  rq_lock_irqsave(busiest, &rf);
>> --
>> 2.39.3
>>
[PATCH v2] sched/fair: Do not scan non-movable tasks several times
Posted by Konstantin Khorenko 2 years ago
If the busiest rq is small, nr_running < SCHED_NR_MIGRATE_BREAK, and all
tasks are not movable, detach_tasks() should not iterate more times than
there are tasks available in the busiest rq.

Before commit b0defa7ae03e ("sched/fair: Make sure to try to detach at
least one movable task"), the (env->loop > env->loop_max) condition
prevented us from scanning non-movable tasks more than rq size times,
but after we started checking the LBF_ALL_PINNED flag, the "all tasks
are not movable" case is no longer covered.

Note: in case all the tasks in the rq could not be moved in
detach_tasks(), we always increase loop_break by SCHED_NR_MIGRATE_BREAK,
so we can step over loop_max, but I think this is a rare case and not
worth adding an extra check against rq->nr_running here.
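
For illustration, with the default SCHED_NR_MIGRATE_BREAK of 32, a rq of
nr_running = 3 fully pinned tasks starts with loop_break = min(32, 3) = 3;
once env->loop reaches 4, detach_tasks() bumps loop_break to 3 + 32 = 35,
well above loop_max = 3.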

Fixes: b0defa7ae03e ("sched/fair: Make sure to try to detach at least one movable task")

Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
---
Changes:
v1->v2:
 * added the exact commit id that caused the inefficiency + Fixes: tag
 * dropped a couple of extra redundant env.loop_break assignments in
   load_balance()

 kernel/sched/fair.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d7a3c63a2171..bd69c33fe9b4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11219,7 +11219,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.dst_rq		= this_rq,
 		.dst_grpmask    = group_balance_mask(sd->groups),
 		.idle		= idle,
-		.loop_break	= SCHED_NR_MIGRATE_BREAK,
 		.cpus		= cpus,
 		.fbq_type	= all,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
@@ -11266,6 +11265,14 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		 */
 		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
+more_balance_reset_break:
+		/*
+		 * If busiest rq is small, nr_running < SCHED_NR_MIGRATE_BREAK
+		 * and all tasks are not movable, detach_tasks() should not
+		 * iterate more times than there are tasks in the rq.
+		 */
+		env.loop_break = min(SCHED_NR_MIGRATE_BREAK, busiest->nr_running);
+
 more_balance:
 		rq_lock_irqsave(busiest, &rf);
 		update_rq_clock(busiest);
@@ -11328,13 +11335,12 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			env.dst_cpu	 = env.new_dst_cpu;
 			env.flags	&= ~LBF_DST_PINNED;
 			env.loop	 = 0;
-			env.loop_break	 = SCHED_NR_MIGRATE_BREAK;
 
 			/*
 			 * Go back to "more_balance" rather than "redo" since we
 			 * need to continue with same src_cpu.
 			 */
-			goto more_balance;
+			goto more_balance_reset_break;
 		}
 
 		/*
@@ -11360,7 +11366,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 			 */
 			if (!cpumask_subset(cpus, env.dst_grpmask)) {
 				env.loop = 0;
-				env.loop_break = SCHED_NR_MIGRATE_BREAK;
 				goto redo;
 			}
 			goto out_all_pinned;
-- 
2.39.3