[PATCH v4] sched/fair: do not scan twice in detach_tasks()

Huang Shijie posted 1 patch 2 months, 2 weeks ago
There is a newer version of this series
detach_tasks() uses struct lb_env.loop_max as an env.src_rq->cfs_tasks
iteration count limit. It is however set without the source RQ lock held,
and besides detach_tasks() can be re-invoked after releasing and
re-acquiring the RQ lock per LBF_NEED_BREAK.

This means that env.loop_max and the actual length of env.src_rq->cfs_tasks
as observed within detach_tasks() can differ. This can cause some tasks to
be unnecessarily iterated over more than once, for instance:

  env.loop_max := 4
  detach_tasks()
    // Here env.src->cfs_tasks only contains two tasks which can't be
    // migrated anywhere, so they're put back in the list each time.
    env.src->cfs_tasks := [p1, p0]
    // The iteration goes:
    p0; cfs_tasks := [p0, p1]
    p1; cfs_tasks := [p1, p0]
    p0; cfs_tasks := [p0, p1]
    p1; cfs_tasks := [p1, p0]

    // IOW we iterate over each task twice

In Specjbb testing, this repeated scan can be observed many times
(over 330,000 times in a 30-minute Specjbb run).

This patch introduces "first_back" to record the first task that is
put back on the task list. If we encounter a task equal to first_back,
break out of the loop to avoid scanning it a second time.

Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
v3 --> v4:
    Reworded the commit message as suggested by Valentin Schneider.
    v3: https://lore.kernel.org/all/20250718063523.9232-1-shijie@os.amperecomputing.com/

v2 --> v3:
    Fix a typo in the commit message.
    v2: https://lore.kernel.org/all/20250718054709.8781-1-shijie@os.amperecomputing.com/

v1 --> v2:
    Added more comments suggested by Valentin Schneider.
    v1: https://lore.kernel.org/all/20250707083636.38380-1-shijie@os.amperecomputing.com/
---
 kernel/sched/fair.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e2963efe800..7cc9d50e3e11 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9443,6 +9443,7 @@ static int detach_tasks(struct lb_env *env)
 {
 	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	unsigned long util, load;
+	struct task_struct *first_back = NULL;
 	struct task_struct *p;
 	int detached = 0;
 
@@ -9481,6 +9482,12 @@ static int detach_tasks(struct lb_env *env)
 		}
 
 		p = list_last_entry(tasks, struct task_struct, se.group_node);
+		/*
+		 * We're back to an already visited task that couldn't be
+		 * detached, we've seen all there is to see.
+		 */
+		if (p == first_back)
+			break;
 
 		if (!can_migrate_task(p, env))
 			goto next;
@@ -9562,6 +9569,8 @@ static int detach_tasks(struct lb_env *env)
 			schedstat_inc(p->stats.nr_failed_migrations_hot);
 
 		list_move(&p->se.group_node, tasks);
+		if (!first_back)
+			first_back = p;
 	}
 
 	/*
-- 
2.40.1
Re: [PATCH v4] sched/fair: do not scan twice in detach_tasks()
Posted by Vincent Guittot 2 months, 2 weeks ago
On Mon, 21 Jul 2025 at 04:40, Huang Shijie
<shijie@os.amperecomputing.com> wrote:
>
> detach_tasks() uses struct lb_env.loop_max as an env.src_rq->cfs_tasks
> iteration count limit. It is however set without the source RQ lock held,
> and besides detach_tasks() can be re-invoked after releasing and
> re-acquiring the RQ lock per LBF_NEED_BREAK.
>
> This means that env.loop_max and the actual length of env.src_rq->cfs_tasks
> as observed within detach_tasks() can differ. This can cause some tasks to

Why not set env.loop_max only once the rq lock is taken in this case?

Side note: by default, loop_max <= loop_break.

> be unnecessarily iterated over more than once, for instance:
>
>   env.loop_max := 4
>   detach_tasks()
>     // Here env.src->cfs_tasks only contains two tasks which can't be
>     // migrated anywhere, so they're put back in the list each time.
>     env.src->cfs_tasks := [p1, p0]
>     // The iteration goes:
>     p0; cfs_tasks := [p0, p1]
>     p1; cfs_tasks := [p1, p0]
>     p0; cfs_tasks := [p0, p1]
>     p1; cfs_tasks := [p1, p0]
>
>     // IOW we iterate over each task twice
>
> In the Specjbb test, the similar issues can be caught many times.
> (Over 330,000 times in a 30-minites Specjbb test)
>
> The patch introduces "first_back" to record the first task which
> is put back to the task list. If we get a task which is equal to
> first_back, we break the loop, and avoid to scan twice for it.
>
> Reviewed-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
> ---
> v3 --> v4:
>     Changed the commit message suggested by Valentin Schneider.
>     v3: https://lore.kernel.org/all/20250718063523.9232-1-shijie@os.amperecomputing.com/
>
> v2 --> v3:
>     Fix a typo in the commit message.
>     v2: https://lore.kernel.org/all/20250718054709.8781-1-shijie@os.amperecomputing.com/
>
> v1 --> v2:
>     Add more comment from Valentin Schneider
>     v1: https://lore.kernel.org/all/20250707083636.38380-1-shijie@os.amperecomputing.com/
> ---
>  kernel/sched/fair.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7e2963efe800..7cc9d50e3e11 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9443,6 +9443,7 @@ static int detach_tasks(struct lb_env *env)
>  {
>         struct list_head *tasks = &env->src_rq->cfs_tasks;
>         unsigned long util, load;
> +       struct task_struct *first_back = NULL;
>         struct task_struct *p;
>         int detached = 0;
>
> @@ -9481,6 +9482,12 @@ static int detach_tasks(struct lb_env *env)
>                 }
>
>                 p = list_last_entry(tasks, struct task_struct, se.group_node);
> +               /*
> +                * We're back to an already visited task that couldn't be
> +                * detached, we've seen all there is to see.
> +                */
> +               if (p == first_back)
> +                       break;
>
>                 if (!can_migrate_task(p, env))
>                         goto next;
> @@ -9562,6 +9569,8 @@ static int detach_tasks(struct lb_env *env)
>                         schedstat_inc(p->stats.nr_failed_migrations_hot);
>
>                 list_move(&p->se.group_node, tasks);
> +               if (!first_back)
> +                       first_back = p;
>         }
>
>         /*
> --
> 2.40.1
>
Re: [PATCH v4] sched/fair: do not scan twice in detach_tasks()
Posted by Shijie Huang 2 months, 2 weeks ago
On 2025/7/21 17:40, Vincent Guittot wrote:
> On Mon, 21 Jul 2025 at 04:40, Huang Shijie
> <shijie@os.amperecomputing.com> wrote:
>> detach_tasks() uses struct lb_env.loop_max as an env.src_rq->cfs_tasks
>> iteration count limit. It is however set without the source RQ lock held,
>> and besides detach_tasks() can be re-invoked after releasing and
>> re-acquiring the RQ lock per LBF_NEED_BREAK.
>>
>> This means that env.loop_max and the actual length of env.src_rq->cfs_tasks
>> as observed within detach_tasks() can differ. This can cause some tasks to
> why not setting env.loop_max only once rq lock is taken in this case ?

Yes, we can do it that way.


Thanks

Huang Shijie


Re: [PATCH v4] sched/fair: do not scan twice in detach_tasks()
Posted by Valentin Schneider 2 months, 2 weeks ago
On 21/07/25 11:40, Vincent Guittot wrote:
> On Mon, 21 Jul 2025 at 04:40, Huang Shijie
> <shijie@os.amperecomputing.com> wrote:
>>
>> detach_tasks() uses struct lb_env.loop_max as an env.src_rq->cfs_tasks
>> iteration count limit. It is however set without the source RQ lock held,
>> and besides detach_tasks() can be re-invoked after releasing and
>> re-acquiring the RQ lock per LBF_NEED_BREAK.
>>
>> This means that env.loop_max and the actual length of env.src_rq->cfs_tasks
>> as observed within detach_tasks() can differ. This can cause some tasks to
>
> why not setting env.loop_max only once rq lock is taken in this case ?
>
> side note : by default loop_max <= loop_break
>

I thought so too and dismissed that due to LBF_NEED_BREAK, but I guess we
could still do something like:

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9b4bbbf0af6f..eef3a0d341661 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11643,6 +11643,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 		.dst_grpmask    = group_balance_mask(sd->groups),
 		.idle		= idle,
 		.loop_break	= SCHED_NR_MIGRATE_BREAK,
+		.loop_max       = UINT_MAX,
 		.cpus		= cpus,
 		.fbq_type	= all,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
@@ -11681,18 +11682,19 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	/* Clear this flag as soon as we find a pullable task */
 	env.flags |= LBF_ALL_PINNED;
 	if (busiest->nr_running > 1) {
+more_balance:
 		/*
 		 * Attempt to move tasks. If sched_balance_find_src_group has found
 		 * an imbalance but busiest->nr_running <= 1, the group is
 		 * still unbalanced. ld_moved simply stays zero, so it is
 		 * correctly treated as an imbalance.
 		 */
-		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
-
-more_balance:
 		rq_lock_irqsave(busiest, &rf);
 		update_rq_clock(busiest);
 
+
+		env.loop_max = min3(env.loop_max, sysctl_sched_nr_migrate, busiest->h_nr_running);
+
 		/*
 		 * cur_ld_moved - load moved in current iteration
 		 * ld_moved     - cumulative load moved across iterations
Re: [PATCH v4] sched/fair: do not scan twice in detach_tasks()
Posted by Vincent Guittot 2 months, 2 weeks ago
On Mon, 21 Jul 2025 at 13:25, Valentin Schneider <vschneid@redhat.com> wrote:
>
> On 21/07/25 11:40, Vincent Guittot wrote:
> > On Mon, 21 Jul 2025 at 04:40, Huang Shijie
> > <shijie@os.amperecomputing.com> wrote:
> >>
> >> detach_tasks() uses struct lb_env.loop_max as an env.src_rq->cfs_tasks
> >> iteration count limit. It is however set without the source RQ lock held,
> >> and besides detach_tasks() can be re-invoked after releasing and
> >> re-acquiring the RQ lock per LBF_NEED_BREAK.
> >>
> >> This means that env.loop_max and the actual length of env.src_rq->cfs_tasks
> >> as observed within detach_tasks() can differ. This can cause some tasks to
> >
> > why not setting env.loop_max only once rq lock is taken in this case ?
> >
> > side note : by default loop_max <= loop_break
> >
>
> I thought so too and dismissed that due to LBF_NEED_BREAK, but I guess we
> could still do something like:
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b9b4bbbf0af6f..eef3a0d341661 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11643,6 +11643,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>                 .dst_grpmask    = group_balance_mask(sd->groups),
>                 .idle           = idle,
>                 .loop_break     = SCHED_NR_MIGRATE_BREAK,
> +               .loop_max       = UINT_MAX,
>                 .cpus           = cpus,
>                 .fbq_type       = all,
>                 .tasks          = LIST_HEAD_INIT(env.tasks),
> @@ -11681,18 +11682,19 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>         /* Clear this flag as soon as we find a pullable task */
>         env.flags |= LBF_ALL_PINNED;
>         if (busiest->nr_running > 1) {
> +more_balance:
>                 /*
>                  * Attempt to move tasks. If sched_balance_find_src_group has found
>                  * an imbalance but busiest->nr_running <= 1, the group is
>                  * still unbalanced. ld_moved simply stays zero, so it is
>                  * correctly treated as an imbalance.
>                  */
> -               env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
> -
> -more_balance:
>                 rq_lock_irqsave(busiest, &rf);
>                 update_rq_clock(busiest);
>
> +
> +               env.loop_max = min3(env.loop_max, sysctl_sched_nr_migrate, busiest->h_nr_running);
> +
>                 /*
>                  * cur_ld_moved - load moved in current iteration
>                  * ld_moved     - cumulative load moved across iterations
>

I would prefer something like below:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1b3879850a9e..636798d53798 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11702,12 +11702,16 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 		 * still unbalanced. ld_moved simply stays zero, so it is
 		 * correctly treated as an imbalance.
 		 */
-		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
 more_balance:
 		rq_lock_irqsave(busiest, &rf);
 		update_rq_clock(busiest);
 
+		if (!env.loop_max)
+			env.loop_max = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_runnable);
+		else
+			env.loop_max = min(env.loop_max, busiest->cfs.h_nr_runnable);
+
 		/*
 		 * cur_ld_moved - load moved in current iteration
 		 * ld_moved     - cumulative load moved across iterations
Re: [PATCH v4] sched/fair: do not scan twice in detach_tasks()
Posted by Vincent Guittot 2 months, 2 weeks ago
On Tue, 22 Jul 2025 at 09:49, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Mon, 21 Jul 2025 at 13:25, Valentin Schneider <vschneid@redhat.com> wrote:
> >
> > On 21/07/25 11:40, Vincent Guittot wrote:
> > > On Mon, 21 Jul 2025 at 04:40, Huang Shijie
> > > <shijie@os.amperecomputing.com> wrote:
> > >>
> > >> detach_tasks() uses struct lb_env.loop_max as an env.src_rq->cfs_tasks
> > >> iteration count limit. It is however set without the source RQ lock held,
> > >> and besides detach_tasks() can be re-invoked after releasing and
> > >> re-acquiring the RQ lock per LBF_NEED_BREAK.
> > >>
> > >> This means that env.loop_max and the actual length of env.src_rq->cfs_tasks
> > >> as observed within detach_tasks() can differ. This can cause some tasks to
> > >
> > > why not setting env.loop_max only once rq lock is taken in this case ?
> > >
> > > side note : by default loop_max <= loop_break
> > >
> >
> > I thought so too and dismissed that due to LBF_NEED_BREAK, but I guess we
> > could still do something like:
> >
> > ---
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b9b4bbbf0af6f..eef3a0d341661 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11643,6 +11643,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> >                 .dst_grpmask    = group_balance_mask(sd->groups),
> >                 .idle           = idle,
> >                 .loop_break     = SCHED_NR_MIGRATE_BREAK,
> > +               .loop_max       = UINT_MAX,
> >                 .cpus           = cpus,
> >                 .fbq_type       = all,
> >                 .tasks          = LIST_HEAD_INIT(env.tasks),
> > @@ -11681,18 +11682,19 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> >         /* Clear this flag as soon as we find a pullable task */
> >         env.flags |= LBF_ALL_PINNED;
> >         if (busiest->nr_running > 1) {
> > +more_balance:
> >                 /*
> >                  * Attempt to move tasks. If sched_balance_find_src_group has found
> >                  * an imbalance but busiest->nr_running <= 1, the group is
> >                  * still unbalanced. ld_moved simply stays zero, so it is
> >                  * correctly treated as an imbalance.
> >                  */
> > -               env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
> > -
> > -more_balance:
> >                 rq_lock_irqsave(busiest, &rf);
> >                 update_rq_clock(busiest);
> >
> > +
> > +               env.loop_max = min3(env.loop_max, sysctl_sched_nr_migrate, busiest->h_nr_running);
> > +
> >                 /*
> >                  * cur_ld_moved - load moved in current iteration
> >                  * ld_moved     - cumulative load moved across iterations
> >
>
> I would prefer something like below:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1b3879850a9e..636798d53798 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11702,12 +11702,16 @@ static int sched_balance_rq(int this_cpu,
> struct rq *this_rq,
>                  * still unbalanced. ld_moved simply stays zero, so it is
>                  * correctly treated as an imbalance.
>                  */
> -               env.loop_max  = min(sysctl_sched_nr_migrate,
> busiest->nr_running);
>
>  more_balance:
>                 rq_lock_irqsave(busiest, &rf);
>                 update_rq_clock(busiest);
>
> +               if (!env.loop_max)
> +                       env.loop_max = min(sysctl_sched_nr_migrate,
> busiest->cfs.h_nr_runnable);

It should be h_nr_queued, as mentioned by Huang; also, my patch has been
mangled by my web browser's line wrapping.

> +               else
> +                       env.loop_max = min(env.loop_max,
> busiest->cfs.h_nr_runnable);
> +
>                 /*
>                  * cur_ld_moved - load moved in current iteration
>                  * ld_moved     - cumulative load moved across iterations
Re: [PATCH v4] sched/fair: do not scan twice in detach_tasks()
Posted by Shijie Huang 2 months, 2 weeks ago
On 2025/7/22 15:53, Vincent Guittot wrote:
>>   more_balance:
>>                  rq_lock_irqsave(busiest, &rf);
>>                  update_rq_clock(busiest);
>>
>> +               if (!env.loop_max)
>> +                       env.loop_max = min(sysctl_sched_nr_migrate,
>> busiest->cfs.h_nr_runnable);
> it should be h_nr_queued as mentioned by Huang and my patch has been
> messed up by my web browser

Okay, I can create a patch later and send it out after it passes the
Specjbb test.


Thanks

Huang Shijie
Re: [PATCH v4] sched/fair: do not scan twice in detach_tasks()
Posted by Valentin Schneider 2 months, 2 weeks ago
On 22/07/25 09:53, Vincent Guittot wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1b3879850a9e..636798d53798 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11702,12 +11702,16 @@ static int sched_balance_rq(int this_cpu,
>> struct rq *this_rq,
>>                  * still unbalanced. ld_moved simply stays zero, so it is
>>                  * correctly treated as an imbalance.
>>                  */
>> -               env.loop_max  = min(sysctl_sched_nr_migrate,
>> busiest->nr_running);
>>
>>  more_balance:
>>                 rq_lock_irqsave(busiest, &rf);
>>                 update_rq_clock(busiest);
>>
>> +               if (!env.loop_max)
>> +                       env.loop_max = min(sysctl_sched_nr_migrate,
>> busiest->cfs.h_nr_runnable);
>
> it should be h_nr_queued as mentioned by Huang and my patch has been
> messed up by my web browser
>

Works for me.
Re: [PATCH v4] sched/fair: do not scan twice in detach_tasks()
Posted by Shijie Huang 2 months, 2 weeks ago
On 2025/7/21 19:25, Valentin Schneider wrote:
> On 21/07/25 11:40, Vincent Guittot wrote:
>> On Mon, 21 Jul 2025 at 04:40, Huang Shijie
>> <shijie@os.amperecomputing.com> wrote:
>>> detach_tasks() uses struct lb_env.loop_max as an env.src_rq->cfs_tasks
>>> iteration count limit. It is however set without the source RQ lock held,
>>> and besides detach_tasks() can be re-invoked after releasing and
>>> re-acquiring the RQ lock per LBF_NEED_BREAK.
>>>
>>> This means that env.loop_max and the actual length of env.src_rq->cfs_tasks
>>> as observed within detach_tasks() can differ. This can cause some tasks to
>> why not setting env.loop_max only once rq lock is taken in this case ?
>>
>> side note : by default loop_max <= loop_break
>>
> I thought so too and dismissed that due to LBF_NEED_BREAK, but I guess we
> could still do something like:
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b9b4bbbf0af6f..eef3a0d341661 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11643,6 +11643,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>   		.dst_grpmask    = group_balance_mask(sd->groups),
>   		.idle		= idle,
>   		.loop_break	= SCHED_NR_MIGRATE_BREAK,
> +		.loop_max       = UINT_MAX,
>   		.cpus		= cpus,
>   		.fbq_type	= all,
>   		.tasks		= LIST_HEAD_INIT(env.tasks),
> @@ -11681,18 +11682,19 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>   	/* Clear this flag as soon as we find a pullable task */
>   	env.flags |= LBF_ALL_PINNED;
>   	if (busiest->nr_running > 1) {
> +more_balance:
>   		/*
>   		 * Attempt to move tasks. If sched_balance_find_src_group has found
>   		 * an imbalance but busiest->nr_running <= 1, the group is
>   		 * still unbalanced. ld_moved simply stays zero, so it is
>   		 * correctly treated as an imbalance.
>   		 */
> -		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
> -
> -more_balance:
>   		rq_lock_irqsave(busiest, &rf);
>   		update_rq_clock(busiest);
>   
> +
> +		env.loop_max = min3(env.loop_max, sysctl_sched_nr_migrate, busiest->h_nr_running);

Should it be busiest->nr_running? Or busiest->cfs.h_nr_queued?


Do you mind if I create a patch based on this one? Or would you rather
create an official patch yourself?


Thanks

Huang Shijie