[PATCH] sched/fair: do not scan twice in detach_tasks()

Posted by Huang Shijie 3 months ago
When detach_tasks() scans the src_cpu's task list, the task list
may shrink during the scan. For example, the task list may hold
four tasks at the beginning but shrink to two while detach_tasks()
is scanning:
    Task list at beginning : "ABCD"
    Task list during scan  : "CD"

    ("ABCD" stands for different tasks.)

In this scenario env->loop_max is still four, so detach_tasks()
may scan some tasks twice:
    the scanning order may be : "DCDC"

This patch introduces "first_back" to record the first task that
is put back onto the task list. If we reach a task equal to
first_back, we break out of the loop and avoid scanning it twice.
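
For reference, the scan loop in detach_tasks() looks roughly like this
(heavily condensed, most bail-out paths omitted):

	while (!list_empty(tasks)) {
		env->loop++;
		/* loop_max was computed before the scan started and
		 * may now exceed the length of the shrunken list */
		if (env->loop > env->loop_max)
			break;

		p = list_last_entry(tasks, struct task_struct, se.group_node);
		...
next:
		/* a task we could not detach is rotated back to the
		 * head; once every task has been rotated, the scan
		 * starts revisiting them from the tail */
		list_move(&p->se.group_node, tasks);
	}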

Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
 kernel/sched/fair.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e2963efe800..0e9c8ae68cc2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9443,6 +9443,7 @@ static int detach_tasks(struct lb_env *env)
 {
 	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	unsigned long util, load;
+	struct task_struct *first_back = NULL;
 	struct task_struct *p;
 	int detached = 0;
 
@@ -9481,6 +9482,8 @@ static int detach_tasks(struct lb_env *env)
 		}
 
 		p = list_last_entry(tasks, struct task_struct, se.group_node);
+		if (p == first_back)
+			break;
 
 		if (!can_migrate_task(p, env))
 			goto next;
@@ -9562,6 +9565,8 @@ static int detach_tasks(struct lb_env *env)
 			schedstat_inc(p->stats.nr_failed_migrations_hot);
 
 		list_move(&p->se.group_node, tasks);
+		if (!first_back)
+			first_back = p;
 	}
 
 	/*
-- 
2.40.1
Re: [PATCH] sched/fair: do not scan twice in detach_tasks()
Posted by Valentin Schneider 2 months, 3 weeks ago
On 07/07/25 16:36, Huang Shijie wrote:
> When detach_tasks() scans the src_cpu's task list, the task list
> may shrink during the scan. For example, the task list may hold
> four tasks at the beginning but shrink to two while detach_tasks()
> is scanning:
>     Task list at beginning : "ABCD"
>     Task list during scan  : "CD"
>
>     ("ABCD" stands for different tasks.)
>
> In this scenario env->loop_max is still four, so detach_tasks()
> may scan some tasks twice:
>     the scanning order may be : "DCDC"
>


Huh, a quick hacky test suggests this isn't /too/ hard to trigger; I get
about one occurrence every two default hackbench runs (~200ms) on my dummy
QEMU setup.

Have you seen this happen on your workloads or did you find this while
staring at the code?

> This patch introduces "first_back" to record the first task that
> is put back onto the task list. If we reach a task equal to
> first_back, we break out of the loop and avoid scanning it twice.
>

Potentially more than twice, no?

> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
> ---
>  kernel/sched/fair.c | 5 +++++
>  1 file changed, 5 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7e2963efe800..0e9c8ae68cc2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9443,6 +9443,7 @@ static int detach_tasks(struct lb_env *env)
>  {
>       struct list_head *tasks = &env->src_rq->cfs_tasks;
>       unsigned long util, load;
> +	struct task_struct *first_back = NULL;
>       struct task_struct *p;
>       int detached = 0;
>
> @@ -9481,6 +9482,8 @@ static int detach_tasks(struct lb_env *env)
>               }
>
>               p = list_last_entry(tasks, struct task_struct, se.group_node);

Small comment nit:

                /*
                 * We're back to an already visited task that couldn't be
                 * detached, we've seen all there is to see.
                 */

> +		if (p == first_back)
> +			break;
>
>               if (!can_migrate_task(p, env))
>                       goto next;
> @@ -9562,6 +9565,8 @@ static int detach_tasks(struct lb_env *env)
>                       schedstat_inc(p->stats.nr_failed_migrations_hot);
>
>               list_move(&p->se.group_node, tasks);
> +		if (!first_back)
> +			first_back = p;
>       }
>
>       /*
> --
> 2.40.1
Re: [PATCH] sched/fair: do not scan twice in detach_tasks()
Posted by Shijie Huang 2 months, 3 weeks ago
On 2025/7/16 1:04, Valentin Schneider wrote:
> On 07/07/25 16:36, Huang Shijie wrote:
>> When detach_tasks() scans the src_cpu's task list, the task list
>> may shrink during the scan. For example, the task list may hold
>> four tasks at the beginning but shrink to two while detach_tasks()
>> is scanning:
>>      Task list at beginning : "ABCD"
>>      Task list during scan  : "CD"
>>
>>      ("ABCD" stands for different tasks.)
>>
>> In this scenario env->loop_max is still four, so detach_tasks()
>> may scan some tasks twice:
>>      the scanning order may be : "DCDC"
>>
>
> Huh, a quick hacky test suggests this isn't /too/ hard to trigger; I get
> about one occurrence every two default hackbench runs (~200ms) on my dummy
> QEMU setup.
>
> Have you seen this happen on your workloads or did you find this while
> staring at the code?

I found this issue in my Specjbb2015 test. It is very easy to trigger.

I noticed it many times in the test.

I even saw the following:

        At the beginning: env->loop_max is 12.

        When detach_tasks() scans: the real task list has shrunk to 5.


>> This patch introduces "first_back" to record the first task that
>> is put back onto the task list. If we reach a task equal to
>> first_back, we break out of the loop and avoid scanning it twice.
>>
> Potentially more than twice, no?

Yes, it may be more than twice in theory.



>
>> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
>> ---
>>   kernel/sched/fair.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7e2963efe800..0e9c8ae68cc2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9443,6 +9443,7 @@ static int detach_tasks(struct lb_env *env)
>>   {
>>        struct list_head *tasks = &env->src_rq->cfs_tasks;
>>        unsigned long util, load;
>> +	struct task_struct *first_back = NULL;
>>        struct task_struct *p;
>>        int detached = 0;
>>
>> @@ -9481,6 +9482,8 @@ static int detach_tasks(struct lb_env *env)
>>                }
>>
>>                p = list_last_entry(tasks, struct task_struct, se.group_node);
> Small comment nit:
>
>                  /*
>                   * We're back to an already visited task that couldn't be
>                   * detached, we've seen all there is to see.
>                   */

Thanks, will use it in the next version.


Thanks

Huang Shijie

Re: [PATCH] sched/fair: do not scan twice in detach_tasks()
Posted by Valentin Schneider 2 months, 3 weeks ago
On 16/07/25 10:13, Shijie Huang wrote:
> On 2025/7/16 1:04, Valentin Schneider wrote:
>> On 07/07/25 16:36, Huang Shijie wrote:
>>> When detach_tasks() scans the src_cpu's task list, the task list
>>> may shrink during the scan. For example, the task list may hold
>>> four tasks at the beginning but shrink to two while detach_tasks()
>>> is scanning:
>>>      Task list at beginning : "ABCD"
>>>      Task list during scan  : "CD"
>>>
>>>      ("ABCD" stands for different tasks.)
>>>
>>> In this scenario env->loop_max is still four, so detach_tasks()
>>> may scan some tasks twice:
>>>      the scanning order may be : "DCDC"
>>>
>>
>> Huh, a quick hacky test suggests this isn't /too/ hard to trigger; I get
>> about one occurrence every two default hackbench runs (~200ms) on my dummy
>> QEMU setup.
>>
>> Have you seen this happen on your workloads or did you find this while
>> staring at the code?
>
> I found this issue in my Specjbb2015 test. It is very easy to trigger.
>

That would be good to include in the changelog.

> I noticed it many times in the test.
>
> I even saw the following:
>
>         At the beginning: env->loop_max is 12.
>
>         When detach_tasks() scans: the real task list has shrunk to 5.
>

That's set using rq->nr_running which includes more than just the CFS
tasks, and looking at the git history it looks like it's almost always been
the case.
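
For reference, the relevant rq fields (condensed from
kernel/sched/sched.h):

	struct rq {
		unsigned int		nr_running;	/* all classes: CFS + RT + DL + ... */
		...
		struct cfs_rq		cfs;		/* cfs.h_nr_queued counts CFS tasks only */
		...
		struct list_head	cfs_tasks;	/* the list detach_tasks() walks */
		...
	};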

Looks like env->loop_max really is only used for detach_tasks(), so perhaps
a "better" fix would be something like the below, so that we can't iterate
more than length(env->src_rq->cfs_tasks). That is, assuming

  rq->cfs.h_nr_queued == length(rq->cfs_tasks)

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9b4bbbf0af6f..32ae24aa377ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11687,7 +11687,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 		 * still unbalanced. ld_moved simply stays zero, so it is
 		 * correctly treated as an imbalance.
 		 */
-		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
+		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_queued);
 
 more_balance:
 		rq_lock_irqsave(busiest, &rf);
Re: [PATCH] sched/fair: do not scan twice in detach_tasks()
Posted by Shijie Huang 2 months, 3 weeks ago
On 2025/7/16 23:08, Valentin Schneider wrote:
> On 16/07/25 10:13, Shijie Huang wrote:
>> On 2025/7/16 1:04, Valentin Schneider wrote:
>>> On 07/07/25 16:36, Huang Shijie wrote:
>>>> When detach_tasks() scans the src_cpu's task list, the task list
>>>> may shrink during the scan. For example, the task list may hold
>>>> four tasks at the beginning but shrink to two while detach_tasks()
>>>> is scanning:
>>>>       Task list at beginning : "ABCD"
>>>>       Task list during scan  : "CD"
>>>>
>>>>       ("ABCD" stands for different tasks.)
>>>>
>>>> In this scenario env->loop_max is still four, so detach_tasks()
>>>> may scan some tasks twice:
>>>>       the scanning order may be : "DCDC"
>>>>
>>> Huh, a quick hacky test suggests this isn't /too/ hard to trigger; I get
>>> about one occurrence every two default hackbench runs (~200ms) on my dummy
>>> QEMU setup.
>>>
>>> Have you seen this happen on your workloads or did you find this while
>>> staring at the code?
>> I found this issue in my Specjbb2015 test. It is very easy to trigger.
>>
> That would be good to include in the changelog.
okay.
>> I noticed it many times in the test.
>>
>> I even saw the following:
>>
>>         At the beginning: env->loop_max is 12.
>>
>>         When detach_tasks() scans: the real task list has shrunk to 5.
>>
> That's set using rq->nr_running which includes more than just the CFS
> tasks, and looking at the git history it looks like it's almost always been
> the case.
>
> Looks like env->loop_max really is only used for detach_tasks(), so perhaps
> a "better" fix would be something like the below, so that we can't iterate
> more than length(env->src_rq->cfs_tasks). That is, assuming
>
>    rq->cfs.h_nr_queued == length(rq->cfs_tasks)
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b9b4bbbf0af6f..32ae24aa377ca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11687,7 +11687,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>   		 * still unbalanced. ld_moved simply stays zero, so it is
>   		 * correctly treated as an imbalance.
>   		 */
> -		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
> +		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_queued);

I tested this patch, but it did not work. I can still catch lots of
occurrences of this issue in the Specjbb test.

IMHO, the root cause of this issue is that env.loop_max is set outside
the rq's lock.

Even if we set env.loop_max to busiest->cfs.h_nr_queued, the real task
list length can still shrink in other places.

Btw, the real task list's length can also grow bigger than env.loop_max.

For example, if we set env.loop_max to 5, the real task list's length
may have changed to 7 by the time detach_tasks() runs.
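
For reference, the flow in sched_balance_rq() is roughly (condensed,
unrelated lines omitted):

	env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
					/* rq lock is NOT held here */
more_balance:
	rq_lock_irqsave(busiest, &rf);
	/* tasks may have been queued or dequeued between the two
	 * points above, so the cfs_tasks list no longer has to match
	 * env.loop_max in either direction */
	cur_ld_moved = detach_tasks(&env);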


Thanks

Huang Shijie

>   
>   more_balance:
>   		rq_lock_irqsave(busiest, &rf);
>
Re: [PATCH] sched/fair: do not scan twice in detach_tasks()
Posted by Valentin Schneider 2 months, 3 weeks ago
On 17/07/25 10:56, Shijie Huang wrote:
> On 2025/7/16 23:08, Valentin Schneider wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b9b4bbbf0af6f..32ae24aa377ca 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11687,7 +11687,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>   		 * still unbalanced. ld_moved simply stays zero, so it is
>>   		 * correctly treated as an imbalance.
>>   		 */
>> -		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
>> +		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_queued);
>
> I tested this patch, but it did not work. I can still catch lots of
> occurrences of this issue in the Specjbb test.
>
> IMHO, the root cause of this issue is that env.loop_max is set outside
> the rq's lock.
>
> Even if we set env.loop_max to busiest->cfs.h_nr_queued, the real task
> list length can still shrink in other places.
>

Ah right, and updating the max in detach_tasks() itself isn't a complete
solution if we re-enter it due to LBF_NEED_BREAK. Nevermind then :-)
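
For context, the re-entry path is roughly (condensed from
sched_balance_rq(); detach_tasks() sets LBF_NEED_BREAK to take a
breather every so many iterations):

more_balance:
	rq_lock_irqsave(busiest, &rf);
	cur_ld_moved = detach_tasks(&env);
	rq_unlock(busiest, &rf);
	...
	if (env.flags & LBF_NEED_BREAK) {
		env.flags &= ~LBF_NEED_BREAK;
		/* the lock was dropped above, so the list can change
		 * again before detach_tasks() resumes with the same
		 * env.loop_max */
		goto more_balance;
	}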