When detach_tasks() scans the src_cpu's task list, the task list
may shrink during the scan. For example, the task list may have
four tasks at the beginning, but shrink to two during the scan in
detach_tasks():
Task list at beginning : "ABCD"
Task list during scan  : "CD"

("ABCD" stands for different tasks.)

In this scenario, env->loop_max is still four, so detach_tasks()
may scan some tasks twice:
the scanning order may be "DCDC"

The patch introduces "first_back" to record the first task that is
put back onto the task list. If we get a task that is equal to
first_back, we break the loop and avoid scanning it twice.
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
kernel/sched/fair.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e2963efe800..0e9c8ae68cc2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9443,6 +9443,7 @@ static int detach_tasks(struct lb_env *env)
{
struct list_head *tasks = &env->src_rq->cfs_tasks;
unsigned long util, load;
+ struct task_struct *first_back = NULL;
struct task_struct *p;
int detached = 0;
@@ -9481,6 +9482,8 @@ static int detach_tasks(struct lb_env *env)
}
p = list_last_entry(tasks, struct task_struct, se.group_node);
+ if (p == first_back)
+ break;
if (!can_migrate_task(p, env))
goto next;
@@ -9562,6 +9565,8 @@ static int detach_tasks(struct lb_env *env)
schedstat_inc(p->stats.nr_failed_migrations_hot);
list_move(&p->se.group_node, tasks);
+ if (!first_back)
+ first_back = p;
}
/*
--
2.40.1
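
For readers who want to see the scan behaviour in isolation, here is a minimal
userspace sketch (not kernel code; the tiny list helpers below are simplified
stand-ins for the <linux/list.h> API and the task names are made up). It assumes
loop_max was computed from a stale count of four while only two unmigratable
tasks, C and D, remain on the list: the plain loop visits "DCDC", whereas the
first_back guard stops after "DC".

/*
 * Minimal userspace sketch of the detach_tasks() scan described above.
 * Not kernel code: the list helpers are simplified stand-ins for the
 * <linux/list.h> API. Two unmigratable tasks (C and D) remain while
 * loop_max is still 4, so the plain loop prints "DCDC" and the
 * first_back variant prints "DC".
 */
#include <stdio.h>

struct node { struct node *prev, *next; char name; };

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void list_add_head(struct node *n, struct node *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Mimics list_move(): unlink n and re-add it at the head of the list. */
static void list_move_head(struct node *n, struct node *head)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_add_head(n, head);
}

static void scan(struct node *head, int loop_max, int use_first_back)
{
	struct node *first_back = NULL;
	int loop = 0;

	while (head->next != head) {		/* !list_empty(tasks)	*/
		struct node *p;

		if (++loop > loop_max)		/* stale upper bound	*/
			break;

		p = head->prev;			/* list_last_entry()	*/
		if (use_first_back && p == first_back)
			break;			/* already visited	*/

		printf("%c", p->name);

		/* Pretend can_migrate_task() failed: rotate p to the head. */
		list_move_head(p, head);
		if (use_first_back && !first_back)
			first_back = p;
	}
	printf("\n");
}

int main(void)
{
	struct node head, c = { .name = 'C' }, d = { .name = 'D' };

	list_init(&head);
	list_add_head(&d, &head);
	list_add_head(&c, &head);	/* list: C, D; last entry is D */

	scan(&head, 4, 0);		/* prints "DCDC" */
	scan(&head, 4, 1);		/* prints "DC"   */
	return 0;
}

As the review below points out, the re-scan can in principle repeat more than
twice; with the guard, the loop terminates as soon as it picks a task it has
already put back.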
On 07/07/25 16:36, Huang Shijie wrote:
> When detach_tasks() scans the src_cpu's task list, the task list
> may shrink during the scan. For example, the task list may have
> four tasks at the beginning, but shrink to two during the scan in
> detach_tasks():
> Task list at beginning : "ABCD"
> Task list during scan  : "CD"
>
> ("ABCD" stands for different tasks.)
>
> In this scenario, env->loop_max is still four, so detach_tasks()
> may scan some tasks twice:
> the scanning order may be "DCDC"
>

Huh, a quick hacky test suggests this isn't /too/ hard to trigger; I get
about one occurrence every two default hackbench runs (~200ms) on my dummy
QEMU setup.

Have you seen this happen on your workloads or did you find this while
staring at the code?

> The patch introduces "first_back" to record the first task that is
> put back onto the task list. If we get a task that is equal to
> first_back, we break the loop and avoid scanning it twice.
>

Potentially more than twice, no?

> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
> ---
> kernel/sched/fair.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7e2963efe800..0e9c8ae68cc2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9443,6 +9443,7 @@ static int detach_tasks(struct lb_env *env)
> {
> struct list_head *tasks = &env->src_rq->cfs_tasks;
> unsigned long util, load;
> + struct task_struct *first_back = NULL;
> struct task_struct *p;
> int detached = 0;
>
> @@ -9481,6 +9482,8 @@ static int detach_tasks(struct lb_env *env)
> }
>
> p = list_last_entry(tasks, struct task_struct, se.group_node);

Small comment nit:

	/*
	 * We're back to an already visited task that couldn't be
	 * detached, we've seen all there is to see.
	 */

> + if (p == first_back)
> + break;
>
> if (!can_migrate_task(p, env))
> goto next;
> @@ -9562,6 +9565,8 @@ static int detach_tasks(struct lb_env *env)
> schedstat_inc(p->stats.nr_failed_migrations_hot);
>
> list_move(&p->se.group_node, tasks);
> + if (!first_back)
> + first_back = p;
> }
>
> /*
> --
> 2.40.1
On 2025/7/16 1:04, Valentin Schneider wrote:
> On 07/07/25 16:36, Huang Shijie wrote:
>> When detach_tasks() scans the src_cpu's task list, the task list
>> may shrink during the scan. For example, the task list may have
>> four tasks at the beginning, but shrink to two during the scan in
>> detach_tasks():
>> Task list at beginning : "ABCD"
>> Task list during scan  : "CD"
>>
>> ("ABCD" stands for different tasks.)
>>
>> In this scenario, env->loop_max is still four, so detach_tasks()
>> may scan some tasks twice:
>> the scanning order may be "DCDC"
>>
>
> Huh, a quick hacky test suggests this isn't /too/ hard to trigger; I get
> about one occurrence every two default hackbench runs (~200ms) on my dummy
> QEMU setup.
>
> Have you seen this happen on your workloads or did you find this while
> staring at the code?

I found this issue in my Specjbb2015 test. It is very easy to trigger.

I noticed it many times in the test.

I even found that:

At the beginning, env->loop_max is 12.

By the time detach_tasks() scans, the real task list has shrunk to 5.

>> The patch introduces "first_back" to record the first task that is
>> put back onto the task list. If we get a task that is equal to
>> first_back, we break the loop and avoid scanning it twice.
>>
> Potentially more than twice, no?

Yes, it may be more than twice in theory.

>
>> Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
>> ---
>> kernel/sched/fair.c | 5 +++++
>> 1 file changed, 5 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 7e2963efe800..0e9c8ae68cc2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9443,6 +9443,7 @@ static int detach_tasks(struct lb_env *env)
>> {
>> struct list_head *tasks = &env->src_rq->cfs_tasks;
>> unsigned long util, load;
>> + struct task_struct *first_back = NULL;
>> struct task_struct *p;
>> int detached = 0;
>>
>> @@ -9481,6 +9482,8 @@ static int detach_tasks(struct lb_env *env)
>> }
>>
>> p = list_last_entry(tasks, struct task_struct, se.group_node);
> Small comment nit:
>
> 	/*
> 	 * We're back to an already visited task that couldn't be
> 	 * detached, we've seen all there is to see.
> 	 */

Thanks, I will use it in the next version.

Thanks
Huang Shijie
On 16/07/25 10:13, Shijie Huang wrote:
> On 2025/7/16 1:04, Valentin Schneider wrote:
>> On 07/07/25 16:36, Huang Shijie wrote:
>>> When detach_tasks() scans the src_cpu's task list, the task list
>>> may shrink during the scan. For example, the task list may have
>>> four tasks at the beginning, but shrink to two during the scan in
>>> detach_tasks():
>>> Task list at beginning : "ABCD"
>>> Task list during scan  : "CD"
>>>
>>> ("ABCD" stands for different tasks.)
>>>
>>> In this scenario, env->loop_max is still four, so detach_tasks()
>>> may scan some tasks twice:
>>> the scanning order may be "DCDC"
>>>
>>
>> Huh, a quick hacky test suggests this isn't /too/ hard to trigger; I get
>> about one occurrence every two default hackbench runs (~200ms) on my dummy
>> QEMU setup.
>>
>> Have you seen this happen on your workloads or did you find this while
>> staring at the code?
>
> I found this issue in my Specjbb2015 test. It is very easy to trigger.
>

That would be good to include in the changelog.

> I noticed it many times in the test.
>
> I even found that:
>
> At the beginning, env->loop_max is 12.
>
> By the time detach_tasks() scans, the real task list has shrunk to 5.
>

That's set using rq->nr_running, which includes more than just the CFS
tasks, and looking at the git history it looks like it's almost always
been the case.

Looks like env->loop_max really is only used for detach_tasks(), so perhaps
a "better" fix would be something like the below, so that we can't iterate
more than length(env->src_rq->cfs_tasks). That is, assuming

rq->cfs.h_nr_queued == length(rq->cfs_tasks)

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9b4bbbf0af6f..32ae24aa377ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11687,7 +11687,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 * still unbalanced. ld_moved simply stays zero, so it is
 * correctly treated as an imbalance.
 */
- env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
+ env.loop_max = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_queued);

more_balance:
 rq_lock_irqsave(busiest, &rf);
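
As a side note, the assumption above could in principle be verified with a
debug helper along the following lines. This is hypothetical code, not part
of the kernel, and the function name is made up; it simply walks cfs_tasks
under the rq lock and compares its length with cfs.h_nr_queued.

/*
 * Hypothetical debug helper (not in the kernel): check the assumption
 * that rq->cfs.h_nr_queued matches the length of rq->cfs_tasks.
 * Must be called with the rq lock held.
 */
static void check_cfs_tasks_length(struct rq *rq)
{
	struct task_struct *p;
	unsigned int n = 0;

	lockdep_assert_rq_held(rq);

	list_for_each_entry(p, &rq->cfs_tasks, se.group_node)
		n++;

	WARN_ON_ONCE(n != rq->cfs.h_nr_queued);
}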
On 2025/7/16 23:08, Valentin Schneider wrote:
> On 16/07/25 10:13, Shijie Huang wrote:
>> On 2025/7/16 1:04, Valentin Schneider wrote:
>>> On 07/07/25 16:36, Huang Shijie wrote:
>>>> When detach_tasks() scans the src_cpu's task list, the task list
>>>> may shrink during the scan. For example, the task list may have
>>>> four tasks at the beginning, but shrink to two during the scan in
>>>> detach_tasks():
>>>> Task list at beginning : "ABCD"
>>>> Task list during scan  : "CD"
>>>>
>>>> ("ABCD" stands for different tasks.)
>>>>
>>>> In this scenario, env->loop_max is still four, so detach_tasks()
>>>> may scan some tasks twice:
>>>> the scanning order may be "DCDC"
>>>>
>>> Huh, a quick hacky test suggests this isn't /too/ hard to trigger; I get
>>> about one occurrence every two default hackbench runs (~200ms) on my dummy
>>> QEMU setup.
>>>
>>> Have you seen this happen on your workloads or did you find this while
>>> staring at the code?
>> I found this issue in my Specjbb2015 test. It is very easy to trigger.
>>
> That would be good to include in the changelog.

Okay.

>> I noticed it many times in the test.
>>
>> I even found that:
>>
>> At the beginning, env->loop_max is 12.
>>
>> By the time detach_tasks() scans, the real task list has shrunk to 5.
>>
> That's set using rq->nr_running, which includes more than just the CFS
> tasks, and looking at the git history it looks like it's almost always
> been the case.
>
> Looks like env->loop_max really is only used for detach_tasks(), so perhaps
> a "better" fix would be something like the below, so that we can't iterate
> more than length(env->src_rq->cfs_tasks). That is, assuming
>
> rq->cfs.h_nr_queued == length(rq->cfs_tasks)
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b9b4bbbf0af6f..32ae24aa377ca 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11687,7 +11687,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> * still unbalanced. ld_moved simply stays zero, so it is
> * correctly treated as an imbalance.
> */
> - env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
> + env.loop_max = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_queued);

I tested this patch, but it did not work. I can still catch lots of
occurrences of this issue in the Specjbb test.

IMHO, the root cause of this issue is that env.loop_max is set outside
the rq's lock. Even if we set env.loop_max to busiest->cfs.h_nr_queued,
the real task list can still shrink elsewhere.

Btw, the real task list's length can also grow bigger than env.loop_max.
For example, we may set env.loop_max to 5, but by the time detach_tasks()
runs, the real task list's length may have changed to 7.

Thanks
Huang Shijie

>
> more_balance:
> rq_lock_irqsave(busiest, &rf);
>
On 17/07/25 10:56, Shijie Huang wrote:
> On 2025/7/16 23:08, Valentin Schneider wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b9b4bbbf0af6f..32ae24aa377ca 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11687,7 +11687,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>> * still unbalanced. ld_moved simply stays zero, so it is
>> * correctly treated as an imbalance.
>> */
>> - env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
>> + env.loop_max = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_queued);
>
> I tested this patch, but it did not work. I can still catch lots of
> occurrences of this issue in the Specjbb test.
>
> IMHO, the root cause of this issue is that env.loop_max is set outside
> the rq's lock. Even if we set env.loop_max to busiest->cfs.h_nr_queued,
> the real task list can still shrink elsewhere.
>

Ah right, and updating the max in detach_tasks() itself isn't a complete
solution if we re-enter it due to LBF_NEED_BREAK.

Nevermind then :-)
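
To summarize the point the thread converges on: loop_max is computed before
the busiest rq's lock is taken, so it is only a snapshot of the task count.
A heavily simplified, paraphrased sketch of the relevant ordering in
sched_balance_rq() (not the actual kernel code) looks like this:

/*
 * Heavily simplified, paraphrased ordering in sched_balance_rq();
 * not the actual kernel code.
 */

/* No rq lock held yet: tasks may still be enqueued or dequeued. */
env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);

more_balance:
	rq_lock_irqsave(busiest, &rf);
	update_rq_clock(busiest);

	/*
	 * cfs_tasks may have shrunk (or grown) since loop_max was
	 * computed; detach_tasks() still iterates up to loop_max.
	 */
	cur_ld_moved = detach_tasks(&env);

	rq_unlock(busiest, &rf);

	/* The window reopens before any LBF_NEED_BREAK retry. */
	if (env.flags & LBF_NEED_BREAK) {
		env.flags &= ~LBF_NEED_BREAK;
		goto more_balance;
	}

Clamping loop_max to cfs.h_nr_queued at that point narrows but cannot close
the window, and re-deriving it inside detach_tasks() does not survive the
LBF_NEED_BREAK re-entry, which is why the first_back approach instead ends
the scan based on what the loop actually observes.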