detach_tasks() uses struct lb_env.loop_max as an env.src_rq->cfs_tasks
iteration count limit. It is however set without the source RQ lock held,
and besides detach_tasks() can be re-invoked after releasing and
re-acquiring the RQ lock per LBF_NEED_BREAK.
This means that env.loop_max and the actual length of env.src_rq->cfs_tasks
as observed within detach_tasks() can differ. This can cause some tasks to
be unnecessarily iterated over more than once, for instance:
env.loop_max := 4
detach_tasks()
// Here env.src->cfs_tasks only contains two tasks which can't be
// migrated anywhere, so they're put back in the list each time.
env.src->cfs_tasks := [p1, p0]
// The iteration goes:
p0; cfs_tasks := [p0, p1]
p1; cfs_tasks := [p1, p0]
p0; cfs_tasks := [p0, p1]
p1; cfs_tasks := [p1, p0]
// IOW we iterate over each task twice
In a Specjbb test, similar situations were caught many times
(over 330,000 times in a 30-minute Specjbb run).
The patch introduces "first_back" to record the first task that is put
back on the task list. If we reach a task equal to first_back, we break
out of the loop and avoid scanning it a second time.
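
To make the effect of the sentinel concrete outside of fair.c, here is a
minimal, self-contained sketch of the same idea (the list helpers, the
->migratable flag and detach_tasks_sketch() are invented for illustration;
this is not the kernel code): pull candidates from the tail, rotate the
non-migratable ones back to the head, and stop as soon as the first
rotated-back candidate shows up at the tail again.

/*
 * Minimal sketch of the "first_back" sentinel, abstracted from
 * detach_tasks(): take candidates from the tail of a circular list,
 * rotate the ones that cannot be migrated back to the head, and stop
 * once the first rotated-back candidate reappears at the tail.
 */
#include <stdbool.h>
#include <stdio.h>

struct task {
	int id;
	bool migratable;
	struct task *prev, *next;	/* linked into a circular list */
};

static void list_init(struct task *head)
{
	head->prev = head->next = head;
}

static void list_add_head(struct task *head, struct task *p)
{
	p->next = head->next;
	p->prev = head;
	head->next->prev = p;
	head->next = p;
}

static void list_del(struct task *p)
{
	p->prev->next = p->next;
	p->next->prev = p->prev;
}

static int detach_tasks_sketch(struct task *head, unsigned int loop_max)
{
	struct task *first_back = NULL;
	int detached = 0;

	while (loop_max-- && head->prev != head) {
		struct task *p = head->prev;	/* like list_last_entry() */

		/* Back at a task we already put back once: stop. */
		if (p == first_back)
			break;

		if (p->migratable) {
			list_del(p);		/* "detach" it */
			detached++;
			continue;
		}

		/* Cannot migrate: rotate to the head, remember the first one. */
		list_del(p);
		list_add_head(head, p);
		if (!first_back)
			first_back = p;
	}
	return detached;
}

int main(void)
{
	struct task head, t[4] = {
		{ .id = 0, .migratable = false },
		{ .id = 1, .migratable = true  },
		{ .id = 2, .migratable = false },
		{ .id = 3, .migratable = true  },
	};
	int i;

	list_init(&head);
	for (i = 0; i < 4; i++)
		list_add_head(&head, &t[i]);

	/* loop_max is larger than the list, yet each task is visited once. */
	printf("detached %d tasks\n", detach_tasks_sketch(&head, 8));
	return 0;
}

Even with loop_max larger than the list length, the sketch visits each
task at most once, which mirrors what the patch achieves in detach_tasks().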
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
---
v3 --> v4:
Changed the commit message as suggested by Valentin Schneider.
v3: https://lore.kernel.org/all/20250718063523.9232-1-shijie@os.amperecomputing.com/
v2 --> v3:
Fix a typo in the commit message.
v2: https://lore.kernel.org/all/20250718054709.8781-1-shijie@os.amperecomputing.com/
v1 --> v2:
Add more comments from Valentin Schneider
v1: https://lore.kernel.org/all/20250707083636.38380-1-shijie@os.amperecomputing.com/
---
kernel/sched/fair.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e2963efe800..7cc9d50e3e11 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9443,6 +9443,7 @@ static int detach_tasks(struct lb_env *env)
 {
 	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	unsigned long util, load;
+	struct task_struct *first_back = NULL;
 	struct task_struct *p;
 	int detached = 0;
 
@@ -9481,6 +9482,12 @@ static int detach_tasks(struct lb_env *env)
 		}
 
 		p = list_last_entry(tasks, struct task_struct, se.group_node);
+		/*
+		 * We're back to an already visited task that couldn't be
+		 * detached, we've seen all there is to see.
+		 */
+		if (p == first_back)
+			break;
 
 		if (!can_migrate_task(p, env))
 			goto next;
@@ -9562,6 +9569,8 @@ static int detach_tasks(struct lb_env *env)
 			schedstat_inc(p->stats.nr_failed_migrations_hot);
 
 		list_move(&p->se.group_node, tasks);
+		if (!first_back)
+			first_back = p;
 	}
 
 	/*
--
2.40.1
On Mon, 21 Jul 2025 at 04:40, Huang Shijie <shijie@os.amperecomputing.com> wrote:
>
> detach_tasks() uses struct lb_env.loop_max as an env.src_rq->cfs_tasks
> iteration count limit. It is however set without the source RQ lock held,
> and besides detach_tasks() can be re-invoked after releasing and
> re-acquiring the RQ lock per LBF_NEED_BREAK.
>
> This means that env.loop_max and the actual length of env.src_rq->cfs_tasks
> as observed within detach_tasks() can differ. This can cause some tasks to

why not setting env.loop_max only once rq lock is taken in this case ?

side note : by default loop_max <= loop_break

> be unnecessarily iterated over more than once, for instance:
>
> [...]
On 2025/7/21 17:40, Vincent Guittot wrote:
> On Mon, 21 Jul 2025 at 04:40, Huang Shijie
> <shijie@os.amperecomputing.com> wrote:
>> detach_tasks() uses struct lb_env.loop_max as an env.src_rq->cfs_tasks
>> iteration count limit. It is however set without the source RQ lock held,
>> and besides detach_tasks() can be re-invoked after releasing and
>> re-acquiring the RQ lock per LBF_NEED_BREAK.
>>
>> This means that env.loop_max and the actual length of env.src_rq->cfs_tasks
>> as observed within detach_tasks() can differ. This can cause some tasks to
> why not setting env.loop_max only once rq lock is taken in this case ?

Yes, we can do it in this way.

Thanks
Huang Shijie
On 21/07/25 11:40, Vincent Guittot wrote:
> On Mon, 21 Jul 2025 at 04:40, Huang Shijie
> <shijie@os.amperecomputing.com> wrote:
>>
>> [...]
>>
>> This means that env.loop_max and the actual length of env.src_rq->cfs_tasks
>> as observed within detach_tasks() can differ. This can cause some tasks to
>
> why not setting env.loop_max only once rq lock is taken in this case ?
>
> side note : by default loop_max <= loop_break
>

I thought so too and dismissed that due to LBF_NEED_BREAK, but I guess we
could still do something like:

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9b4bbbf0af6f..eef3a0d341661 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11643,6 +11643,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 		.dst_grpmask	= group_balance_mask(sd->groups),
 		.idle		= idle,
 		.loop_break	= SCHED_NR_MIGRATE_BREAK,
+		.loop_max	= UINT_MAX,
 		.cpus		= cpus,
 		.fbq_type	= all,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
@@ -11681,18 +11682,19 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	/* Clear this flag as soon as we find a pullable task */
 	env.flags |= LBF_ALL_PINNED;
 	if (busiest->nr_running > 1) {
+more_balance:
 		/*
 		 * Attempt to move tasks. If sched_balance_find_src_group has found
 		 * an imbalance but busiest->nr_running <= 1, the group is
 		 * still unbalanced. ld_moved simply stays zero, so it is
 		 * correctly treated as an imbalance.
 		 */
-		env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
-
-more_balance:
 		rq_lock_irqsave(busiest, &rf);
 		update_rq_clock(busiest);
 
+
+		env.loop_max = min3(env.loop_max, sysctl_sched_nr_migrate, busiest->h_nr_running);
+
 		/*
 		 * cur_ld_moved - load moved in current iteration
 		 * ld_moved     - cumulative load moved across iterations
On Mon, 21 Jul 2025 at 13:25, Valentin Schneider <vschneid@redhat.com> wrote:
>
> On 21/07/25 11:40, Vincent Guittot wrote:
> > why not setting env.loop_max only once rq lock is taken in this case ?
> >
> > side note : by default loop_max <= loop_break
> >
>
> I thought so too and dismissed that due to LBF_NEED_BREAK, but I guess we
> could still do something like:
>
> [...]
> +		env.loop_max = min3(env.loop_max, sysctl_sched_nr_migrate, busiest->h_nr_running);
> [...]

I would prefer something like below:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1b3879850a9e..636798d53798 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11702,12 +11702,16 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 		 * still unbalanced. ld_moved simply stays zero, so it is
 		 * correctly treated as an imbalance.
 		 */
-		env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
 more_balance:
 		rq_lock_irqsave(busiest, &rf);
 		update_rq_clock(busiest);
 
+		if (!env.loop_max)
+			env.loop_max = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_runnable);
+		else
+			env.loop_max = min(env.loop_max, busiest->cfs.h_nr_runnable);
+
 		/*
 		 * cur_ld_moved - load moved in current iteration
 		 * ld_moved     - cumulative load moved across iterations
On Tue, 22 Jul 2025 at 09:49, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> [...]
>
> I would prefer something like below:
>
> [...]
>
> more_balance:
>         rq_lock_irqsave(busiest, &rf);
>         update_rq_clock(busiest);
>
> +       if (!env.loop_max)
> +               env.loop_max = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_runnable);

it should be h_nr_queued as mentioned by Huang and my patch has been
messed up by my web browser

> +       else
> +               env.loop_max = min(env.loop_max, busiest->cfs.h_nr_runnable);
> +
> [...]
On 2025/7/22 15:53, Vincent Guittot wrote:
>> more_balance:
>>         rq_lock_irqsave(busiest, &rf);
>>         update_rq_clock(busiest);
>>
>> +       if (!env.loop_max)
>> +               env.loop_max = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_runnable);
> it should be h_nr_queued as mentioned by Huang and my patch has been
> messed up by my web browser

Okay, I can create a patch later, and send it out after it passes the
Specjbb test.

Thanks
Huang Shijie
On 22/07/25 09:53, Vincent Guittot wrote:
>> [...]
>>
>> +       if (!env.loop_max)
>> +               env.loop_max = min(sysctl_sched_nr_migrate, busiest->cfs.h_nr_runnable);
>
> it should be h_nr_queued as mentioned by Huang and my patch has been
> messed up by my web browser
>

Works for me.
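
For reference, a rough standalone model of the direction settled on above,
setting loop_max only while the rq lock is held and clamping it to the
queue's current CFS task count (h_nr_queued), could look like the sketch
below. rq_t, clamp_loop_max() and the numbers are invented for
illustration; this is not the posted patch.

/*
 * Rough standalone model (an assumption about the eventual patch, not
 * the posted code): compute the iteration limit only while the source
 * runqueue lock is held, so it can never exceed the number of CFS tasks
 * actually queued, even when the balance loop re-enters after a
 * release/re-acquire of the lock.
 */
#include <stdio.h>

#define SCHED_NR_MIGRATE 32	/* stand-in for sysctl_sched_nr_migrate */

typedef struct {
	unsigned int h_nr_queued;	/* CFS tasks currently on the queue */
} rq_t;

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/* Called with the (pretend) rq lock held, once per pass of the balance loop. */
static unsigned int clamp_loop_max(const rq_t *busiest, unsigned int loop_max)
{
	if (!loop_max)		/* first pass: start from the sysctl cap */
		return min_u(SCHED_NR_MIGRATE, busiest->h_nr_queued);

	/* later passes: only ever shrink towards the current queue length */
	return min_u(loop_max, busiest->h_nr_queued);
}

int main(void)
{
	rq_t busiest = { .h_nr_queued = 4 };
	unsigned int loop_max = 0;

	loop_max = clamp_loop_max(&busiest, loop_max);	/* -> 4 */
	printf("pass 1: loop_max = %u\n", loop_max);

	busiest.h_nr_queued = 2;	/* tasks left while the lock was dropped */
	loop_max = clamp_loop_max(&busiest, loop_max);	/* -> 2 */
	printf("pass 2: loop_max = %u\n", loop_max);
	return 0;
}

The point of this shape is that loop_max is only computed or shrunk under
the lock, so in no pass can it exceed the number of CFS tasks actually
present on the source queue.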
On 2025/7/21 19:25, Valentin Schneider wrote:
> On 21/07/25 11:40, Vincent Guittot wrote:
>> why not setting env.loop_max only once rq lock is taken in this case ?
>>
>> side note : by default loop_max <= loop_break
>>
> I thought so too and dismissed that due to LBF_NEED_BREAK, but I guess we
> could still do something like:
>
> [...]
>
> +		env.loop_max = min3(env.loop_max, sysctl_sched_nr_migrate, busiest->h_nr_running);

Should it be busiest->nr_running, or busiest->cfs.h_nr_queued?

Do you mind if I create a patch based on this one? Or will you create an
official patch?

Thanks
Huang Shijie