Hagar reported a 30% drop in the UnixBench spawn test with commit
eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
autogroup") on an m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
(aarch64, single-level MC sched domain) [1].
There is an early bail from sched_move_task() if p->sched_task_group is
equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both point
to taskgroup '/user.slice/user-1000.slice/session-1.scope'
(Ubuntu 22.04.5 LTS).
So in:
do_exit()
sched_autogroup_exit_task()
sched_move_task()
if sched_get_task_group(p) == p->sched_task_group
return
/* p is enqueued */
dequeue_task() \
sched_change_group() |
task_change_group_fair() |
detach_task_cfs_rq() | (1)
set_task_rq() |
attach_task_cfs_rq() |
enqueue_task() /
(1) isn't called for p anymore.
It turns out that the regression is related to sgs->group_util in
group_is_overloaded() and group_has_capacity(). If (1) isn't called for
all the 'spawn' tasks then sgs->group_util is ~900 while
sgs->group_capacity = 1024 (single CPU sched domain), and this leads to
group_is_overloaded() returning true (2) and group_has_capacity() false
(3) much more often compared to the case when (1) is called.
I.e. there are many more cases of 'group_is_overloaded' and
'group_fully_busy' in the WF_FORK wakeup path sched_balance_find_dst_cpu(),
which then much more often returns a CPU != smp_processor_id() (5).
This isn't good for these extremely short-running tasks (FORK + EXIT)
and also involves calling sched_balance_find_dst_group_cpu() unnecessarily
(single CPU sched domain).
Instead, if (1) is called for tasks with 'p->flags & PF_EXITING', the
path (4),(6) is taken much more often.
select_task_rq_fair(..., wake_flags = WF_FORK)
cpu = smp_processor_id()
new_cpu = sched_balance_find_dst_cpu(..., cpu, ...)
group = sched_balance_find_dst_group(..., cpu)
do {
update_sg_wakeup_stats()
sgs->group_type = group_classify()
if group_is_overloaded() (2)
return group_overloaded
if !group_has_capacity() (3)
return group_fully_busy
return group_has_spare (4)
} while group
if local_sgs.group_type > idlest_sgs.group_type
return idlest (5)
case group_has_spare:
if local_sgs.idle_cpus >= idlest_sgs.idle_cpus
return NULL (6)
UnixBench test './Run -c 4 spawn' on:
(a) VM AWS instance (m7gd.16xlarge) with v6.13 ('maxcpus=4 nr_cpus=4')
and Ubuntu 22.04.5 LTS (aarch64).
Shell & test run in '/user.slice/user-1000.slice/session-1.scope'.
w/o patch w/ patch
21005 27120
(b) i7-13700K with tip/sched/core ('nosmt maxcpus=8 nr_cpus=8') and
Ubuntu 22.04.5 LTS (x86_64).
Shell & test run in '/A'.
w/o patch w/ patch
67675 88806
CONFIG_SCHED_AUTOGROUP=y & /proc/sys/kernel/sched_autogroup_enabled set
to either 0 or 1.
[1] https://lkml.kernel.org/r/20250205151026.13061-1-hagarhem@amazon.com
Reported-by: Hagar Hemdan <hagarhem@amazon.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b00f884701a6..ca0e3c2eb94a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9064,7 +9064,7 @@ void sched_move_task(struct task_struct *tsk)
* group changes.
*/
group = sched_get_task_group(tsk);
- if (group == tsk->sched_task_group)
+ if ((group == tsk->sched_task_group) && !(tsk->flags & PF_EXITING))
return;
update_rq_clock(rq);
--
2.34.1
On Thu, 6 Mar 2025 at 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> Hagar reported a 30% drop in UnixBench spawn test with commit
> eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
> autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
> (aarch64) (single level MC sched domain) [1].
>
> There is an early bail from sched_move_task() if p->sched_task_group is
> equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
> pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
> (Ubuntu '22.04.5 LTS').
Isn't this the same use case that was used by commit eff6c8ce8d4d to
show the benefit of adding the 'if (group == tsk->sched_task_group)'
test?
Adding Wuchi, who added the condition.
> [...]
On 10/03/2025 14:59, Vincent Guittot wrote:
> On Thu, 6 Mar 2025 at 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> Hagar reported a 30% drop in UnixBench spawn test with commit
>> eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
>> autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
>> (aarch64) (single level MC sched domain) [1].
>>
>> There is an early bail from sched_move_task() if p->sched_task_group is
>> equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
>> pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
>> (Ubuntu '22.04.5 LTS').
>
> Isn't this the same use case that was used by commit eff6c8ce8d4d to
> show the benefit of adding the 'if (group == tsk->sched_task_group)'
> test?
> Adding Wuchi, who added the condition.
IMHO, UnixBench spawn reports a performance number according to how many
tasks could be spawned whereas, IIUC, commit eff6c8ce8d4d was reporting
the time spent in sched_move_task().
[...]
On Mon, 10 Mar 2025 at 16:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 10/03/2025 14:59, Vincent Guittot wrote:
> > On Thu, 6 Mar 2025 at 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >> [...]
> >
> > Isn't this the same use case that was used by commit eff6c8ce8d4d to
> > show the benefit of adding the 'if (group == tsk->sched_task_group)'
> > test?
> > Adding Wuchi, who added the condition.
>
> IMHO, UnixBench spawn reports a performance number according to how many
> tasks could be spawned whereas, IIUC, commit eff6c8ce8d4d was reporting
> the time spent in sched_move_task().
But doesn't your patch revert the benefits shown in the figures of
commit eff6c8ce8d4d? It skipped sched_move_task() in the do_exit()
autogroup path and you add it back.
>
> [...]
On 11/03/2025 17:35, Vincent Guittot wrote:
> On Mon, 10 Mar 2025 at 16:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 10/03/2025 14:59, Vincent Guittot wrote:
>>> On Thu, 6 Mar 2025 at 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>> [...]
>>>
>>> Isn't this the same use case that was used by commit eff6c8ce8d4d to
>>> show the benefit of adding the 'if (group == tsk->sched_task_group)'
>>> test?
>>> Adding Wuchi, who added the condition.
>>
>> IMHO, UnixBench spawn reports a performance number according to how many
>> tasks could be spawned whereas, IIUC, commit eff6c8ce8d4d was reporting
>> the time spend in sched_move_task().
>
> But doesn't your patch revert the benefits shown in the figures of
> commit eff6c8ce8d4d? It skipped sched_move_task() in the do_exit()
> autogroup path and you add it back.
Yeah, we do need the PELT update in sched_change_group()
(task_change_group_fair()) in the do_exit() path to get the 30% score
back in 'UnixBench spawn', even though that means we spend more time
in sched_move_task().
I retested this and it turns out that 'group == tsk->sched_task_group'
is only true when sched_move_task() is called from exit.
So to get the score back for 'UnixBench spawn' we should rather revert
commit eff6c8ce8d4d.
The analysis in my patch still holds though.
If you guys agree I can send the revert with my analysis in the
patch-header.
On Wed, Mar 12, 2025 at 03:41:40PM +0100, Dietmar Eggemann wrote:
> On 11/03/2025 17:35, Vincent Guittot wrote:
> > On Mon, 10 Mar 2025 at 16:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >> [...]
> >
> > But doesn't your patch revert the benefits shown in the figures of
> > commit eff6c8ce8d4d? It skipped sched_move_task() in the do_exit()
> > autogroup path and you add it back.
>
> Yeah, we do need the PELT update in sched_change_group()
> (task_change_group_fair()) in the do_exit() path to get the 30% score
> back in 'UnixBench spawn', even though that means we spend more time
> in sched_move_task().
>
> I retested this and it turns out that 'group == tsk->sched_task_group'
> is only true when sched_move_task() is called from exit.
>
> So to get the score back for 'UnixBench spawn' we should rather revert
> commit eff6c8ce8d4d.
>
> The analysis in my patch still holds though.
>
> If you guys agree I can send the revert with my analysis in the
> patch-header.
Agree. The follow-up commit fa614b4feb5a ("sched: Simplify sched_move_task()")
needs to be reverted as well.
On Thu, 13 Mar 2025 at 10:21, Hagar Hemdan <hagarhem@amazon.com> wrote:
>
> [...]
> Agree. The follow-up commit fa614b4feb5a ("sched: Simplify sched_move_task()")
> needs to be reverted as well.
Why do you think it should be reverted as well?
On Fri, Mar 14, 2025 at 05:06:50PM +0100, Vincent Guittot wrote:
> On Thu, 13 Mar 2025 at 10:21, Hagar Hemdan <hagarhem@amazon.com> wrote:
> >
> > [...]
> > Agree. The follow-up commit fa614b4feb5a ("sched: Simplify sched_move_task()")
> > needs to be reverted as well.
>
> Why do you think it should be reverted as well ?
I meant that the revert of eff6c8ce8d4d requires fa614b4feb5a to be
reverted first. Dietmar has already done this in his revert
https://lore.kernel.org/all/20250314151345.275739-1-dietmar.eggemann@arm.com/,
so it's all good now.
On Wed, 12 Mar 2025 at 15:41, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 11/03/2025 17:35, Vincent Guittot wrote:
> > [...]
> >
> > But doesn't your patch revert the benefits shown in the figures of
> > commit eff6c8ce8d4d? It skipped sched_move_task() in the do_exit()
> > autogroup path and you add it back.
>
> Yeah, we do need the PELT update in sched_change_group()
> (task_change_group_fair()) in the do_exit() path to get the 30% score
> back in 'UnixBench spawn', even though that means we spend more time
> in sched_move_task().
>
> I retested this and it turns out that 'group == tsk->sched_task_group'
> is only true when sched_move_task() is called from exit.
>
> So to get the score back for 'UnixBench spawn' we should rather revert
> commit eff6c8ce8d4d.
>
> The analysis in my patch still holds though.
>
> If you guys agree I can send the revert with my analysis in the
> patch-header.
This seems to be the best option to me.
On Thu, Mar 06, 2025 at 05:26:35PM +0100, Dietmar Eggemann wrote:
> [...]
Thank you very much for submitting the fix and for all the explanations.
Could you please add a "Fixes:" tag for commit eff6c8ce8d4d to your patch,
so that it gets backported to stable 6.12?
Also, this was actually discovered internally by Hazem, so please add
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com> and
Tested-by: Hagar Hemdan <hagarhem@amazon.com>.
Thanks,
Hagar