[PATCH] /sched/core: Fix Unixbench spawn test regression

Posted by Dietmar Eggemann 11 months, 1 week ago
Hagar reported a 30% drop in UnixBench spawn test with commit
eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
(aarch64) (single level MC sched domain) [1].

There is an early bail from sched_move_task() if p->sched_task_group is
equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
(Ubuntu '22.04.5 LTS').

So in:

  do_exit()

    sched_autogroup_exit_task()

      sched_move_task()

        if sched_get_task_group(p) == p->sched_task_group
          return

        /* p is enqueued */
        dequeue_task()              \
        sched_change_group()        |
          task_change_group_fair()  |
            detach_task_cfs_rq()    |                              (1)
            set_task_rq()           |
            attach_task_cfs_rq()    |
        enqueue_task()              /

(1) isn't called for p anymore.

Turns out that the regression is related to sgs->group_util in
group_is_overloaded() and group_has_capacity(). If (1) isn't called for
all the 'spawn' tasks then sgs->group_util is ~900 and
sgs->group_capacity = 1024 (single CPU sched domain) and this leads to
group_is_overloaded() returning true (2) and group_has_capacity() false
(3) much more often compared to the case when (1) is called.
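
To make the numbers above concrete, here is a small userspace model of the
two utilization checks. It is only a sketch: the struct, the model_* names
and IMBALANCE_PCT = 117 are assumptions made for illustration, and the
runnable-based checks of the real helpers are left out.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-group statistics. */
struct sg_stats {
	unsigned long group_capacity;	/* 1024 for one full-capacity CPU */
	unsigned long group_util;	/* sum of CPU utilization in the group */
	unsigned int sum_nr_running;	/* runnable tasks in the group */
	unsigned int group_weight;	/* number of CPUs in the group */
};

/* Assumption: imbalance_pct of the sched domain is 117. */
#define IMBALANCE_PCT 117UL

/* Utilization part of the 'overloaded' check only. */
static bool model_group_is_overloaded(const struct sg_stats *sgs)
{
	/* Not more tasks than CPUs: never overloaded. */
	if (sgs->sum_nr_running <= sgs->group_weight)
		return false;

	/* Overloaded once util exceeds capacity scaled by 100/imbalance_pct. */
	return sgs->group_capacity * 100 < sgs->group_util * IMBALANCE_PCT;
}

/* Utilization part of the 'has capacity' check only. */
static bool model_group_has_capacity(const struct sg_stats *sgs)
{
	/* Fewer tasks than CPUs: there is room. */
	if (sgs->sum_nr_running < sgs->group_weight)
		return true;

	return sgs->group_capacity * 100 > sgs->group_util * IMBALANCE_PCT;
}
```

With group_capacity = 1024 and this imbalance_pct, the comparison flips
at a group_util of about 875 (1024 * 100 / 117), so a value of ~900 lands on
the 'overloaded' / 'no capacity' side as soon as the CPU has enough runnable
tasks.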

I.e. there are many more cases of 'group_overloaded' and
'group_fully_busy' in WF_FORK wakeup sched_balance_find_dst_cpu() which
then much more often returns a CPU != smp_processor_id() (5).

This isn't good for these extremely short running tasks (FORK + EXIT)
and also involves calling sched_balance_find_dst_group_cpu()
unnecessarily (single CPU sched domain).

Instead if (1) is called for 'p->flags & PF_EXITING' then the path
(4),(6) is taken much more often.

  select_task_rq_fair(..., wake_flags = WF_FORK)

    cpu = smp_processor_id()

    new_cpu = sched_balance_find_dst_cpu(..., cpu, ...)

      group = sched_balance_find_dst_group(..., cpu)

        do {

          update_sg_wakeup_stats()

            sgs->group_type = group_classify()

              if group_is_overloaded()                             (2)
                return group_overloaded

              if !group_has_capacity()                             (3)
                return group_fully_busy

              return group_has_spare                               (4)

        } while group

        if local_sgs.group_type > idlest_sgs.group_type
          return idlest                                            (5)

        case group_has_spare:

          if local_sgs.idle_cpus >= idlest_sgs.idle_cpus
            return NULL                                            (6)
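
The decision at (5) and (6) in the call graph above can be modelled in a few
lines. This is a sketch under stated assumptions: the enum only carries the
three group types mentioned here, all model_* names are hypothetical, and
the load-based comparison the kernel does for two equally busy groups is
not modelled (the sketch simply stays local in that case).

```c
#include <stdbool.h>

/* Ordered so that a larger value means a busier group. */
enum model_group_type {
	MODEL_GROUP_HAS_SPARE = 0,
	MODEL_GROUP_FULLY_BUSY,
	MODEL_GROUP_OVERLOADED,
};

/* Hypothetical subset of the per-group wakeup statistics. */
struct model_sg {
	enum model_group_type type;
	unsigned int idle_cpus;
};

/*
 * Return true if the forked task should go to the idlest group (5),
 * false if it should stay in the local group (the 'return NULL' case, 6).
 */
static bool model_pick_idlest(const struct model_sg *local,
			      const struct model_sg *idlest)
{
	/* Local group is busier than the idlest one: move away (5). */
	if (local->type > idlest->type)
		return true;

	/* Local group is less busy: stay. */
	if (local->type < idlest->type)
		return false;

	/* Same type: only the has_spare comparison is modelled here. */
	if (local->type == MODEL_GROUP_HAS_SPARE)
		return local->idle_cpus < idlest->idle_cpus;	/* else (6) */

	/* Simplification: stay local for equally busy groups. */
	return false;
}
```

In this model, a local group wrongly classified as overloaded always loses
against a remote group with spare capacity, which is the (5) path that
moves the short-lived child away from smp_processor_id().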

Unixbench Tests './Run -c 4 spawn' on:

(a) VM AWS instance (m7gd.16xlarge) with v6.13 ('maxcpus=4 nr_cpus=4')
    and Ubuntu 22.04.5 LTS (aarch64).

    Shell & test run in '/user.slice/user-1000.slice/session-1.scope'.

    w/o patch   w/ patch
    21005       27120

(b) i7-13700K with tip/sched/core ('nosmt maxcpus=8 nr_cpus=8') and
    Ubuntu 22.04.5 LTS (x86_64).

    Shell & test run in '/A'.

    w/o patch   w/ patch
    67675       88806

CONFIG_SCHED_AUTOGROUP=y & /proc/sys/kernel/sched_autogroup_enabled equal
0 or 1.
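
For reference, the relative gain in the two result tables above can be
worked out with integer arithmetic. The helper name is hypothetical and the
result is truncated, not rounded.

```c
/*
 * Relative gain of score_with over score_without, in tenths of a
 * percent (truncated towards zero).
 */
static long gain_tenths_of_percent(long score_without, long score_with)
{
	return (score_with - score_without) * 1000 / score_without;
}

/*
 * (a) gain_tenths_of_percent(21005, 27120) == 291, i.e. about +29.1%
 * (b) gain_tenths_of_percent(67675, 88806) == 312, i.e. about +31.2%
 */
```

Both are in line with the roughly 30% that was reported.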

[1] https://lkml.kernel.org/r/20250205151026.13061-1-hagarhem@amazon.com

Reported-by: Hagar Hemdan <hagarhem@amazon.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b00f884701a6..ca0e3c2eb94a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9064,7 +9064,7 @@ void sched_move_task(struct task_struct *tsk)
 	 * group changes.
 	 */
 	group = sched_get_task_group(tsk);
-	if (group == tsk->sched_task_group)
+	if ((group == tsk->sched_task_group) && !(tsk->flags & PF_EXITING))
 		return;
 
 	update_rq_clock(rq);
-- 
2.34.1
Re: [PATCH] /sched/core: Fix Unixbench spawn test regression
Posted by Vincent Guittot 11 months ago
On Thu, 6 Mar 2025 at 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> Hagar reported a 30% drop in UnixBench spawn test with commit
> eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
> autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
> (aarch64) (single level MC sched domain) [1].
>
> There is an early bail from sched_move_task() if p->sched_task_group is
> equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
> pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
> (Ubuntu '22.04.5 LTS').

Isn't this the same use case that commit eff6c8ce8d4d used to show the
benefit of adding the test 'if (group == tsk->sched_task_group)'?
Adding Wuchi, who added the condition.

>
> So in:
>
>   do_exit()
>
>     sched_autogroup_exit_task()
>
>       sched_move_task()
>
>         if sched_get_task_group(p) == p->sched_task_group
>           return
>
>         /* p is enqueued */
>         dequeue_task()              \
>         sched_change_group()        |
>           task_change_group_fair()  |
>             detach_task_cfs_rq()    |                              (1)
>             set_task_rq()           |
>             attach_task_cfs_rq()    |
>         enqueue_task()              /
>
> (1) isn't called for p anymore.
>
> Turns out that the regression is related to sgs->group_util in
> group_is_overloaded() and group_has_capacity(). If (1) isn't called for
> all the 'spawn' tasks then sgs->group_util is ~900 and
> sgs->group_capacity = 1024 (single CPU sched domain) and this leads to
> group_is_overloaded() returning true (2) and group_has_capacity() false
> (3) much more often compared to the case when (1) is called.
>
> I.e. there are much more cases of 'group_is_overloaded' and
> 'group_fully_busy' in WF_FORK wakeup sched_balance_find_dst_cpu() which
> then returns much more often a CPU != smp_processor_id() (5).
>
> This isn't good for these extremely short running tasks (FORK + EXIT)
> and also involves calling sched_balance_find_dst_group_cpu() unnecessary
> (single CPU sched domain).
>
> Instead if (1) is called for 'p->flags & PF_EXITING' then the path
> (4),(6) is taken much more often.
>
>   select_task_rq_fair(..., wake_flags = WF_FORK)
>
>     cpu = smp_processor_id()
>
>     new_cpu = sched_balance_find_dst_cpu(..., cpu, ...)
>
>       group = sched_balance_find_dst_group(..., cpu)
>
>         do {
>
>           update_sg_wakeup_stats()
>
>             sgs->group_type = group_classify()
>
>               if group_is_overloaded()                             (2)
>                 return group_overloaded
>
>               if !group_has_capacity()                             (3)
>                 return group_fully_busy
>
>               return group_has_spare                               (4)
>
>         } while group
>
>         if local_sgs.group_type > idlest_sgs.group_type
>           return idlest                                            (5)
>
>         case group_has_spare:
>
>           if local_sgs.idle_cpus >= idlest_sgs.idle_cpus
>             return NULL                                            (6)
>
> Unixbench Tests './Run -c 4 spawn' on:
>
> (a) VM AWS instance (m7gd.16xlarge) with v6.13 ('maxcpus=4 nr_cpus=4')
>     and Ubuntu 22.04.5 LTS (aarch64).
>
>     Shell & test run in '/user.slice/user-1000.slice/session-1.scope'.
>
>     w/o patch   w/ patch
>     21005       27120
>
> (b) i7-13700K with tip/sched/core ('nosmt maxcpus=8 nr_cpus=8') and
>     Ubuntu 22.04.5 LTS (x86_64).
>
>     Shell & test run in '/A'.
>
>     w/o patch   w/ patch
>     67675       88806
>
> CONFIG_SCHED_AUTOGROUP=y & /sys/proc/kernel/sched_autogroup_enabled equal
> 0 or 1.
>
> [1] https://lkml.kernel.org/r/20250205151026.13061-1-hagarhem@amazon.com
>
> Reported-by: Hagar Hemdan <hagarhem@amazon.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b00f884701a6..ca0e3c2eb94a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9064,7 +9064,7 @@ void sched_move_task(struct task_struct *tsk)
>          * group changes.
>          */
>         group = sched_get_task_group(tsk);
> -       if (group == tsk->sched_task_group)
> +       if ((group == tsk->sched_task_group) && !(tsk->flags & PF_EXITING))
>                 return;
>
>         update_rq_clock(rq);
> --
> 2.34.1
>
Re: [PATCH] /sched/core: Fix Unixbench spawn test regression
Posted by Dietmar Eggemann 11 months ago
On 10/03/2025 14:59, Vincent Guittot wrote:
> On Thu, 6 Mar 2025 at 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> Hagar reported a 30% drop in UnixBench spawn test with commit
>> eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
>> autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
>> (aarch64) (single level MC sched domain) [1].
>>
>> There is an early bail from sched_move_task() if p->sched_task_group is
>> equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
>> pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
>> (Ubuntu '22.04.5 LTS').
> 
> Isn't this same use case that has been used by commit eff6c8ce8d4d to
> show the benefit of adding the test if ((group ==
> tsk->sched_task_group) ?
> Adding Wuchi who added the condition

IMHO, UnixBench spawn reports a performance number according to how many
tasks could be spawned whereas, IIUC, commit eff6c8ce8d4d was reporting
the time spent in sched_move_task().

[...]
Re: [PATCH] /sched/core: Fix Unixbench spawn test regression
Posted by Vincent Guittot 11 months ago
On Mon, 10 Mar 2025 at 16:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 10/03/2025 14:59, Vincent Guittot wrote:
> > On Thu, 6 Mar 2025 at 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> Hagar reported a 30% drop in UnixBench spawn test with commit
> >> eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
> >> autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
> >> (aarch64) (single level MC sched domain) [1].
> >>
> >> There is an early bail from sched_move_task() if p->sched_task_group is
> >> equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
> >> pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
> >> (Ubuntu '22.04.5 LTS').
> >
> > Isn't this same use case that has been used by commit eff6c8ce8d4d to
> > show the benefit of adding the test if ((group ==
> > tsk->sched_task_group) ?
> > Adding Wuchi who added the condition
>
> IMHO, UnixBench spawn reports a performance number according to how many
> tasks could be spawned whereas, IIUC, commit eff6c8ce8d4d was reporting
> the time spend in sched_move_task().

But doesn't your patch revert the benefits shown in the figures of
commit eff6c8ce8d4d? It skipped sched_move_task() in the do_exit()
autogroup path and you add it back.


>
> [...]
Re: [PATCH] /sched/core: Fix Unixbench spawn test regression
Posted by Dietmar Eggemann 11 months ago
On 11/03/2025 17:35, Vincent Guittot wrote:
> On Mon, 10 Mar 2025 at 16:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 10/03/2025 14:59, Vincent Guittot wrote:
>>> On Thu, 6 Mar 2025 at 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> Hagar reported a 30% drop in UnixBench spawn test with commit
>>>> eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
>>>> autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
>>>> (aarch64) (single level MC sched domain) [1].
>>>>
>>>> There is an early bail from sched_move_task() if p->sched_task_group is
>>>> equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
>>>> pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
>>>> (Ubuntu '22.04.5 LTS').
>>>
>>> Isn't this same use case that has been used by commit eff6c8ce8d4d to
>>> show the benefit of adding the test if ((group ==
>>> tsk->sched_task_group) ?
>>> Adding Wuchi who added the condition
>>
>> IMHO, UnixBench spawn reports a performance number according to how many
>> tasks could be spawned whereas, IIUC, commit eff6c8ce8d4d was reporting
>> the time spend in sched_move_task().
> 
> But does not your patch revert the benefits shown in the figures of
> commit eff6c8ce8d4d ? It skipped sched_move task in do_exit autogroup
> and you adds it back

Yeah, we do need the PELT update in sched_change_group()
(task_change_group_fair()) in the do_exit() path to get the 30% score
back in 'UnixBench spawn', even though that means we spend more time in
sched_move_task().

I retested this and it turns out that 'group == tsk->sched_task_group'
is only true when sched_move_task() is called from exit.

So to get the score back for 'UnixBench spawn' we should rather revert
commit eff6c8ce8d4d.

The analysis in my patch still holds though.

If you guys agree I can send the revert with my analysis in the
patch-header.
Re: [PATCH] /sched/core: Fix Unixbench spawn test regression
Posted by Hagar Hemdan 11 months ago
On Wed, Mar 12, 2025 at 03:41:40PM +0100, Dietmar Eggemann wrote:
> On 11/03/2025 17:35, Vincent Guittot wrote:
> > On Mon, 10 Mar 2025 at 16:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 10/03/2025 14:59, Vincent Guittot wrote:
> >>> On Thu, 6 Mar 2025 at 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>>>
> >>>> Hagar reported a 30% drop in UnixBench spawn test with commit
> >>>> eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
> >>>> autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
> >>>> (aarch64) (single level MC sched domain) [1].
> >>>>
> >>>> There is an early bail from sched_move_task() if p->sched_task_group is
> >>>> equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
> >>>> pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
> >>>> (Ubuntu '22.04.5 LTS').
> >>>
> >>> Isn't this same use case that has been used by commit eff6c8ce8d4d to
> >>> show the benefit of adding the test if ((group ==
> >>> tsk->sched_task_group) ?
> >>> Adding Wuchi who added the condition
> >>
> >> IMHO, UnixBench spawn reports a performance number according to how many
> >> tasks could be spawned whereas, IIUC, commit eff6c8ce8d4d was reporting
> >> the time spend in sched_move_task().
> > 
> > But does not your patch revert the benefits shown in the figures of
> > commit eff6c8ce8d4d ? It skipped sched_move task in do_exit autogroup
> > and you adds it back
> 
> Yeah, we do need the PELT update in sched_change_group()
> (task_change_group_fair()) in the do_exit() path to get the 30% score
> back in 'UnixBench spawn'. Even that means we need more time due to this
> in sched_move_task().
> 
> I retested this and it turns out that 'group == tsk->sched_task_group'
> is only true when sched_move_task() is called from exit.
> 
> So to get the score back for 'UnixBench spawn' we should rather revert
> commit eff6c8ce8d4d.
> 
> The analysis in my patch still holds though.
> 
> If you guys agree I can send the revert with my analysis in the
> patch-header.
Agree. The follow up commit fa614b4feb5a ("sched: Simplify sched_move_task()")
needs to be reverted as well.
Re: [PATCH] /sched/core: Fix Unixbench spawn test regression
Posted by Vincent Guittot 10 months, 4 weeks ago
On Thu, 13 Mar 2025 at 10:21, Hagar Hemdan <hagarhem@amazon.com> wrote:
>
> On Wed, Mar 12, 2025 at 03:41:40PM +0100, Dietmar Eggemann wrote:
> > On 11/03/2025 17:35, Vincent Guittot wrote:
> > > On Mon, 10 Mar 2025 at 16:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> > >>
> > >> On 10/03/2025 14:59, Vincent Guittot wrote:
> > >>> On Thu, 6 Mar 2025 at 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> > >>>>
> > >>>> Hagar reported a 30% drop in UnixBench spawn test with commit
> > >>>> eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
> > >>>> autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
> > >>>> (aarch64) (single level MC sched domain) [1].
> > >>>>
> > >>>> There is an early bail from sched_move_task() if p->sched_task_group is
> > >>>> equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
> > >>>> pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
> > >>>> (Ubuntu '22.04.5 LTS').
> > >>>
> > >>> Isn't this same use case that has been used by commit eff6c8ce8d4d to
> > >>> show the benefit of adding the test if ((group ==
> > >>> tsk->sched_task_group) ?
> > >>> Adding Wuchi who added the condition
> > >>
> > >> IMHO, UnixBench spawn reports a performance number according to how many
> > >> tasks could be spawned whereas, IIUC, commit eff6c8ce8d4d was reporting
> > >> the time spend in sched_move_task().
> > >
> > > But does not your patch revert the benefits shown in the figures of
> > > commit eff6c8ce8d4d ? It skipped sched_move task in do_exit autogroup
> > > and you adds it back
> >
> > Yeah, we do need the PELT update in sched_change_group()
> > (task_change_group_fair()) in the do_exit() path to get the 30% score
> > back in 'UnixBench spawn'. Even that means we need more time due to this
> > in sched_move_task().
> >
> > I retested this and it turns out that 'group == tsk->sched_task_group'
> > is only true when sched_move_task() is called from exit.
> >
> > So to get the score back for 'UnixBench spawn' we should rather revert
> > commit eff6c8ce8d4d.
> >
> > The analysis in my patch still holds though.
> >
> > If you guys agree I can send the revert with my analysis in the
> > patch-header.
> Agree. The follow up commit fa614b4feb5a ("sched: Simplify sched_move_task()")
> needs to be reverted as well.

Why do you think it should be reverted as well ?
Re: [PATCH] /sched/core: Fix Unixbench spawn test regression
Posted by Hagar Hemdan 10 months, 4 weeks ago
On Fri, Mar 14, 2025 at 05:06:50PM +0100, Vincent Guittot wrote:
> On Thu, 13 Mar 2025 at 10:21, Hagar Hemdan <hagarhem@amazon.com> wrote:
> >
> > On Wed, Mar 12, 2025 at 03:41:40PM +0100, Dietmar Eggemann wrote:
> > > On 11/03/2025 17:35, Vincent Guittot wrote:
> > > > On Mon, 10 Mar 2025 at 16:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> > > >>
> > > >> On 10/03/2025 14:59, Vincent Guittot wrote:
> > > >>> On Thu, 6 Mar 2025 at 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> > > >>>>
> > > >>>> Hagar reported a 30% drop in UnixBench spawn test with commit
> > > >>>> eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
> > > >>>> autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
> > > >>>> (aarch64) (single level MC sched domain) [1].
> > > >>>>
> > > >>>> There is an early bail from sched_move_task() if p->sched_task_group is
> > > >>>> equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
> > > >>>> pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
> > > >>>> (Ubuntu '22.04.5 LTS').
> > > >>>
> > > >>> Isn't this same use case that has been used by commit eff6c8ce8d4d to
> > > >>> show the benefit of adding the test if ((group ==
> > > >>> tsk->sched_task_group) ?
> > > >>> Adding Wuchi who added the condition
> > > >>
> > > >> IMHO, UnixBench spawn reports a performance number according to how many
> > > >> tasks could be spawned whereas, IIUC, commit eff6c8ce8d4d was reporting
> > > >> the time spend in sched_move_task().
> > > >
> > > > But does not your patch revert the benefits shown in the figures of
> > > > commit eff6c8ce8d4d ? It skipped sched_move task in do_exit autogroup
> > > > and you adds it back
> > >
> > > Yeah, we do need the PELT update in sched_change_group()
> > > (task_change_group_fair()) in the do_exit() path to get the 30% score
> > > back in 'UnixBench spawn'. Even that means we need more time due to this
> > > in sched_move_task().
> > >
> > > I retested this and it turns out that 'group == tsk->sched_task_group'
> > > is only true when sched_move_task() is called from exit.
> > >
> > > So to get the score back for 'UnixBench spawn' we should rather revert
> > > commit eff6c8ce8d4d.
> > >
> > > The analysis in my patch still holds though.
> > >
> > > If you guys agree I can send the revert with my analysis in the
> > > patch-header.
> > Agree. The follow up commit fa614b4feb5a ("sched: Simplify sched_move_task()")
> > needs to be reverted as well.
> 
> Why do you think it should be reverted as well ?

I meant the revert of eff6c8ce8d4d7 requires fa614b4feb5a to be
reverted first. Dietmar has already done this in his revert 
https://lore.kernel.org/all/20250314151345.275739-1-dietmar.eggemann@arm.com/,
so it's all good now.
Re: [PATCH] /sched/core: Fix Unixbench spawn test regression
Posted by Vincent Guittot 11 months ago
On Wed, 12 Mar 2025 at 15:41, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 11/03/2025 17:35, Vincent Guittot wrote:
> > On Mon, 10 Mar 2025 at 16:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 10/03/2025 14:59, Vincent Guittot wrote:
> >>> On Thu, 6 Mar 2025 at 17:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>>>
> >>>> Hagar reported a 30% drop in UnixBench spawn test with commit
> >>>> eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
> >>>> autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
> >>>> (aarch64) (single level MC sched domain) [1].
> >>>>
> >>>> There is an early bail from sched_move_task() if p->sched_task_group is
> >>>> equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
> >>>> pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
> >>>> (Ubuntu '22.04.5 LTS').
> >>>
> >>> Isn't this same use case that has been used by commit eff6c8ce8d4d to
> >>> show the benefit of adding the test if ((group ==
> >>> tsk->sched_task_group) ?
> >>> Adding Wuchi who added the condition
> >>
> >> IMHO, UnixBench spawn reports a performance number according to how many
> >> tasks could be spawned whereas, IIUC, commit eff6c8ce8d4d was reporting
> >> the time spend in sched_move_task().
> >
> > But does not your patch revert the benefits shown in the figures of
> > commit eff6c8ce8d4d ? It skipped sched_move task in do_exit autogroup
> > and you adds it back
>
> Yeah, we do need the PELT update in sched_change_group()
> (task_change_group_fair()) in the do_exit() path to get the 30% score
> back in 'UnixBench spawn'. Even that means we need more time due to this
> in sched_move_task().
>
> I retested this and it turns out that 'group == tsk->sched_task_group'
> is only true when sched_move_task() is called from exit.
>
> So to get the score back for 'UnixBench spawn' we should rather revert
> commit eff6c8ce8d4d.
>
> The analysis in my patch still holds though.
>
> If you guys agree I can send the revert with my analysis in the
> patch-header.

This seems to be the best option to me.
Re: [PATCH] /sched/core: Fix Unixbench spawn test regression
Posted by Hagar Hemdan 11 months ago
On Thu, Mar 06, 2025 at 05:26:35PM +0100, Dietmar Eggemann wrote:
> Hagar reported a 30% drop in UnixBench spawn test with commit
> eff6c8ce8d4d ("sched/core: Reduce cost of sched_move_task when config
> autogroup") on a m6g.xlarge AWS EC2 instance with 4 vCPUs and 16 GiB RAM
> (aarch64) (single level MC sched domain) [1].
> 
> There is an early bail from sched_move_task() if p->sched_task_group is
> equal to p's 'cpu cgroup' (sched_get_task_group()). E.g. both are
> pointing to taskgroup '/user.slice/user-1000.slice/session-1.scope'
> (Ubuntu '22.04.5 LTS').
> 
> So in:
> 
>   do_exit()
> 
>     sched_autogroup_exit_task()
> 
>       sched_move_task()
> 
>         if sched_get_task_group(p) == p->sched_task_group
>           return
> 
>         /* p is enqueued */
>         dequeue_task()              \
>         sched_change_group()        |
>           task_change_group_fair()  |
>             detach_task_cfs_rq()    |                              (1)
>             set_task_rq()           |
>             attach_task_cfs_rq()    |
>         enqueue_task()              /
> 
> (1) isn't called for p anymore.
> 
> Turns out that the regression is related to sgs->group_util in
> group_is_overloaded() and group_has_capacity(). If (1) isn't called for
> all the 'spawn' tasks then sgs->group_util is ~900 and
> sgs->group_capacity = 1024 (single CPU sched domain) and this leads to
> group_is_overloaded() returning true (2) and group_has_capacity() false
> (3) much more often compared to the case when (1) is called.
> 
> I.e. there are much more cases of 'group_is_overloaded' and
> 'group_fully_busy' in WF_FORK wakeup sched_balance_find_dst_cpu() which
> then returns much more often a CPU != smp_processor_id() (5).
> 
> This isn't good for these extremely short running tasks (FORK + EXIT)
> and also involves calling sched_balance_find_dst_group_cpu() unnecessary
> (single CPU sched domain).
> 
> Instead if (1) is called for 'p->flags & PF_EXITING' then the path
> (4),(6) is taken much more often.
> 
>   select_task_rq_fair(..., wake_flags = WF_FORK)
> 
>     cpu = smp_processor_id()
> 
>     new_cpu = sched_balance_find_dst_cpu(..., cpu, ...)
> 
>       group = sched_balance_find_dst_group(..., cpu)
> 
>         do {
> 
>           update_sg_wakeup_stats()
> 
>             sgs->group_type = group_classify()
> 
>               if group_is_overloaded()                             (2)
>                 return group_overloaded
> 
>               if !group_has_capacity()                             (3)
>                 return group_fully_busy
> 
>               return group_has_spare                               (4)
> 
>         } while group
> 
>         if local_sgs.group_type > idlest_sgs.group_type
>           return idlest                                            (5)
> 
>         case group_has_spare:
> 
>           if local_sgs.idle_cpus >= idlest_sgs.idle_cpus
>             return NULL                                            (6)
> 
> Unixbench Tests './Run -c 4 spawn' on:
> 
> (a) VM AWS instance (m7gd.16xlarge) with v6.13 ('maxcpus=4 nr_cpus=4')
>     and Ubuntu 22.04.5 LTS (aarch64).
> 
>     Shell & test run in '/user.slice/user-1000.slice/session-1.scope'.
> 
>     w/o patch	w/ patch
>     21005	27120
> 
> (b) i7-13700K with tip/sched/core ('nosmt maxcpus=8 nr_cpus=8') and
>     Ubuntu 22.04.5 LTS (x86_64).
> 
>     Shell & test run in '/A'.
> 
>     w/o patch	w/ patch
>     67675	88806
> 
> CONFIG_SCHED_AUTOGROUP=y & /sys/proc/kernel/sched_autogroup_enabled equal
> 0 or 1.
> 
> [1] https://lkml.kernel.org/r/20250205151026.13061-1-hagarhem@amazon.com
> 
> Reported-by: Hagar Hemdan <hagarhem@amazon.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b00f884701a6..ca0e3c2eb94a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9064,7 +9064,7 @@ void sched_move_task(struct task_struct *tsk)
>  	 * group changes.
>  	 */
>  	group = sched_get_task_group(tsk);
> -	if (group == tsk->sched_task_group)
> +	if ((group == tsk->sched_task_group) && !(tsk->flags & PF_EXITING))
>  		return;
>  
>  	update_rq_clock(rq);
> -- 
> 2.34.1
>

Thank you very much for submitting the fix and for all the explanations.

Could you please add a "Fixes:" tag for commit eff6c8ce8d4d to your patch,
so that it gets backported to stable 6.12?
Also, this was actually discovered internally by <abuehaze@amazon>, so
please add
Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com> and
Tested-by: Hagar Hemdan <hagarhem@amazon.com>.

Thanks,
Hagar