[PATCH 0/11 v3] sched/fair: Fix statistics with delayed dequeue

Vincent Guittot posted 11 patches 3 weeks, 3 days ago
kernel/sched/core.c  |   4 +-
kernel/sched/debug.c |  14 ++-
kernel/sched/fair.c  | 240 ++++++++++++++++++++++++-------------------
kernel/sched/pelt.c  |   4 +-
kernel/sched/sched.h |  12 +--
5 files changed, 153 insertions(+), 121 deletions(-)
[PATCH 0/11 v3] sched/fair: Fix statistics with delayed dequeue
Posted by Vincent Guittot 3 weeks, 3 days ago
Delayed dequeued feature keeps a sleeping sched_entitiy enqueued until its
lag has elapsed. As a result, it stays also visible in the statistics that
are used to balance the system and in particular the field h_nr_running.

This serie fixes those metrics by creating a new h_nr_runnable that tracks
only tasks that want to run. It renames h_nr_running into h_nr_runnable.

h_nr_runnable is used in several places to make decision on load balance:
  - PELT runnable_avg
  - deciding if a group is overloaded or has spare capacity
  - numa stats
  - reduced capacity management
  - load balance between groups

While fixing h_nr_running, some fields have been renamed to follow the
same pattern. We now have:
  - cfs.h_nr_runnable : running tasks in the hierarchy
  - cfs.h_nr_queued : enqueued tasks in the hierarchy either running or
      delayed dequeue
  - cfs.h_nr_idle : enqueued sched idle tasks in the hierarchy

cfs.nr_running has been rename cfs.nr_queued because it includes the
delayed dequeued entities

The unused cfs.idle_nr_running has been removed

Load balance compares the number of running tasks when selecting the
busiest group or runqueue and tries to migrate a runnable task and not a
sleeping delayed dequeue one. delayed dequeue tasks are considered only
when migrating load as they continue to impact it.

It should be noticed that this serie doesn't fix the problem of delayed
dequeued tasks that can't migrate at wakeup.

Some additional cleanups have been added:
  - move variable declaration at the beginning of pick_next_entity()
    and dequeue_entity() 
  - sched_can_stop_tick() should use cfs.h_nr_queued instead of
    cfs.nr_queued (previously cfs.nr_running) to know how many tasks
    are running in the whole hierarchy instead of how many entities at
    root level

Changes since v2:
- Fix h_nr_runnable after removing h_nr_delayed (reported by Mike and Prateek)
- Move "sched/fair: Fix sched_can_stop_tick() for fair tasks" at the
  beginning of the series so it can be easily backported (asked by Prateek)
- Split "sched/fair: Add new cfs_rq.h_nr_runnable" in 2 patches. One
  for the creation of h_nr_runnable and one for its use (asked by Peter)
- Fix more variable declarations (reported Prateek)


Changes since v1:
- reorder the patches
- rename fields into:
  - h_nr_queued for all tasks queued both runnable and delayed dequeue
  - h_nr_runnable for all runnable tasks
  - h_nr_idle for all tasks with sched_idle policy
- Cleanup how h_nr_runnable is updated in enqueue_task_fair() and
  dequeue_entities

Peter Zijlstra (1):
  sched/eevdf: More PELT vs DELAYED_DEQUEUE

Vincent Guittot (10):
  sched/fair: Fix sched_can_stop_tick() for fair tasks
  sched/fair: Rename h_nr_running into h_nr_queued
  sched/fair: Add new cfs_rq.h_nr_runnable
  sched/fair: Use the new cfs_rq.h_nr_runnable
  sched/fair: Removed unsued cfs_rq.h_nr_delayed
  sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle
  sched/fair: Remove unused cfs_rq.idle_nr_running
  sched/fair: Rename cfs_rq.nr_running into nr_queued
  sched/fair: Do not try to migrate delayed dequeue task
  sched/fair: Fix variable declaration position

 kernel/sched/core.c  |   4 +-
 kernel/sched/debug.c |  14 ++-
 kernel/sched/fair.c  | 240 ++++++++++++++++++++++++-------------------
 kernel/sched/pelt.c  |   4 +-
 kernel/sched/sched.h |  12 +--
 5 files changed, 153 insertions(+), 121 deletions(-)

-- 
2.43.0
Re: [PATCH 0/11 v3] sched/fair: Fix statistics with delayed dequeue
Posted by Luis Machado 3 weeks, 1 day ago
On 12/2/24 17:45, Vincent Guittot wrote:
> Delayed dequeued feature keeps a sleeping sched_entitiy enqueued until its
> lag has elapsed. As a result, it stays also visible in the statistics that
> are used to balance the system and in particular the field h_nr_running.
> 
> This serie fixes those metrics by creating a new h_nr_runnable that tracks
> only tasks that want to run. It renames h_nr_running into h_nr_runnable.
> 
> h_nr_runnable is used in several places to make decision on load balance:
>   - PELT runnable_avg
>   - deciding if a group is overloaded or has spare capacity
>   - numa stats
>   - reduced capacity management
>   - load balance between groups
> 
> While fixing h_nr_running, some fields have been renamed to follow the
> same pattern. We now have:
>   - cfs.h_nr_runnable : running tasks in the hierarchy
>   - cfs.h_nr_queued : enqueued tasks in the hierarchy either running or
>       delayed dequeue
>   - cfs.h_nr_idle : enqueued sched idle tasks in the hierarchy
> 
> cfs.nr_running has been rename cfs.nr_queued because it includes the
> delayed dequeued entities
> 
> The unused cfs.idle_nr_running has been removed
> 
> Load balance compares the number of running tasks when selecting the
> busiest group or runqueue and tries to migrate a runnable task and not a
> sleeping delayed dequeue one. delayed dequeue tasks are considered only
> when migrating load as they continue to impact it.
> 
> It should be noticed that this serie doesn't fix the problem of delayed
> dequeued tasks that can't migrate at wakeup.
> 
> Some additional cleanups have been added:
>   - move variable declaration at the beginning of pick_next_entity()
>     and dequeue_entity() 
>   - sched_can_stop_tick() should use cfs.h_nr_queued instead of
>     cfs.nr_queued (previously cfs.nr_running) to know how many tasks
>     are running in the whole hierarchy instead of how many entities at
>     root level
> 
> Changes since v2:
> - Fix h_nr_runnable after removing h_nr_delayed (reported by Mike and Prateek)
> - Move "sched/fair: Fix sched_can_stop_tick() for fair tasks" at the
>   beginning of the series so it can be easily backported (asked by Prateek)
> - Split "sched/fair: Add new cfs_rq.h_nr_runnable" in 2 patches. One
>   for the creation of h_nr_runnable and one for its use (asked by Peter)
> - Fix more variable declarations (reported Prateek)
> 
> 
> Changes since v1:
> - reorder the patches
> - rename fields into:
>   - h_nr_queued for all tasks queued both runnable and delayed dequeue
>   - h_nr_runnable for all runnable tasks
>   - h_nr_idle for all tasks with sched_idle policy
> - Cleanup how h_nr_runnable is updated in enqueue_task_fair() and
>   dequeue_entities
> 
> Peter Zijlstra (1):
>   sched/eevdf: More PELT vs DELAYED_DEQUEUE
> 
> Vincent Guittot (10):
>   sched/fair: Fix sched_can_stop_tick() for fair tasks
>   sched/fair: Rename h_nr_running into h_nr_queued
>   sched/fair: Add new cfs_rq.h_nr_runnable
>   sched/fair: Use the new cfs_rq.h_nr_runnable
>   sched/fair: Removed unsued cfs_rq.h_nr_delayed
>   sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle
>   sched/fair: Remove unused cfs_rq.idle_nr_running
>   sched/fair: Rename cfs_rq.nr_running into nr_queued
>   sched/fair: Do not try to migrate delayed dequeue task
>   sched/fair: Fix variable declaration position
> 
>  kernel/sched/core.c  |   4 +-
>  kernel/sched/debug.c |  14 ++-
>  kernel/sched/fair.c  | 240 ++++++++++++++++++++++++-------------------
>  kernel/sched/pelt.c  |   4 +-
>  kernel/sched/sched.h |  12 +--
>  5 files changed, 153 insertions(+), 121 deletions(-)
> 

Sorry, it took me a bit to work the backports for 6.8/Android. I gave this series a try
on the Pixel 6 with a couple workloads (Speedometer and a UI-based one) and I didn't
spot any significant performance/frame metric differences or power metric differences compared
to a 6.8 kernel with this series applied.

Also, checking the statistics for the cores didn't show any numbers that are obviously wrong.

Though the kernel version is a bit behind mainline, I hope that's useful.

Tested-By: Luis Machado <luis.machado@arm.com>
Re: [PATCH 0/11 v3] sched/fair: Fix statistics with delayed dequeue
Posted by Vincent Guittot 3 weeks ago
On Wed, 4 Dec 2024 at 20:10, Luis Machado <luis.machado@arm.com> wrote:
>
> On 12/2/24 17:45, Vincent Guittot wrote:
> > Delayed dequeued feature keeps a sleeping sched_entitiy enqueued until its
> > lag has elapsed. As a result, it stays also visible in the statistics that
> > are used to balance the system and in particular the field h_nr_running.
> >
> > This serie fixes those metrics by creating a new h_nr_runnable that tracks
> > only tasks that want to run. It renames h_nr_running into h_nr_runnable.
> >
> > h_nr_runnable is used in several places to make decision on load balance:
> >   - PELT runnable_avg
> >   - deciding if a group is overloaded or has spare capacity
> >   - numa stats
> >   - reduced capacity management
> >   - load balance between groups
> >
> > While fixing h_nr_running, some fields have been renamed to follow the
> > same pattern. We now have:
> >   - cfs.h_nr_runnable : running tasks in the hierarchy
> >   - cfs.h_nr_queued : enqueued tasks in the hierarchy either running or
> >       delayed dequeue
> >   - cfs.h_nr_idle : enqueued sched idle tasks in the hierarchy
> >
> > cfs.nr_running has been rename cfs.nr_queued because it includes the
> > delayed dequeued entities
> >
> > The unused cfs.idle_nr_running has been removed
> >
> > Load balance compares the number of running tasks when selecting the
> > busiest group or runqueue and tries to migrate a runnable task and not a
> > sleeping delayed dequeue one. delayed dequeue tasks are considered only
> > when migrating load as they continue to impact it.
> >
> > It should be noticed that this serie doesn't fix the problem of delayed
> > dequeued tasks that can't migrate at wakeup.
> >
> > Some additional cleanups have been added:
> >   - move variable declaration at the beginning of pick_next_entity()
> >     and dequeue_entity()
> >   - sched_can_stop_tick() should use cfs.h_nr_queued instead of
> >     cfs.nr_queued (previously cfs.nr_running) to know how many tasks
> >     are running in the whole hierarchy instead of how many entities at
> >     root level
> >
> > Changes since v2:
> > - Fix h_nr_runnable after removing h_nr_delayed (reported by Mike and Prateek)
> > - Move "sched/fair: Fix sched_can_stop_tick() for fair tasks" at the
> >   beginning of the series so it can be easily backported (asked by Prateek)
> > - Split "sched/fair: Add new cfs_rq.h_nr_runnable" in 2 patches. One
> >   for the creation of h_nr_runnable and one for its use (asked by Peter)
> > - Fix more variable declarations (reported Prateek)
> >
> >
> > Changes since v1:
> > - reorder the patches
> > - rename fields into:
> >   - h_nr_queued for all tasks queued both runnable and delayed dequeue
> >   - h_nr_runnable for all runnable tasks
> >   - h_nr_idle for all tasks with sched_idle policy
> > - Cleanup how h_nr_runnable is updated in enqueue_task_fair() and
> >   dequeue_entities
> >
> > Peter Zijlstra (1):
> >   sched/eevdf: More PELT vs DELAYED_DEQUEUE
> >
> > Vincent Guittot (10):
> >   sched/fair: Fix sched_can_stop_tick() for fair tasks
> >   sched/fair: Rename h_nr_running into h_nr_queued
> >   sched/fair: Add new cfs_rq.h_nr_runnable
> >   sched/fair: Use the new cfs_rq.h_nr_runnable
> >   sched/fair: Removed unsued cfs_rq.h_nr_delayed
> >   sched/fair: Rename cfs_rq.idle_h_nr_running into h_nr_idle
> >   sched/fair: Remove unused cfs_rq.idle_nr_running
> >   sched/fair: Rename cfs_rq.nr_running into nr_queued
> >   sched/fair: Do not try to migrate delayed dequeue task
> >   sched/fair: Fix variable declaration position
> >
> >  kernel/sched/core.c  |   4 +-
> >  kernel/sched/debug.c |  14 ++-
> >  kernel/sched/fair.c  | 240 ++++++++++++++++++++++++-------------------
> >  kernel/sched/pelt.c  |   4 +-
> >  kernel/sched/sched.h |  12 +--
> >  5 files changed, 153 insertions(+), 121 deletions(-)
> >
>
> Sorry, it took me a bit to work the backports for 6.8/Android. I gave this series a try
> on the Pixel 6 with a couple workloads (Speedometer and a UI-based one) and I didn't
> spot any significant performance/frame metric differences or power metric differences compared
> to a 6.8 kernel with this series applied.
>
> Also, checking the statistics for the cores didn't show any numbers that are obviously wrong.
>
> Though the kernel version is a bit behind mainline, I hope that's useful.
>
> Tested-By: Luis Machado <luis.machado@arm.com>

Thanks
Re: [PATCH 0/11 v3] sched/fair: Fix statistics with delayed dequeue
Posted by Dietmar Eggemann 3 weeks, 2 days ago
On 02/12/2024 18:45, Vincent Guittot wrote:
> Delayed dequeued feature keeps a sleeping sched_entitiy enqueued until its
> lag has elapsed. As a result, it stays also visible in the statistics that
> are used to balance the system and in particular the field h_nr_running.
> 
> This serie fixes those metrics by creating a new h_nr_runnable that tracks
> only tasks that want to run. It renames h_nr_running into h_nr_runnable.
> 
> h_nr_runnable is used in several places to make decision on load balance:
>   - PELT runnable_avg
>   - deciding if a group is overloaded or has spare capacity
>   - numa stats
>   - reduced capacity management
>   - load balance between groups
> 
> While fixing h_nr_running, some fields have been renamed to follow the
> same pattern. We now have:
>   - cfs.h_nr_runnable : running tasks in the hierarchy
>   - cfs.h_nr_queued : enqueued tasks in the hierarchy either running or
>       delayed dequeue
>   - cfs.h_nr_idle : enqueued sched idle tasks in the hierarchy
> 
> cfs.nr_running has been rename cfs.nr_queued because it includes the
> delayed dequeued entities
> 
> The unused cfs.idle_nr_running has been removed
> 
> Load balance compares the number of running tasks when selecting the
> busiest group or runqueue and tries to migrate a runnable task and not a
> sleeping delayed dequeue one. delayed dequeue tasks are considered only
> when migrating load as they continue to impact it.
> 
> It should be noticed that this serie doesn't fix the problem of delayed
> dequeued tasks that can't migrate at wakeup.
> 
> Some additional cleanups have been added:
>   - move variable declaration at the beginning of pick_next_entity()
>     and dequeue_entity() 
>   - sched_can_stop_tick() should use cfs.h_nr_queued instead of
>     cfs.nr_queued (previously cfs.nr_running) to know how many tasks
>     are running in the whole hierarchy instead of how many entities at
>     root level
> 
> Changes since v2:
> - Fix h_nr_runnable after removing h_nr_delayed (reported by Mike and Prateek)
> - Move "sched/fair: Fix sched_can_stop_tick() for fair tasks" at the
>   beginning of the series so it can be easily backported (asked by Prateek)
> - Split "sched/fair: Add new cfs_rq.h_nr_runnable" in 2 patches. One
>   for the creation of h_nr_runnable and one for its use (asked by Peter)
> - Fix more variable declarations (reported Prateek)

with the following nits:

(1) 01/11

    Proposed 'Fixes:' missing:
    https://lkml.kernel.org/r/c82ed217-cfe4-41a4-b39a-e7356231835f@amd.com

(2) 08/11

    Would be helpful to point out that we lost the only use case for 
    'cfs_rq->idle_nr_running' with the advent of EEVDF with:

    5e963f2bd465 - sched/fair: Commit to EEVDF

(3) Using nr_running on rq/rt_rq/dl_rq and nr_queued 
    for cfs_rq might look strange to the untrained eye.

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

[...]
Re: [PATCH 0/11 v3] sched/fair: Fix statistics with delayed dequeue
Posted by Peter Zijlstra 3 weeks, 2 days ago
On Tue, Dec 03, 2024 at 11:41:47AM +0100, Dietmar Eggemann wrote:

> with the following nits:
> 
> (1) 01/11
> 
>     Proposed 'Fixes:' missing:
>     https://lkml.kernel.org/r/c82ed217-cfe4-41a4-b39a-e7356231835f@amd.com
> 
> (2) 08/11
> 
>     Would be helpful to point out that we lost the only use case for 
>     'cfs_rq->idle_nr_running' with the advent of EEVDF with:
> 
>     5e963f2bd465 - sched/fair: Commit to EEVDF

Done.

> (3) Using nr_running on rq/rt_rq/dl_rq and nr_queued 
>     for cfs_rq might look strange to the untrained eye.

Yes, but keeping nr_running with new semantics would not be less
confusing and potentially more dangerous.

> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

Thanks!
Re: [PATCH 0/11 v3] sched/fair: Fix statistics with delayed dequeue
Posted by Vincent Guittot 3 weeks, 2 days ago
On Tue, 3 Dec 2024 at 12:18, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Dec 03, 2024 at 11:41:47AM +0100, Dietmar Eggemann wrote:
>
> > with the following nits:
> >
> > (1) 01/11
> >
> >     Proposed 'Fixes:' missing:
> >     https://lkml.kernel.org/r/c82ed217-cfe4-41a4-b39a-e7356231835f@amd.com

Yeah i  forgot to add the fix tag
Fixes: 11cc374f4643b ("sched_ext: Simplify scx_can_stop_tick()
invocation in sched_can_stop_tick()")

> >
> > (2) 08/11
> >
> >     Would be helpful to point out that we lost the only use case for
> >     'cfs_rq->idle_nr_running' with the advent of EEVDF with:
> >
> >     5e963f2bd465 - sched/fair: Commit to EEVDF
>
> Done.
>
> > (3) Using nr_running on rq/rt_rq/dl_rq and nr_queued
> >     for cfs_rq might look strange to the untrained eye.
>
> Yes, but keeping nr_running with new semantics would not be less
> confusing and potentially more dangerous.
>
> > Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>
> Thanks!