[PATCH v3 00/10] Add a deadline server for sched_ext tasks

Joel Fernandes posted 10 patches 3 months, 4 weeks ago
There is a newer version of this series
[PATCH v3 00/10] Add a deadline server for sched_ext tasks
Posted by Joel Fernandes 3 months, 4 weeks ago
sched_ext tasks are currently starved by RT hogs, especially since RT
throttling was replaced by deadline servers that boost only CFS tasks. Several
users in the community have reported RT tasks stalling sched_ext tasks.
Add a sched_ext deadline server as well, so that sched_ext tasks are also
boosted and do not starve.

A kselftest is also provided to verify the starvation issues are now fixed.

Btw, there is still something funky going on with CPU hotplug and the
relinquish patch. Sometimes sched_ext's hotplug self-test locks up
(./runner -t hotplug). Reverting that patch fixes it, so I suspect
something is off in dl_server_remove_params() when it is called on
offline CPUs.

v2->v3:
 - Removed code duplication in debugfs. Made ext interface separate.
 - Fixed issue where rq_lock_irqsave was not used in the relinquish patch.
 - Fixed running bw accounting issue in dl_server_remove_params.

Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/
Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/

Andrea Righi (1):
  selftests/sched_ext: Add test for sched_ext dl_server

Joel Fernandes (9):
  sched/debug: Fix updating of ppos on server write ops
  sched/debug: Stop and start server based on if it was active
  sched/deadline: Clear the defer params
  sched: Add support to pick functions to take rf
  sched: Add a server arg to dl_server_update_idle_time()
  sched/ext: Add a DL server for sched_ext tasks
  sched/debug: Add support to change sched_ext server params
  sched/deadline: Add support to remove DL server bandwidth
  sched/ext: Relinquish DL server reservations when not needed

 include/linux/sched.h                         |   2 +-
 kernel/sched/core.c                           |  19 +-
 kernel/sched/deadline.c                       |  78 +++++--
 kernel/sched/debug.c                          | 171 +++++++++++---
 kernel/sched/ext.c                            | 108 ++++++++-
 kernel/sched/fair.c                           |  15 +-
 kernel/sched/idle.c                           |   4 +-
 kernel/sched/rt.c                             |   2 +-
 kernel/sched/sched.h                          |  13 +-
 kernel/sched/stop_task.c                      |   2 +-
 tools/testing/selftests/sched_ext/Makefile    |   1 +
 .../selftests/sched_ext/rt_stall.bpf.c        |  23 ++
 tools/testing/selftests/sched_ext/rt_stall.c  | 213 ++++++++++++++++++
 13 files changed, 579 insertions(+), 72 deletions(-)
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
 create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c

-- 
2.34.1
Re: [PATCH v3 00/10] Add a deadline server for sched_ext tasks
Posted by Joel Fernandes 3 months, 4 weeks ago

On 6/13/2025 1:17 AM, Joel Fernandes wrote:
> sched_ext tasks currently are starved by RT hoggers especially since RT
> throttling was replaced by deadline servers to boost only CFS tasks. Several
> users in the community have reported issues with RT stalling sched_ext tasks.
> Add a sched_ext deadline server as well so that sched_ext tasks are also
> boosted and do not suffer starvation.
> 
> A kselftest is also provided to verify the starvation issues are now fixed.
> 
> Btw, there is still something funky going on with CPU hotplug and the
> relinquish patch. Sometimes the sched_ext's hotplug self-test locks up
> (./runner -t hotplug). Reverting that patch fixes it, so I am suspecting
> something is off in dl_server_remove_params() when it is being called on
> offline CPUs.

I think I got somewhere here with this sched_ext hotplug test but still not
there yet. Juri, Andrea, Tejun, can you take a look at the below when you get a
chance?

In the hotplug test, when the CPU is brought online, I see the following warning
fire [1]. Basically, dl_server_apply_params() fails with -EBUSY due to overflow
checks.

@@ -1657,8 +1657,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
                u64 runtime =  50 * NSEC_PER_MSEC;
                u64 period = 1000 * NSEC_PER_MSEC;

-               dl_server_apply_params(dl_se, runtime, period, 1);
-
+               WARN_ON_ONCE(dl_server_apply_params(dl_se, runtime, period, 1));
                dl_se->dl_server = 1;
                dl_se->dl_defer = 1;
                setup_new_dl_entity(dl_se);

I dug deeper, and it seems CPU 1 was previously brought offline and then back
online, and the warning fired during *that onlining*. During the onlining,
enqueue_task_scx() -> dl_server_start() was called but dl_server_apply_params()
returned -EBUSY.

In dl_server_apply_params() -> __dl_overflow(), it appears dl_bw_cpus()=0 and
cap=0. That is really odd and is probably the reason for the warning. Is that
because the CPU was offlined earlier and is not yet attached to the root domain?
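
To make the numbers in the warning [1] concrete, here is a minimal userspace
sketch of the overflow arithmetic. It is a simplified model, not the kernel
code: dl_overflow_model() is a hypothetical name, the helper bodies are written
from how to_ratio() and cap_scale() are generally understood to work (a
BW_SHIFT of 20 and a SCHED_CAPACITY_SHIFT of 10 are assumed), and the kernel's
special case for an unlimited dl_b->bw is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BW_SHIFT 20              /* assumed fixed-point shift for bandwidth ratios */
#define SCHED_CAPACITY_SHIFT 10  /* assumed: one full-capacity CPU is 1 << 10 */
#define NSEC_PER_MSEC 1000000ULL

/* Bandwidth runtime/period as a fixed-point fraction of one CPU. */
static inline uint64_t to_ratio(uint64_t period, uint64_t runtime)
{
	assert(period != 0);
	return (runtime << BW_SHIFT) / period;
}

/* Scale a per-CPU bandwidth limit by the summed capacity of usable CPUs. */
static inline uint64_t cap_scale(uint64_t bw, uint64_t cap)
{
	return (bw * cap) >> SCHED_CAPACITY_SHIFT;
}

/*
 * Hypothetical model of the admission check: returns true when the
 * request does not fit and must be rejected (the -EBUSY case).
 */
static inline bool dl_overflow_model(uint64_t bw, uint64_t cap,
				     uint64_t total_bw, uint64_t old_bw,
				     uint64_t new_bw)
{
	return cap_scale(bw, cap) < total_bw - old_bw + new_bw;
}
```

With the 50 ms / 1000 ms defaults from dl_server_start(), to_ratio() yields
52428, which matches new_bw in the log. Plugging in the logged values
(bw=996147, cap=0, total_bw=0, old_bw=0) the scaled limit is 0, so any
non-zero request is rejected. With a single full-capacity CPU visible
(cap=1024) the same request would fit.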

The remaining questions are why this happens only when my
dl_server_remove_params() is called and not otherwise, and why on earth
dl_bw_cpus() is returning 0. There are at least 2 other CPUs online at the time.

Anyway, other than this mystery, I fixed all other bandwidth-related warnings
due to dl_server_remove_params(); the updated patch is linked below [2].

[1] Warning:

[   11.878005] DL server bandwidth overflow on CPU 1: dl_b->bw=996147, cap=0, total_bw=0, old_bw=0, new_bw=52428, dl_bw_cpus=0
[   11.878356] ------------[ cut here ]------------
[   11.878528] WARNING: CPU: 0 PID: 145 at kernel/sched/deadline.c:1670 dl_server_start+0x96/0xa0
[   11.879400] Sched_ext: hotplug_cbs (enabled+all), task: runnable_at=+0ms
[   11.879404] RIP: 0010:dl_server_start+0x96/0xa0
[   11.879732] Code: 53 10 75 1d 49 8b 86 10 0c 00 00 48 8b
[   11.882510] Call Trace:
[   11.882592]  <TASK>
[   11.882685]  enqueue_task_scx+0x190/0x280
[   11.882802]  ttwu_do_activate+0xaa/0x2a0
[   11.882925]  try_to_wake_up+0x371/0x600
[   11.883047]  cpuhp_bringup_ap+0xd6/0x170
[   11.883172]  cpuhp_invoke_callback+0x142/0x540
[   11.883327]  _cpu_up+0x15b/0x270
[   11.883450]  cpu_up+0x52/0xb0
[   11.883576]  cpu_subsys_online+0x32/0x120
[   11.883704]  online_store+0x98/0x130
[   11.883824]  kernfs_fop_write_iter+0xeb/0x170
[   11.883972]  vfs_write+0x2c7/0x430
[   11.884091]  ksys_write+0x70/0xe0
[   11.884209]  do_syscall_64+0xd6/0x250
[   11.884327]  ? clear_bhb_loop+0x40/0x90
[   11.884443]  entry_SYSCALL_64_after_hwframe+0x77/0x7f


[2]: Updated patch "sched/ext: Relinquish DL server reservations when not needed":
https://git.kernel.org/pub/scm/linux/kernel/git/jfern/linux.git/commit/?h=sched/scx-dlserver-boost-rebase&id=56581c2a6bb8e78593df80ad47520a8399055eae

thanks,

 - Joel


> 
> v2->v3:
>  - Removed code duplication in debugfs. Made ext interface separate.
>  - Fixed issue where rq_lock_irqsave was not used in the relinquish patch.
>  - Fixed running bw accounting issue in dl_server_remove_params.
> 
> Link to v1: https://lore.kernel.org/all/20250315022158.2354454-1-joelagnelf@nvidia.com/
> Link to v2: https://lore.kernel.org/all/20250602180110.816225-1-joelagnelf@nvidia.com/
> 
> Andrea Righi (1):
>   selftests/sched_ext: Add test for sched_ext dl_server
> 
> Joel Fernandes (9):
>   sched/debug: Fix updating of ppos on server write ops
>   sched/debug: Stop and start server based on if it was active
>   sched/deadline: Clear the defer params
>   sched: Add support to pick functions to take rf
>   sched: Add a server arg to dl_server_update_idle_time()
>   sched/ext: Add a DL server for sched_ext tasks
>   sched/debug: Add support to change sched_ext server params
>   sched/deadline: Add support to remove DL server bandwidth
>   sched/ext: Relinquish DL server reservations when not needed
> 
>  include/linux/sched.h                         |   2 +-
>  kernel/sched/core.c                           |  19 +-
>  kernel/sched/deadline.c                       |  78 +++++--
>  kernel/sched/debug.c                          | 171 +++++++++++---
>  kernel/sched/ext.c                            | 108 ++++++++-
>  kernel/sched/fair.c                           |  15 +-
>  kernel/sched/idle.c                           |   4 +-
>  kernel/sched/rt.c                             |   2 +-
>  kernel/sched/sched.h                          |  13 +-
>  kernel/sched/stop_task.c                      |   2 +-
>  tools/testing/selftests/sched_ext/Makefile    |   1 +
>  .../selftests/sched_ext/rt_stall.bpf.c        |  23 ++
>  tools/testing/selftests/sched_ext/rt_stall.c  | 213 ++++++++++++++++++
>  13 files changed, 579 insertions(+), 72 deletions(-)
>  create mode 100644 tools/testing/selftests/sched_ext/rt_stall.bpf.c
>  create mode 100644 tools/testing/selftests/sched_ext/rt_stall.c
>
Re: [PATCH v3 00/10] Add a deadline server for sched_ext tasks
Posted by Joel Fernandes 3 months, 4 weeks ago

On 6/13/2025 1:35 PM, Joel Fernandes wrote:
> 
> 
> On 6/13/2025 1:17 AM, Joel Fernandes wrote:
>> sched_ext tasks currently are starved by RT hoggers especially since RT
>> throttling was replaced by deadline servers to boost only CFS tasks. Several
>> users in the community have reported issues with RT stalling sched_ext tasks.
>> Add a sched_ext deadline server as well so that sched_ext tasks are also
>> boosted and do not suffer starvation.
>>
>> A kselftest is also provided to verify the starvation issues are now fixed.
>>
>> Btw, there is still something funky going on with CPU hotplug and the
>> relinquish patch. Sometimes the sched_ext's hotplug self-test locks up
>> (./runner -t hotplug). Reverting that patch fixes it, so I am suspecting
>> something is off in dl_server_remove_params() when it is being called on
>> offline CPUs.
> 
> I think I got somewhere here with this sched_ext hotplug test but still not
> there yet. Juri, Andrea, Tejun, can you take a look at the below when you get a
> chance?

The following patch makes the sched_ext hotplug test reliably pass for me now.
Thoughts?

From: Joel Fernandes <joelagnelf@nvidia.com>
Subject: [PATCH] sched/deadline: Prevent setting server as started if params
 couldn't be applied

In the call trace below, dl_server_apply_params() fails because
dl_bw_cpus() is 0 during CPU onlining.

[   11.878356] ------------[ cut here ]------------
[   11.882592]  <TASK>
[   11.882685]  enqueue_task_scx+0x190/0x280
[   11.882802]  ttwu_do_activate+0xaa/0x2a0
[   11.882925]  try_to_wake_up+0x371/0x600
[   11.883047]  cpuhp_bringup_ap+0xd6/0x170
[   11.883172]  cpuhp_invoke_callback+0x142/0x540
[   11.883327]  _cpu_up+0x15b/0x270
[   11.883450]  cpu_up+0x52/0xb0
[   11.883576]  cpu_subsys_online+0x32/0x120
[   11.883704]  online_store+0x98/0x130
[   11.883824]  kernfs_fop_write_iter+0xeb/0x170
[   11.883972]  vfs_write+0x2c7/0x430
[   11.884091]  ksys_write+0x70/0xe0
[   11.884209]  do_syscall_64+0xd6/0x250
[   11.884327]  ? clear_bhb_loop+0x40/0x90
[   11.884443]  entry_SYSCALL_64_after_hwframe+0x77/0x7f

It seems too early to start the server. Simply defer the starting of the
server to the next enqueue if dl_server_apply_params() returns an error.
In any case, we should not pretend that the server started, and doing so
does seem to break the sched_ext CPU hotplug test.

With this, the sched_ext hotplug test reliably passes.

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/deadline.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index f0cd1dbca4b8..8dd0c6d71489 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1657,8 +1657,8 @@ void dl_server_start(struct sched_dl_entity *dl_se)
                u64 runtime =  50 * NSEC_PER_MSEC;
                u64 period = 1000 * NSEC_PER_MSEC;

-               dl_server_apply_params(dl_se, runtime, period, 1);
-
+               if (dl_server_apply_params(dl_se, runtime, period, 1))
+                       return;
                dl_se->dl_server = 1;
                dl_se->dl_defer = 1;
                setup_new_dl_entity(dl_se);
@@ -1675,7 +1675,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)

 void dl_server_stop(struct sched_dl_entity *dl_se)
 {
-       if (!dl_se->dl_runtime)
+       if (!dl_se->dl_runtime || !dl_se->dl_server_active)
                return;

        dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);
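
The intent of the two hunks above can be sketched as a small userspace state
machine. This is only an illustrative model under assumptions: struct
server_model and the *_model() functions are hypothetical names, the
`initialized` flag stands in for dl_se->dl_server, apply_params_model() simply
fails while no capacity is visible (mimicking the dl_bw_cpus()=0 case), and
the real dl_server_start()/dl_server_stop() do considerably more.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

struct server_model {
	uint64_t dl_runtime;  /* 0 until parameters have been applied */
	bool initialized;     /* stands in for dl_se->dl_server */
	bool active;          /* stands in for dl_se->dl_server_active */
	int dequeues;         /* counts how often stop really dequeued */
};

/* Hypothetical stand-in for dl_server_apply_params(): fails while cap == 0. */
static inline int apply_params_model(struct server_model *s, uint64_t runtime,
				     uint64_t cap)
{
	if (cap == 0)
		return -EBUSY;
	s->dl_runtime = runtime;
	return 0;
}

static inline void server_start_model(struct server_model *s, uint64_t cap)
{
	assert(s);
	if (!s->initialized) {
		/* The fix: bail out here so the next enqueue can retry. */
		if (apply_params_model(s, 50 * NSEC_PER_MSEC, cap))
			return;
		s->initialized = true;
	}
	if (!s->dl_runtime)
		return;
	s->active = true;
}

static inline void server_stop_model(struct server_model *s)
{
	assert(s);
	/* The fix: never dequeue a server that was not actually started. */
	if (!s->dl_runtime || !s->active)
		return;
	s->dequeues++;
	s->active = false;
}
```

In this model a start attempt with cap=0 leaves the server untouched, a later
attempt with non-zero capacity succeeds, and a stop without a matching start
is a no-op rather than an unbalanced dequeue.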
Re: [PATCH v3 00/10] Add a deadline server for sched_ext tasks
Posted by Andrea Righi 3 months, 4 weeks ago
Hi Joel,

On Fri, Jun 13, 2025 at 02:05:03PM -0400, Joel Fernandes wrote:
> 
> 
> On 6/13/2025 1:35 PM, Joel Fernandes wrote:
> > 
> > 
> > On 6/13/2025 1:17 AM, Joel Fernandes wrote:
> >> sched_ext tasks currently are starved by RT hoggers especially since RT
> >> throttling was replaced by deadline servers to boost only CFS tasks. Several
> >> users in the community have reported issues with RT stalling sched_ext tasks.
> >> Add a sched_ext deadline server as well so that sched_ext tasks are also
> >> boosted and do not suffer starvation.
> >>
> >> A kselftest is also provided to verify the starvation issues are now fixed.
> >>
> >> Btw, there is still something funky going on with CPU hotplug and the
> >> relinquish patch. Sometimes the sched_ext's hotplug self-test locks up
> >> (./runner -t hotplug). Reverting that patch fixes it, so I am suspecting
> >> something is off in dl_server_remove_params() when it is being called on
> >> offline CPUs.
> > 
> > I think I got somewhere here with this sched_ext hotplug test but still not
> > there yet. Juri, Andrea, Tejun, can you take a look at the below when you get a
> > chance?
> 
> The following patch makes the sched_ext hotplug test reliably pass for me now.
> Thoughts?

For me it gets stuck here, when the hotplug test tries to bring the CPU
offline:

TEST: hotplug
DESCRIPTION: Verify hotplug behavior
OUTPUT:
[    5.042497] smpboot: CPU 1 is now offline
[    5.069691] sched_ext: BPF scheduler "hotplug_cbs" enabled
[    5.108705] smpboot: Booting Node 0 Processor 1 APIC 0x1
[    5.149484] sched_ext: BPF scheduler "hotplug_cbs" disabled (unregistered from BPF)
EXIT: unregistered from BPF (hotplug event detected (1 going online))
[    5.204500] sched_ext: BPF scheduler "hotplug_cbs" enabled
Failed to bring CPU offline (Device or resource busy)

However, if I don't stop rq->fair_server in the scx_switching_all case,
everything seems to work (though I still don't understand why).

I didn't have much time to look at this today; I'll investigate more
tomorrow.

-Andrea

> 
> From: Joel Fernandes <joelagnelf@nvidia.com>
> Subject: [PATCH] sched/deadline: Prevent setting server as started if params
>  couldn't be applied
> 
> The following call trace fails to set dl_server_apply_params() as
> dl_bw_cpus() is 0 during CPU onlining in the below path.
> 
> [   11.878356] ------------[ cut here ]------------
> [   11.882592]  <TASK>
> [   11.882685]  enqueue_task_scx+0x190/0x280
> [   11.882802]  ttwu_do_activate+0xaa/0x2a0
> [   11.882925]  try_to_wake_up+0x371/0x600
> [   11.883047]  cpuhp_bringup_ap+0xd6/0x170
> [   11.883172]  cpuhp_invoke_callback+0x142/0x540
> [   11.883327]  _cpu_up+0x15b/0x270
> [   11.883450]  cpu_up+0x52/0xb0
> [   11.883576]  cpu_subsys_online+0x32/0x120
> [   11.883704]  online_store+0x98/0x130
> [   11.883824]  kernfs_fop_write_iter+0xeb/0x170
> [   11.883972]  vfs_write+0x2c7/0x430
> [   11.884091]  ksys_write+0x70/0xe0
> [   11.884209]  do_syscall_64+0xd6/0x250
> [   11.884327]  ? clear_bhb_loop+0x40/0x90
> [   11.884443]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
> 
> It seems too early to start the server. Simply defer the starting of the
> server to the next enqueue if dl_server_apply_params() returns an error.
> In any case, we should not pretend like the server started and it does
> seem to mess up with the sched_ext CPU hotplug test.
> 
> With this, the sched_ext hotplug test reliably passes.
> 
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
>  kernel/sched/deadline.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index f0cd1dbca4b8..8dd0c6d71489 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1657,8 +1657,8 @@ void dl_server_start(struct sched_dl_entity *dl_se)
>                 u64 runtime =  50 * NSEC_PER_MSEC;
>                 u64 period = 1000 * NSEC_PER_MSEC;
> 
> -               dl_server_apply_params(dl_se, runtime, period, 1);
> -
> +               if (dl_server_apply_params(dl_se, runtime, period, 1))
> +                       return;
>                 dl_se->dl_server = 1;
>                 dl_se->dl_defer = 1;
>                 setup_new_dl_entity(dl_se);
> @@ -1675,7 +1675,7 @@ void dl_server_start(struct sched_dl_entity *dl_se)
> 
>  void dl_server_stop(struct sched_dl_entity *dl_se)
>  {
> -       if (!dl_se->dl_runtime)
> +       if (!dl_se->dl_runtime || !dl_se->dl_server_active)
>                 return;
> 
>         dequeue_dl_entity(dl_se, DEQUEUE_SLEEP);