Hi All,

This patch series addresses a significant regression observed in
`SCHED_DEADLINE` performance, specifically when `SCHED_FLAG_RECLAIM`
(Greedy Reclamation of Unused Bandwidth - GRUB) is enabled alongside
overrunning jobs. This issue was reported by Marcel [1].

Marcel's team's extensive real-time scheduler (`SCHED_DEADLINE`) tests on
mainline Linux kernels (amd64-based Intel NUCs and aarch64-based RADXA
ROCK5Bs) typically show zero deadline misses for 5ms granularity tasks.
However, with reclaim mode enabled and the same two overrunning jobs in
the mix, they observed a dramatic increase in deadline misses: 43
million on NUC and 600 thousand on ROCK5B. This highlights a critical
accounting issue within `SCHED_DEADLINE` when reclaim is active.

This series fixes the issue by doing the following.

- 1/5: sched/deadline: Initialize dl_servers after SMP
  Currently, `dl-servers` are initialized too early during boot, before
  all CPUs are online. This results in an incorrect calculation of
  per-runqueue `DEADLINE` variables, such as `extra_bw`, which rely on a
  stable CPU count. This patch moves the `dl-server` initialization to a
  later stage, after SMP initialization, ensuring all CPUs are online and
  correct `extra_bw` values can be computed from the start.

- 2/5: sched/deadline: Reset extra_bw to max_bw when clearing root domains
  `dl_clear_root_domain()` did not account for the fact that per-runqueue
  `extra_bw` variables retained stale values computed before root domain
  changes. This led to broken accounting. This patch fixes the issue by
  resetting `extra_bw` to `max_bw` before restoring `dl-server`
  contributions, ensuring a clean state.

- 3/5: sched/deadline: Fix accounting after global limits change
  Changes to global `SCHED_DEADLINE` limits (handled by
  `sched_rt_handler()` logic) were found to leave stale or incorrect
  values in various accounting-related variables, including `extra_bw`.
  This patch properly cleans up per-runqueue variables before implementing
  the global limit change and then rebuilds the scheduling domains. This
  ensures that the accounting is correctly restored and maintained after
  such global limit adjustments.

- 4/5 and 5/5 are simple drgn scripts I put together to help debug
  this issue. I have the impression that they might be useful to have
  around for the future.

Please review and test.

The set is also available at

git@github.com:jlelli/linux.git upstream/fix-grub-tip

[1] https://lore.kernel.org/lkml/ce8469c4fb2f3e2ada74add22cce4bfe61fd5bab.camel@codethink.co.uk/

Thanks,
Juri

Juri Lelli (5):
  sched/deadline: Initialize dl_servers after SMP
  sched/deadline: Reset extra_bw to max_bw when clearing root domains
  sched/deadline: Fix accounting after global limits change
  tools/sched: Add root_domains_dump.py which dumps root domains info
  tools/sched: Add dl_bw_dump.py for printing bandwidth accounting info

 MAINTAINERS                      |  1 +
 kernel/sched/core.c              |  2 +
 kernel/sched/deadline.c          | 61 +++++++++++++++++++---------
 kernel/sched/rt.c                |  6 +++
 kernel/sched/sched.h             |  1 +
 tools/sched/dl_bw_dump.py        | 57 ++++++++++++++++++++++++++
 tools/sched/root_domains_dump.py | 68 ++++++++++++++++++++++++++++++++
 7 files changed, 177 insertions(+), 19 deletions(-)
 create mode 100755 tools/sched/dl_bw_dump.py
 create mode 100755 tools/sched/root_domains_dump.py

--
2.49.0
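To make the dependence of `extra_bw` on the CPU count (1/5) and the reset-then-rebuild idea (2/5) more concrete, here is a toy Python model. It is only a sketch under stated assumptions, not the kernel code: `ToyRootDomain`, `add_bw()` and `reset()` are hypothetical names, and the 95% global limit and 5% per-CPU server bandwidth are assumed values chosen for illustration. It assumes that each admitted bandwidth is divided by the number of online CPUs and subtracted from the `extra_bw` of those CPUs only.

```python
BW_SHIFT = 20  # fixed-point shift used for bandwidth ratios in this sketch


def to_ratio(period_us, runtime_us):
    """Return runtime/period as a fixed-point fraction scaled by 2**BW_SHIFT."""
    if period_us == 0:
        return 0
    return (runtime_us << BW_SHIFT) // period_us


class ToyRootDomain:
    """Hypothetical stand-in for one root domain with per-runqueue extra_bw."""

    def __init__(self, ncpus, max_bw):
        self.max_bw = max_bw
        self.extra_bw = [max_bw] * ncpus  # every runqueue starts at max_bw

    def add_bw(self, bw, online):
        """Charge bandwidth bw, split evenly across the CPUs in `online`."""
        share = bw // len(online)
        for cpu in online:
            self.extra_bw[cpu] -= share

    def reset(self, server_bw):
        """Reset extra_bw to max_bw, then re-add one server per CPU."""
        ncpus = len(self.extra_bw)
        self.extra_bw = [self.max_bw] * ncpus
        for _ in range(ncpus):
            self.add_bw(server_bw, range(ncpus))


NCPUS = 4
MAX_BW = to_ratio(1_000_000, 950_000)    # assumed global limit: 95%
SERVER_BW = to_ratio(1_000_000, 50_000)  # assumed per-CPU server: 5%

# Early init: only CPU 0 is online while every server is accounted.
early = ToyRootDomain(NCPUS, MAX_BW)
for _ in range(NCPUS):
    early.add_bw(SERVER_BW, [0])
early_snapshot = list(early.extra_bw)

# Late init: all CPUs are online, so each server's cost is spread evenly.
late = ToyRootDomain(NCPUS, MAX_BW)
for _ in range(NCPUS):
    late.add_bw(SERVER_BW, range(NCPUS))

print("early:", early_snapshot)
print("late: ", late.extra_bw)

# Resetting to max_bw and re-adding the servers repairs the skewed state.
early.reset(SERVER_BW)
print("reset:", early.extra_bw)
```

In this model the early case charges everything to CPU 0 and leaves the other runqueues at `max_bw`, while the late case yields the same value on every runqueue. After `reset()` the early instance matches the late one, which is the kind of clean state 2/5 aims for.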
Hi Juri

On Fri, 2025-06-27 at 13:51 +0200, Juri Lelli wrote:
> Hi All,
>
> This patch series addresses a significant regression observed in
> `SCHED_DEADLINE` performance, specifically when `SCHED_FLAG_RECLAIM`
> (Greedy Reclamation of Unused Bandwidth - GRUB) is enabled alongside
> overrunning jobs. This issue was reported by Marcel [1].

[...]

> Please review and test.

Over the weekend I ran 312 million test runs on NUC and 231 million on
ROCK5B without a single deadline miss. Therefore, for the whole series:

Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b

Thanks!

[...]

Cheers

Marcel
Hi everybody

On Fri, 2025-06-27 at 13:51 +0200, Juri Lelli wrote:
> Hi All,

Any more progress on this? As this is a bug, it would be really nice to
land a fix sooner rather than later :)

Thanks!

> This patch series addresses a significant regression observed in
> `SCHED_DEADLINE` performance, specifically when `SCHED_FLAG_RECLAIM`
> (Greedy Reclamation of Unused Bandwidth - GRUB) is enabled alongside
> overrunning jobs. This issue was reported by Marcel [1].

[...]

Cheers

Marcel