Hello,
This is v4 of the Hierarchical Constant Bandwidth Server (HCBS) patchset, which
aims to replace the current RT_GROUP_SCHED mechanism with something more robust
and theoretically sound. The patchset has been presented at OSPM25
(https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
be found at https://lwn.net/Articles/1021332/ . You can find the previous
versions of this patchset at the bottom of the page; version 1 in particular
describes in more detail what this patchset is about and how it is
implemented.
This version reworks some of the patches as suggested by Juri Lelli and
Markus Elfring. The list of changes follows:
- General refactorings, cleanups, removal of unnecessary ifdeffery and comments.
- Add Documentation for HCBS, with how-tos and some theoretical background.
- Change names/definitions of active groups:
  - A **live** group is one whose bandwidth is accounted for and to which tasks
    can be attached.
  - An **active** group is a **live** group with tasks running inside.
- Add correct cleanup of allocated memory in alloc_rt_sched_group (on allocation
failure), even though free_rt_sched_group is called on error.
- Fix computing of new bandwidth values in dl_init_tg.
- Fix check in dl_check_tg to use capacity scaling.
- Fix wakeup_preempt_rt to check if curr is a DEADLINE task.
- Update inc/dec_dl_tasks to account for served runqueues regardless of the
server type.
- Update add_nr_running to update root domains and perform tracing only if the
given runqueue is global.
- Introduce server_try_pull_task, as server_has_task gets removed in kernel
version 6.18. This is needed to perform a pull on HCBS server replenish.
- Introduce RELEASE_LOCK macro for cleaner guard-based lock code.
- Move debug BUG_ONs to separate patches, since they are not meant to be used as
asserts. The last two patches are not meant to be incorporated in the kernel,
but are just used to introduce debug asserts for easier testing of expected
preconditions when executing some functions.
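As a rough illustration of the **live**/**active** distinction from the list
above, here is a minimal userspace sketch. It is not code from the patchset: the
struct and the helper names are hypothetical and only encode the two definitions
as stated.

```c
#include <stdbool.h>

/*
 * Hypothetical, simplified stand-in for a task group. This is NOT the
 * kernel's struct task_group; the fields exist only for illustration.
 */
struct hcbs_group_sketch {
	bool bw_accounted;        /* bandwidth is accounted for this group */
	unsigned int nr_running;  /* tasks currently running inside the group */
};

/* A live group is accounted for bandwidth, so tasks can be attached to it. */
static bool sketch_group_is_live(const struct hcbs_group_sketch *g)
{
	return g->bw_accounted;
}

/* An active group is a live group with at least one task running inside. */
static bool sketch_group_is_active(const struct hcbs_group_sketch *g)
{
	return sketch_group_is_live(g) && g->nr_running > 0;
}
```

With these definitions every active group is also live, while a live group with
no running tasks is not active.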
The testing system has also been updated to get rid of the closed-source
software dependency to generate the tasksets and their valid configurations.
More on that in the Testing section.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Summary of the patches:
1-4) Preparation patches, so that the RT classes' code can be used both
for normal and cgroup scheduling.
5-16) Implementation of HCBS, no migration and only one level hierarchy.
The old RT_GROUP_SCHED code is removed.
17-18) Remove cgroups v1 in favour of v2.
19) Add support for deeper hierarchies.
20-25) Add support for task migration.
26) Documentation for HCBS.
27-28) Debug BUG_ONs optional patches.
Updates from v3:
- Rebase to latest tip/master.
- General rebasing/cleanup.
- Add Documentation.
- Define **live** and **active** groups.
- Introduce server_try_pull_task in place of the removed server_has_task.
- Introduce RELEASE_LOCK helper macro for guard-based locking.
- Update inc/dec_dl_tasks to account for served runqueues regardless of the
server type.
- Fix computing of new bandwidth values in dl_init_tg.
- Fix check in dl_check_tg to use capacity scaling.
- Fix wakeup_preempt_rt to check if curr is a DEADLINE task.
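To give an idea of what a capacity-scaled bandwidth check means, here is a
minimal userspace sketch. It is not the code of dl_check_tg: the function names
are hypothetical, and the fixed-point conventions (a 20-bit bandwidth fraction
and a capacity of 1024 for a full-speed CPU) are assumed here to mirror the ones
used in mainline. The point is that the admissible bandwidth is scaled by the
summed CPU capacity instead of by the plain number of CPUs.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_BW_SHIFT        20  /* bandwidth is a 20-bit fixed-point fraction */
#define SKETCH_CAPACITY_SHIFT  10  /* a full-speed CPU has capacity 1 << 10 = 1024 */

/* runtime/period as a fixed-point fraction (no overflow handling in this sketch). */
static uint64_t sketch_to_ratio(uint64_t period, uint64_t runtime)
{
	if (period == 0)
		return 0;
	return (runtime << SKETCH_BW_SHIFT) / period;
}

/* Scale a per-CPU bandwidth limit by a (summed) CPU capacity. */
static uint64_t sketch_cap_scale(uint64_t bw, uint64_t capacity)
{
	return (bw * capacity) >> SKETCH_CAPACITY_SHIFT;
}

/*
 * Hypothetical admission test: the new reservation fits if the bandwidth
 * already in use plus the new one stays within the capacity-scaled limit.
 */
static bool sketch_bw_fits(uint64_t max_bw, uint64_t total_capacity,
			   uint64_t used_bw, uint64_t new_bw)
{
	return used_bw + new_bw <= sketch_cap_scale(max_bw, total_capacity);
}
```

For example, with a 95% per-CPU limit on one full-speed CPU (capacity 1024) plus
one half-speed CPU (capacity 512), three reservations of 50% each do not fit,
whereas they would fit on two full-speed CPUs.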
Updates from v2:
- Rebase to latest tip/master.
- Remove fair-servers' bw reclaiming.
- Fix a check which prevented execution of wakeup_preempt code.
- Fix a priority check in group_pull_rt_task between tasks of different groups.
- Rework allocation/deallocation code for rt-cgroups.
- Update signatures for some group related migration functions.
- Add documentation for wakeup_preempt preemption rules.
Updates from v1:
- Rebase to latest tip/master.
- Add migration code.
- Split big patches for more readability.
- Refactor code to use guarded locks where applicable.
- Remove unnecessary patches from v1 which have been addressed differently by
mainline updates.
- Remove unnecessary checks and general code cleanup.
Notes:
Task migration support needs some extra work to reduce its invasiveness,
especially patches 24-25. Patches 27-28 are completely optional and are not
meant to be included in the final patchset: they just add some invasive BUG_ONs
that assert some preconditions expected on some function calls.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing v4:
We are still using the tests published in version 3 for the evaluation of the
patchset (refer to the v3 cover letter for more details). For HCBS v4 the
so-called "Taskset" tests have been updated to use rt-app as a runner, while the
tasksets + configurations themselves are now generated using a completely new
open source tool: EVA-rt-Engine (https://github.com/Yurand2000/EVA-rt-Engine).
The tests are available at https://github.com/Yurand2000/HCBS-Test-Suite . Refer
to the README of the repository for more details.
Follow these steps to test HCBS v4:
- Get the HCBS patch up and running. Any kernel/distro should work effortlessly.
- Get, compile and _install_ the tests.
- Run the `go_rt.sh` script to set the frequency of the CPUs to a fixed value
and disable hyperthreading and power saving features.
- Run the `run_tests.sh full` script, to run the whole test suite.
Notes:
While you may have rt-app installed in your system, the testing suite comes with
its own rt-app version bundled.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Future Work:
While we wait for more comments, and expect stuff to break, we will work on
completing the currently partial/untested implementation of HCBS with different
runtimes per CPU, instead of having the same runtime allocated on all CPUs, to
include it in a future RFC.
Future patches:
- HCBS with different runtimes per CPU.
- Capacity-aware bandwidth reservation.
- Enable/disable dl_servers when a CPU goes online/offline.
Have a nice day,
Yuri
v1: https://lore.kernel.org/all/20250605071412.139240-1-yurand2000@gmail.com/
v2: https://lore.kernel.org/all/20250731105543.40832-1-yurand2000@gmail.com/
v3: https://lore.kernel.org/all/20250929092221.10947-1-yurand2000@gmail.com/
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Yuri Andriaccio (11):
sched/rt: Disable RT_GROUP_SCHED
sched/rt: Remove rq field in struct rt_rq
sched/rt: Implement dl-server operations for rt-cgroups.
sched/rt: Update task event callbacks for HCBS scheduling
sched/rt: Allow zeroing the runtime of the root control group
sched/rt: Remove support for cgroups-v1
sched/deadline: Introduce dl_server_try_pull_f
sched/core: Execute enqueued balance callbacks when migrating task
between cgroups
Documentation: Update documentation for real-time cgroups
[DEBUG] sched/rt: Add debug BUG_ONs for pre-migration code
[DEBUG] sched/rt: Add debug BUG_ONs in migration code.
luca abeni (17):
sched/deadline: Do not access dl_se->rq directly
sched/deadline: Distinct between dl_rq and my_q
sched/rt: Pass an rt_rq instead of an rq where needed
sched/rt: Move some functions from rt.c to sched.h
sched/rt: Introduce HCBS specific structs in task_group
sched/core: Initialize HCBS specific structures.
sched/deadline: Add dl_init_tg
sched/rt: Add {alloc/free}_rt_sched_group
sched/deadline: Account rt-cgroups bandwidth in deadline tasks
schedulability tests.
sched/rt: Update rt-cgroup schedulability checks
sched/rt: Remove old RT_GROUP_SCHED data structures
sched/core: Cgroup v2 support
sched/deadline: Allow deeper hierarchies of RT cgroups
sched/rt: Add rt-cgroup migration
sched/rt: Add HCBS migration related checks and function calls
sched/deadline: Fix HCBS migrations on server stop
sched/core: Execute enqueued balance callbacks when changing allowed
CPUs
Documentation/scheduler/sched-rt-group.rst | 500 +++-
include/linux/cleanup.h | 3 +
include/linux/sched.h | 13 +-
kernel/sched/autogroup.c | 4 +-
kernel/sched/core.c | 63 +-
kernel/sched/deadline.c | 257 +-
kernel/sched/debug.c | 6 -
kernel/sched/fair.c | 10 +-
kernel/sched/rt.c | 3097 ++++++++++----------
kernel/sched/sched.h | 176 +-
kernel/sched/syscalls.c | 6 +-
11 files changed, 2346 insertions(+), 1789 deletions(-)
base-commit: 6a23ae0a96a600d1d12557add110e0bb6e32730c
--
2.51.0