Hello,
This is v6 of the Hierarchical Constant Bandwidth Server (HCBS) patchset, which
aims to replace the current RT_GROUP_SCHED mechanism with something more robust
and theoretically sound. The patchset has been presented at OSPM25 and OSPM26
(https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
be found at https://lwn.net/Articles/1021332/ . You can find the previous
versions of this patchset at the bottom of the page, in particular version 1,
which explains in more detail what this patchset is about and how it is
implemented.
This version addresses the reviewers' comments and introduces the following
meaningful changes:
- Update to kernel version 7.1.
- Refactorings and general cleanups.
- Removal of substantial duplicated code.
- Express more locking constraints in code.
- New cpu.rt.max interface.
- Refactoring of migration code to reduce code duplication.
The new migration code now reuses the existing push/pull and similar functions
and specializes where needed, substantially reducing the footprint of group
migration code from previous versions.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
New cgroup-v2 interface:
After extensive discussions with the kernel's maintainers, we have built a new
interface to support HCBS scheduling. Since this will be a cgroup-v2 only
feature (the fate of cgroup-v1 old RT_GROUP_SCHED has yet to be decided), it was
possible to drop the original v1 interface entirely and create a completely new
one that is similar to those that are already existing.
Every cgroup now has two new files:
- cpu.rt.max (similar to the cpu.max file)
- cpu.rt.internal (read-only, not available in the root cgroup; it may be
removed if deemed unnecessary, see below for details)
In this new interface, an HCBS cgroup may either be set to use deadline servers,
thus reserving a specified amount of bandwidth, much like in the previous
system, or delegate the scheduling of its FIFO/RR tasks to its nearest
configured ancestor (the default on group creation). If the nearest configured
ancestor is the root cgroup, the tasks effectively run on the root runqueue even
though their cgroup is not the root task group.
This means that subtrees are allowed to retain the original non-RT_GROUP_SCHED
behaviour, scheduling on root, while the feature is nonetheless active. In the
meantime other subtrees may use HCBS, and the whole hierarchy can coexist
without issues.
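The delegation rule described above can be sketched as a walk up the hierarchy.
This is only an illustrative model, not the kernel implementation: the function
name, the dictionary-based tree and the assumption that the root cgroup always
acts as the final fallback are ours.

```python
from typing import Dict, Optional, Set


def scheduling_group(group: str,
                     parent: Dict[str, Optional[str]],
                     configured: Set[str]) -> str:
    """Return the group whose runqueue serves the FIFO/RR tasks of `group`.

    `parent` maps each group to its parent (None for the root) and
    `configured` holds the groups that do not delegate to an ancestor.
    Both are hypothetical stand-ins for the kernel's task_group tree.
    """
    node = group
    while node not in configured:
        up = parent[node]
        if up is None:
            # Assumption: the root is always the last-resort fallback.
            return node
        node = up
    return node


# Hypothetical hierarchy: "a" holds a reservation, everything else delegates.
parent = {"root": None, "a": "root", "a/b": "a", "c": "root", "c/d": "c"}
configured = {"root", "a"}

print(scheduling_group("a/b", parent, configured))  # served by "a"
print(scheduling_group("c/d", parent, configured))  # served by "root"
```

In this model the subtree under "c" keeps the non-RT_GROUP_SCHED behaviour,
while the subtree under "a" is served by the reservation of "a".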
This behaviour is specified in the cpu.rt.max file, which accepts the string
"<runtime | 'max'> <period>". A zero runtime disables FIFO/RR scheduling for
tasks in that group; a non-zero runtime creates a reservation and uses HCBS; a
runtime of 'max' instead tells the scheduler to use the nearest configured
ancestor for FIFO/RR task scheduling.
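As a rough illustration of the three cases, the following sketch classifies a
cpu.rt.max string. It is not the kernel parser: the type and function names are
hypothetical, the values are treated as plain integers without assuming a unit,
and the validation rules (exactly two fields, positive period, non-negative
runtime) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RtMax:
    mode: str                # "disabled", "reserved" or "delegated"
    runtime: Optional[int]   # None when the runtime field is 'max'
    period: int


def parse_rt_max(text: str) -> RtMax:
    """Classify a "<runtime | 'max'> <period>" string (illustrative only)."""
    fields = text.split()
    if len(fields) != 2:
        raise ValueError("expected '<runtime | max> <period>'")
    runtime_field, period_field = fields
    period = int(period_field)
    if period <= 0:
        # Assumption: a non-positive period is rejected.
        raise ValueError("period must be positive")
    if runtime_field == "max":
        # Delegate FIFO/RR scheduling to the nearest configured ancestor.
        return RtMax("delegated", None, period)
    runtime = int(runtime_field)
    if runtime < 0:
        raise ValueError("runtime must not be negative")
    if runtime == 0:
        # FIFO/RR scheduling is disabled for tasks in this group.
        return RtMax("disabled", 0, period)
    # A non-zero runtime creates a reservation served by HCBS.
    return RtMax("reserved", runtime, period)


for value in ("max 100000", "0 100000", "20000 100000"):
    print(value, "->", parse_rt_max(value).mode)
```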
The admission test no longer checks only the immediate children of a cgroup for
schedulability (recall that a group's bandwidth must always be greater than or
equal to the total bandwidth of its children); it has to check the whole
subtree: if a child delegates its tasks to its parent (runtime = 'max'), then
this child's own children (the grandchildren) are effectively viewed as
immediate children that compete for the bandwidth of their grandparent, and so
on down the hierarchy.
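The subtree check can be modelled with a small recursive sketch. It assumes
that each group's bandwidth is a plain runtime/period fraction, that delegating
groups are marked with None, and that the root's available bandwidth is given;
capacity scaling and everything else the kernel test accounts for is left out.
All names are hypothetical.

```python
from fractions import Fraction
from typing import Dict, List, Optional

Bandwidth = Dict[str, Optional[Fraction]]


def effective_children_bw(group: str,
                          children: Dict[str, List[str]],
                          bw: Bandwidth) -> Fraction:
    """Sum the bandwidth that competes directly for `group`'s reservation."""
    total = Fraction(0)
    for child in children.get(group, []):
        if bw[child] is None:
            # A delegating child is transparent: its own children count
            # as if they were immediate children of `group`.
            total += effective_children_bw(child, children, bw)
        else:
            total += bw[child]
    return total


def admissible(children: Dict[str, List[str]], bw: Bandwidth) -> bool:
    """Every configured group must cover its effective children."""
    return all(effective_children_bw(group, children, bw) <= own
               for group, own in bw.items() if own is not None)


# Hypothetical hierarchy: "b" delegates, so "b/x" and "b/y" compete with "a".
children = {"root": ["a", "b"], "b": ["b/x", "b/y"]}
bw = {"root": Fraction(1), "a": Fraction(1, 2), "b": None,
      "b/x": Fraction(3, 10), "b/y": Fraction(3, 10)}

print(admissible(children, bw))   # False: 1/2 + 3/10 + 3/10 exceeds 1
bw["b/y"] = Fraction(1, 5)
print(admissible(children, bw))   # True: 1/2 + 3/10 + 1/5 equals 1
```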
To support both threaded and domain cgroups, the original test that allowed
tasks to run only in leaf cgroups has been removed: this is already enforced for
domain cgroups by existing code, while it must not be enforced for threaded
cgroups.
Since groups in the middle of the hierarchy can now also run tasks, their
dl_servers must be configured properly: the dl_servers of a parent cgroup can
only use the bandwidth assigned to the cgroup minus the total bandwidth of its
children. The cpu.rt.internal file reports exactly this "remainder" bandwidth.
Since dl_servers must have runtime and period values assigned, the period is
taken from the user-configured cpu.rt.max file and the runtime is computed from
the remainder bandwidth. This runtime and period are the values shown by
cpu.rt.internal.
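A worked example of the remainder computation, under the assumption that
bandwidth is simply runtime/period and that the resulting runtime is rounded
down (the rounding used by the kernel is not stated here). The function name
and the sample values are hypothetical.

```python
from fractions import Fraction
from typing import Iterable, Tuple


def internal_reservation(runtime: int, period: int,
                         children: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Return the (runtime, period) left for the group's own tasks.

    `children` holds the (runtime, period) pairs of the child reservations
    that compete for this group's bandwidth.
    """
    own = Fraction(runtime, period)
    used = sum((Fraction(r, p) for r, p in children), Fraction(0))
    remainder = own - used
    if remainder < 0:
        raise ValueError("children exceed the parent's bandwidth")
    # The period comes from cpu.rt.max; the runtime follows from the
    # remainder bandwidth (rounded down, which is an assumption).
    return int(remainder * period), period


# Parent: 50000/100000 = 0.5. Children: 0.1 + 0.1 = 0.2. Remainder: 0.3.
print(internal_reservation(50000, 100000, [(10000, 100000), (20000, 200000)]))
# (30000, 100000)
```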
Supporting both threaded and domain cgroups also made it possible to drop all
the extra code related to active and 'live' cgroups mentioned in previous RFCs.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Summary of the patches:
  1-2) Commits already included in sched/tip (not yet in mainline).
  3-8) Preparation patches, so that the RT classes' code can be used both
       for normal and cgroup scheduling.
 9-19) Implementation of HCBS, no migration.
       The old RT_GROUP_SCHED code is removed.
       16) Remove support for cgroup-v1.
       17) Implement cgroup-v2 cpu.rt.max interface.
20-24) Add support for task migration.
   25) Documentation for HCBS.
Updates from v5:
- Rebase to latest master.
- General rebasing/cleanup.
- More locking constraints expressed in code.
- New cpu.rt.max interface.
- Refactoring of migration code.
Updates from v4:
- Rebase to latest tip/master.
- General rebasing/cleanup.
- Update default sysctl_sched_rt_runtime to 1s, same as the period.
- Fix non-deferred deadline server replenishment logic.
- Add missing RCU read sections.
- Account HCBS servers along with their tasks when the servers are active.
- Release bandwidth resources early in unregister_rt_sched_group.
- Drop server_try_pull_task as it is now redundant.
- Remove dl_server_stop call in dequeue_task_rt.
- Update to reuse __checkparam_dl for deadline servers.
Updates from v3:
- Rebase to latest tip/master.
- General rebasing/cleanup.
- Add Documentation.
- Define **live** and **active** groups.
- Introduce server_try_pull_task in place of the removed server_has_task.
- Introduce RELEASE_LOCK helper macro for guard-based locking.
- Update inc/dec_dl_tasks to account for served runqueues regardless of the
server type.
- Fix computing of new bandwidth values in dl_init_tg.
- Fix check in dl_check_tg to use capacity scaling.
- Fix wakeup_preempt_rt to check if curr is a DEADLINE task.
Updates from v2:
- Rebase to latest tip/master.
- Remove fair-servers' bw reclaiming.
- Fix a check which prevented execution of wakeup_preempt code.
- Fix a priority check in group_pull_rt_task between tasks of different groups.
- Rework allocation/deallocation code for rt-cgroups.
- Update signatures for some group related migration functions.
- Add documentation for wakeup_preempt preemption rules.
Updates from v1:
- Rebase to latest tip/master.
- Add migration code.
- Split big patches for more readability.
- Refactor code to use guarded locks where applicable.
- Remove unnecessary patches from v1 which have been addressed differently by
mainline updates.
- Remove unnecessary checks and general code cleanup.
Notes:
Patches 1-2 have already been merged in sched/tip, but not yet merged in master.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing v6:
The patchset has been tested with a suite of tests tailored to stress all the
implemented functionalities.
The tests are available at https://github.com/Yurand2000/HCBS-Test-Suite .
Refer to the README of the repository for more details.
Follow these steps to test HCBS v6:
- Get the HCBS patch up and running. Any kernel/distro should work effortlessly.
- Get, compile and _install_ the tests.
- Run the `go_rt.sh` script to set the frequency of the CPUs to a fixed value
and disable hyperthreading and power saving features.
- Run the `run_tests.sh full` script, to run the whole test suite.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Future Work:
We think the current patchset is stable enough. Our current test suite
demonstrates, on our limited hardware, that the kernel does not throw warnings
and that it is actually possible to guarantee time reservations and isolation
among tenants.
Comments on the new cpu.rt.max interface are to be expected, but hopefully
these new ideas solve some of the issues mentioned in the past, such as not
being able to use the cpu controller because standard FIFO/RR tasks had to be
migrated to the root cgroup first. Future work needs to investigate how to
integrate this interface with the cpuset controller and with the multiCPU
feature which was presented at OSPM26.
Additional future work:
- unprivileged FIFO/RR in cgroups.
- capacity aware bandwidth reservation.
- hotplug/hotunplug management.
Have a nice day,
Yuri
v1: https://lore.kernel.org/all/20250605071412.139240-1-yurand2000@gmail.com/
v2: https://lore.kernel.org/all/20250731105543.40832-1-yurand2000@gmail.com/
v3: https://lore.kernel.org/all/20250929092221.10947-1-yurand2000@gmail.com/
v4: https://lore.kernel.org/all/20251201124205.11169-1-yurand2000@gmail.com/
v5: https://lore.kernel.org/all/20260430213835.62217-1-yurand2000@gmail.com/
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Yuri Andriaccio (14):
sched/deadline: Fix replenishment logic for non-deferred servers
sched/rt: Update default bandwidth for real-time tasks to ONE
sched/rt: Disable RT_GROUP_SCHED
sched/rt: Remove unnecessary runqueue pointer in struct rt_rq
sched/rt: Add {alloc/unregister/free}_rt_sched_group
sched/rt: Implement dl-server operations for rt-cgroups.
sched/rt: Update task event callbacks for HCBS scheduling
sched/rt: Remove support for cgroups-v1
sched/rt: Update task's RT runqueue when switching scheduling class
sched/rt: Add HCBS migration code to related functions
sched/rt: Hook HCBS migration functions
sched/rt: Try pull task on empty server pick.
sched/core: Execute enqueued balance callbacks after
migrate_disable_switch
Documentation: Update documentation for real-time cgroups
luca abeni (11):
sched/deadline: Do not access dl_se->rq directly
sched/deadline: Distinguish between dl_rq and my_q
sched/rt: Pass an rt_rq instead of an rq where needed
sched/rt: Move functions from rt.c to sched.h
sched/rt: Introduce HCBS specific structs in task_group
sched/core: Initialize HCBS specific structures.
sched/deadline: Add dl_init_tg
sched/deadline: Account rt-cgroups bandwidth in deadline tasks
schedulability tests.
sched/rt: Update rt-cgroup schedulability checks
sched/rt: Remove old RT_GROUP_SCHED data structures
sched/core: Execute enqueued balance callbacks when changing allowed
CPUs
Documentation/scheduler/sched-rt-group.rst | 470 ++++-
include/linux/rcupdate.h | 1 +
include/linux/sched.h | 12 +-
kernel/sched/autogroup.c | 4 +-
kernel/sched/core.c | 143 +-
kernel/sched/deadline.c | 221 ++-
kernel/sched/debug.c | 6 -
kernel/sched/ext.c | 4 +-
kernel/sched/fair.c | 4 +-
kernel/sched/rt.c | 2046 +++++++++-----------
kernel/sched/sched.h | 214 +-
kernel/sched/syscalls.c | 11 +-
12 files changed, 1761 insertions(+), 1375 deletions(-)
base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
--
2.54.0