[v6] Hierarchical Constant Bandwidth Server

[RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server

Posted by Yuri Andriaccio 5 days, 13 hours ago

Hello,

This is v6 of the Hierarchical Constant Bandwidth Server (HCBS) patchset, which
aims to replace the current RT_GROUP_SCHED mechanism with something more robust
and theoretically sound. The patchset has been presented at OSPM25 and OSPM26
(https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
be found at https://lwn.net/Articles/1021332/ . The previous versions of this
patchset are linked below; version 1 in particular explains in more detail what
this patchset is about and how it is implemented.

This version addresses the reviewers' comments and introduces the following
notable changes:
- Update to kernel version 7.1.
- Refactorings and general cleanups.
- Removal of substantial duplicated code.
- Express more locking constraints in code.
- New cpu.rt.max interface.
- Refactoring of migration code to reduce code duplication.
  The new migration code now reuses the existing push/pull and similar functions
  and specializes where needed, substantially reducing the footprint of group
  migration code from previous versions.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
New cgroup-v2 interface:
After extensive discussions with the kernel maintainers, we have built a new
interface to support HCBS scheduling. Since this will be a cgroup-v2 only
feature (the fate of the old cgroup-v1 RT_GROUP_SCHED has yet to be decided), it
was possible to drop the original v1 interface entirely and create a completely
new one that resembles the existing cgroup-v2 interfaces.

Every cgroup now has two new files:
- cpu.rt.max (similar to the cpu.max file)
- cpu.rt.internal (read-only, not available in the root cgroup, it may be
                   removed if deemed unnecessary, see later for details)

In this new interface, an HCBS cgroup may either be set to use deadline servers,
thus reserving a specified amount of bandwidth, very similarly to the previous
system, or it can delegate the scheduling of its FIFO/RR tasks to the nearest
configured ancestor (this is the default on group creation). If the nearest
configured ancestor is the root cgroup, tasks are effectively run on the root
runqueue even if their cgroup is not the root task group.

This means that subtrees are allowed to retain the original non-RT_GROUP_SCHED
behaviour, scheduling on root, while the feature is nonetheless active. In the
meantime other subtrees may use HCBS, and the whole hierarchy can coexist
without issues.

This behaviour is specified in the cpu.rt.max file, which accepts the string
"<runtime | 'max'> <period>". A zero runtime disables FIFO/RR scheduling for
tasks in that group. A non-zero runtime creates a reservation and uses HCBS. A
runtime of 'max' instead tells the scheduler to use the nearest configured
ancestor for the FIFO/RR task scheduling.
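
As an illustration of the three cases above, here is a minimal sketch that
classifies a cpu.rt.max string. It is not the kernel parser: the names
parse_cpu_rt_max and RtMax and the mode labels are hypothetical, and the values
are assumed to be in microseconds, as in cpu.max.

```python
from typing import NamedTuple, Optional


class RtMax(NamedTuple):
    mode: str                  # "disabled", "reservation" or "delegate" (hypothetical labels)
    runtime_us: Optional[int]  # None when the group delegates to an ancestor
    period_us: int


def parse_cpu_rt_max(text: str) -> RtMax:
    """Classify a "<runtime | 'max'> <period>" string (illustrative only)."""
    runtime_s, period_s = text.split()
    period = int(period_s)
    if period <= 0:
        raise ValueError("period must be positive")
    if runtime_s == "max":
        # Delegate FIFO/RR scheduling to the nearest configured ancestor.
        return RtMax("delegate", None, period)
    runtime = int(runtime_s)
    if runtime < 0:
        raise ValueError("runtime must not be negative")
    if runtime == 0:
        # FIFO/RR scheduling is disabled for tasks in this group.
        return RtMax("disabled", 0, period)
    # A non-zero runtime creates a reservation served by HCBS.
    return RtMax("reservation", runtime, period)


for line in ("10000 100000", "max 100000", "0 100000"):
    print(line, "->", parse_cpu_rt_max(line).mode)
```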

The admission test no longer checks only the immediate children of a cgroup for
schedulability (recall that a group's bandwidth must always be greater than or
equal to its children's total bandwidth); it has to check the whole subtree. If
a child delegates its tasks to its parent (runtime = 'max'), then this child's
own children (the grandchildren) are effectively viewed as immediate children
that compete for the bandwidth of their grandparent, and so on down the
hierarchy.
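
The subtree rule can be sketched as follows. This is a simplified model and not
the code in the patchset: the Group class and the functions
effective_children_bw and admissible are hypothetical, exact fractions stand in
for the kernel's fixed-point bandwidth, and per-CPU capacity is ignored.

```python
from dataclasses import dataclass, field
from fractions import Fraction
from typing import List, Optional


@dataclass
class Group:
    name: str
    runtime_us: Optional[int]          # None models runtime = 'max' (delegation)
    period_us: int = 100_000
    children: List["Group"] = field(default_factory=list)

    def bandwidth(self) -> Fraction:
        return Fraction(self.runtime_us, self.period_us)


def effective_children_bw(group: Group) -> Fraction:
    """Total bandwidth of the nearest configured descendants of `group`.

    A delegating child is looked through: its own children count as if they
    were immediate children of `group`.
    """
    total = Fraction(0)
    for child in group.children:
        if child.runtime_us is None:
            total += effective_children_bw(child)
        else:
            total += child.bandwidth()
    return total


def admissible(group: Group) -> bool:
    """Check every configured group in the subtree against its effective children."""
    if group.runtime_us is not None and effective_children_bw(group) > group.bandwidth():
        return False
    return all(admissible(child) for child in group.children)


# root reserves 50%; "a" delegates, so "b" and "c" compete for root's bandwidth.
b, c = Group("b", 30_000), Group("c", 30_000)
root = Group("root", 50_000, children=[Group("a", None, children=[b, c])])
print(admissible(root))   # 30% + 30% exceeds 50%
c.runtime_us = 20_000
print(admissible(root))   # 30% + 20% fits in 50%
```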

To support both threaded and domain cgroups, the original test that allowed
tasks to run only in leaf cgroups has been removed: this is already enforced for
domain cgroups by existing code, while it must not be enforced for threaded
cgroups.

Since groups in the middle of the hierarchy can now also run tasks, their
dl_servers must be configured properly: the dl_servers of a parent cgroup can
only use the bandwidth assigned to the cgroup minus the total bandwidth of its
children. The cpu.rt.internal file reports exactly this "remainder" bandwidth.
Since dl_servers must have both a runtime and a period assigned, the period is
taken from the user-configured cpu.rt.max file and the runtime is computed from
the remainder bandwidth. This runtime and the period are the values shown by
cpu.rt.internal.
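
As a worked example of the remainder, the sketch below uses exact fractions and
assumes microsecond values as in cpu.rt.max. The function internal_runtime_us
is hypothetical and idealized; the kernel works with fixed-point bandwidth
values instead, so the reported number may differ slightly.

```python
from fractions import Fraction
from typing import Iterable, Tuple


def internal_runtime_us(runtime_us: int, period_us: int,
                        children: Iterable[Tuple[int, int]]) -> Fraction:
    """Runtime left to the group's own dl_servers, over the group's period.

    `children` holds (runtime_us, period_us) pairs of the configured children.
    """
    own_bw = Fraction(runtime_us, period_us)
    children_bw = sum((Fraction(r, p) for r, p in children), Fraction(0))
    remainder_bw = own_bw - children_bw
    if remainder_bw < 0:
        raise ValueError("children exceed the group's bandwidth")
    # The period comes from cpu.rt.max; the runtime follows from the remainder.
    return remainder_bw * period_us


# Group: 50000/100000 (50%). Children: 10000/100000 (10%) and 5000/50000 (10%).
runtime = internal_runtime_us(50_000, 100_000, [(10_000, 100_000), (5_000, 50_000)])
print(f"{int(runtime)} 100000")   # remainder is 30% of a 100000us period
```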

Supporting both threaded and domain cgroups also made it possible to drop all
the extra code related to 'active' and 'live' cgroups mentioned in previous
RFCs.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Summary of the patches:
   1-2) Commits already included in sched/tip (not yet in mainline).
   3-8) Preparation patches, so that the RT classes' code can be used both
        for normal and cgroup scheduling.
  9-19) Implementation of HCBS, no migration.
        The old RT_GROUP_SCHED code is removed.
        16) Remove support for cgroup-v1.
        17) Implement cgroup-v2 cpu.rt.max interface.
 20-24) Add support for tasks migration.
    25) Documentation for HCBS.

Updates from v5:
- Rebase to latest master.
- General rebasing/cleanup.
- More locking constraints expressed in code.
- New cpu.rt.max interface.
- Refactoring of migration code.

Updates from v4:
- Rebase to latest tip/master.
- General rebasing/cleanup.
- Update default sysctl_sched_rt_runtime to 1s, same as the period.
- Fix non-deferred deadline server replenishment logic.
- Add missing RCU read sections.
- Account HCBS servers along with their tasks when the servers are active.
- Release bandwidth resources early in unregister_rt_sched_group.
- Drop server_try_pull_task as it is now redundant.
- Remove dl_server_stop call in dequeue_task_rt.
- Update to reuse __checkparam_dl for deadline servers.

Updates from v3:
- Rebase to latest tip/master.
- General rebasing/cleanup.
- Add Documentation.
- Define **live** and **active** groups.
- Introduce server_try_pull_task in place of the removed server_has_task.
- Introduce RELEASE_LOCK helper macro for guard-based locking.
- Update inc/dec_dl_tasks to account for served runqueues regardless of the
  server type.
- Fix computing of new bandwidth values in dl_init_tg.
- Fix check in dl_check_tg to use capacity scaling.
- Fix wakeup_preempt_rt to check if curr is a DEADLINE task.

Updates from v2:
- Rebase to latest tip/master.
- Remove fair-servers' bw reclaiming.
- Fix a check which prevented execution of wakeup_preempt code.
- Fix a priority check in group_pull_rt_task between tasks of different groups.
- Rework allocation/deallocation code for rt-cgroups.
- Update signatures for some group related migration functions.
- Add documentation for wakeup_preempt preemption rules.

Updates from v1:
- Rebase to latest tip/master.
- Add migration code.
- Split big patches for more readability.
- Refactor code to use guarded locks where applicable.
- Remove unnecessary patches from v1 which have been addressed differently by
  mainline updates.
- Remove unnecessary checks and general code cleanup.

Notes:

Patches 1-2 have already been merged in sched/tip, but not yet merged in master.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing v6:

The patchset has been tested with a suite of tests tailored to stress all the
implemented functionalities.
The tests are available at https://github.com/Yurand2000/HCBS-Test-Suite .
Refer to the README of the repository for more details.

Follow these steps to test HCBS v6:
- Get the HCBS patch up and running. Any kernel/distro should work effortlessly.
- Get, compile and _install_ the tests.
- Run the `go_rt.sh` script to set the frequency of the CPUs to a fixed value
  and disable hyperthreading and power saving features.
- Run the `run_tests.sh full` script, to run the whole test suite.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Future Work:

We think the current patchset is stable enough. Our current test suite
demonstrates, on our limited hardware, that the kernel does not throw warnings
and that it is actually possible to guarantee time reservations and isolation
among tenants.

Comments on the new cpu.rt.max interface are to be expected, but hopefully with
these new ideas we have solved some of the issues mentioned in the past, such as
not being able to use the cpu controller because standard FIFO/RR tasks had to
be migrated to the root cgroup first. How to integrate this interface with the
cpuset controller, and with the multiCPU feature presented at OSPM26, still
needs to be investigated.

Additional future work:
 - unprivileged FIFO/RR in cgroups.
 - capacity aware bandwidth reservation.
 - hotplug/hotunplug management.

Have a nice day,
Yuri

v1: https://lore.kernel.org/all/20250605071412.139240-1-yurand2000@gmail.com/
v2: https://lore.kernel.org/all/20250731105543.40832-1-yurand2000@gmail.com/
v3: https://lore.kernel.org/all/20250929092221.10947-1-yurand2000@gmail.com/
v4: https://lore.kernel.org/all/20251201124205.11169-1-yurand2000@gmail.com/
v5: https://lore.kernel.org/all/20260430213835.62217-1-yurand2000@gmail.com/

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Yuri Andriaccio (14):
  sched/deadline: Fix replenishment logic for non-deferred servers
  sched/rt: Update default bandwidth for real-time tasks to ONE
  sched/rt: Disable RT_GROUP_SCHED
  sched/rt: Remove unnecessary runqueue pointer in struct rt_rq
  sched/rt: Add {alloc/unregister/free}_rt_sched_group
  sched/rt: Implement dl-server operations for rt-cgroups.
  sched/rt: Update task event callbacks for HCBS scheduling
  sched/rt: Remove support for cgroups-v1
  sched/rt: Update task's RT runqueue when switching scheduling class
  sched/rt: Add HCBS migration code to related functions
  sched/rt: Hook HCBS migration functions
  sched/rt: Try pull task on empty server pick.
  sched/core: Execute enqueued balance callbacks after
    migrate_disable_switch
  Documentation: Update documentation for real-time cgroups

luca abeni (11):
  sched/deadline: Do not access dl_se->rq directly
  sched/deadline: Distinguish between dl_rq and my_q
  sched/rt: Pass an rt_rq instead of an rq where needed
  sched/rt: Move functions from rt.c to sched.h
  sched/rt: Introduce HCBS specific structs in task_group
  sched/core: Initialize HCBS specific structures.
  sched/deadline: Add dl_init_tg
  sched/deadline: Account rt-cgroups bandwidth in deadline tasks
    schedulability tests.
  sched/rt: Update rt-cgroup schedulability checks
  sched/rt: Remove old RT_GROUP_SCHED data structures
  sched/core: Execute enqueued balance callbacks when changing allowed
    CPUs

 Documentation/scheduler/sched-rt-group.rst |  470 ++++-
 include/linux/rcupdate.h                   |    1 +
 include/linux/sched.h                      |   12 +-
 kernel/sched/autogroup.c                   |    4 +-
 kernel/sched/core.c                        |  143 +-
 kernel/sched/deadline.c                    |  221 ++-
 kernel/sched/debug.c                       |    6 -
 kernel/sched/ext.c                         |    4 +-
 kernel/sched/fair.c                        |    4 +-
 kernel/sched/rt.c                          | 2046 +++++++++-----------
 kernel/sched/sched.h                       |  214 +-
 kernel/sched/syscalls.c                    |   11 +-
 12 files changed, 1761 insertions(+), 1375 deletions(-)


base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
--
2.54.0

Re: [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server

Posted by Juri Lelli 4 days, 9 hours ago

Hi Yuri,

Thanks for sending this out.

On 08/06/26 14:15, Yuri Andriaccio wrote:
> Hello,
> 
> This is the v6 for Hierarchical Constant Bandwidth Server, aiming at replacing
> the current RT_GROUP_SCHED mechanism with something more robust and
> theoretically sound. The patchset has been presented at OSPM25 and OSPM26
> (https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
> be found at https://lwn.net/Articles/1021332/ . You can find the previous
> versions of this patchset at the bottom of the page, in particular version 1
> which talks in more detail what this patchset is all about and how it is
> implemented.
> 
> This v6 version works on the comments by the reviewers and introduces the
> following meaningful changes:
> - Update to kernel version 7.1.
> - Refactorings and general cleanups.
> - Removal of substantial duplicated code.
> - Express more locking constraints in code.
> - New cpu.rt.max interface.
> - Refactoring of migration code to reduce code duplication.
>   The new migration code now reuses the existing push/pull and similar functions
>   and specializes where needed, substantially reducing the footprint of group
>   migration code from previous versions.
> 
> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> New cgroup-v2 interface:
> After extensive discussions with the kernel's maintainers, we have built a new
> interface to support HCBS scheduling. Since this will be a cgroup-v2 only
> feature (the fate of cgroup-v1 old RT_GROUP_SCHED has yet to be decided), it was
> possible to drop the original v1 interface entirely and create a completely new
> one that is similar to those that are already existing.
> 
> Every cgroup has now two new files:
> - cpu.rt.max (similar to the cpu.max file)
> - cpu.rt.internal (read-only, not available in the root cgroup, it may be
>                    removed if deemed unnecessary, see later for details)
> 
> In this new interface, HCBS cgroups may either be set to use deadline servers,
> and thus reserving a specified amount of bandwidth, very similarly to the
> previous system, or can delegate their FIFO/RR tasks' scheduling to the nearest
> ancestor that it is configured (default on group creation). If the nearest
> configured ancestor is the root cgroup, tasks will be effectively run on the
> root runqueue even if their cgroup is not the root task group.
> 
> This means that subtrees are allowed to retain the original non-RT_GROUP_SCHED
> behaviour, scheduling on root, while the feature is nonetheless active. In the
> meantime other subtrees may use HCBS, and the whole hierarchy can coexist
> without issues.
> 
> This behaviour is specified in the cpu.rt.max file, which accepts the string
> "<runtime | 'max'> <period>". A zero runtime disables FIFO/RR scheduling for
> tasks in that group, a non-zero runtime creates a reservation and uses HCBS, a
> runtime of 'max' instead tells the scheduler to use the nearest configured
> ancestor for the FIFO/RR task scheduling.
> 
> The admission test now does not only check the immediate children of a cgroup
> for schedulability (recall that a group's bandwidth must be always greater than
> or equal to its children total bandwidth), but it has to check its whole
> subtree: if a child delegates its tasks to its parents (runtime = 'max'), then
> this child's own children (the grandchildrens) are effectively viewed as
> immediate children that compete for the same bandwidth of their grandparent, and
> so on down the hierarchy.
> 
> To support both threaded and domain cgroups, the original test that allowed only
> to run tasks in leaf cgroups has been removed: this is already enforced for
> domain cgroups by existing code, while this must not be the case for threaded
> cgroups.
> 
> Since groups in the middle of the hierarchy can now also run tasks, their
> dl_servers must be configured properly: a parent cgroup dl_servers can only use
> their assigned bandwidth minus the total of their children. The cpu.rt.internal
> file reads exactly what is this "remainder" bandwidth. Since dl_servers must
> have a runtime and period values assigned, the period is taken from the user
> configured cpu.rt.max file and the runtime is computed from the remainder bw.
> This runtime and the period are the values shown by cpu.rt.internal.
> 
> Supporting both threaded and domain cgroups also dropped all the extra code
> related to active and 'live' cgroups as mentioned in previous RFCs.
> 

I started playing with the new interface and ended up with the following

bash-5.3# cat cpu.rt.max  (root)
10000 100000
bash-5.3# cat g1/cpu.rt.max
10000 100000
bash-5.3# cat g1/cpu.rt.internal
9999 100000

which looks odd to me, as nothing is running on g1 yet and no children
groups either. Maybe a rounding error of some kind?

Thanks,
Juri

Re: [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server

Posted by Yuri Andriaccio 4 days, 8 hours ago

Hi Juri,

Thanks for looking into this.

 > I started playing with the new interface and ended up with the following
 >
 > bash-5.3# cat cpu.rt.max  (root)
 > 10000 100000
 > bash-5.3# cat g1/cpu.rt.max
 > 10000 100000
 > bash-5.3# cat g1/cpu.rt.internal
 > 9999 100000
 >
 > which looks odd to me, as nothing is running on g1 yet and no children
 > groups either. Maybe a rounding error of some kind?

You are right. I should have mentioned that it is just a rounding error 
that occurs when converting from a bandwidth value to a runtime value. 
This happens because the tg_rt_internal_bandwidth() function truncates 
the value when converting the runtime from nanoseconds to microseconds. 
Rounding could be used here to report a more accurate value.

This same issue is probably found in the from_ratio() function, which 
has a similar truncation issue when converting from bandwidth to 
runtime, but since it is working in the nanoseconds range it might not 
be that big of a problem. The value from from_ratio() is used for the 
setup of the dl_servers even when the children bw is zero, so maybe it 
is possible to add a special case?

Anyway, as it is right now, cpu.rt.internal may have at most a +1/-1us 
error in reporting the actual used values, while the error for the 
runtime value used internally to set up the dl_servers is in the range 
of tens of nanoseconds.
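
The numbers reported above can be reproduced with a small sketch. It assumes
BW_SHIFT = 20 and a to_ratio() that computes (runtime << BW_SHIFT) / period, as
in mainline; the from_ratio() below is only a guess at the inverse conversion
and its signature is hypothetical.

```python
BW_SHIFT = 20          # assumed fixed-point shift for bandwidth values
NSEC_PER_USEC = 1000


def to_ratio(period_ns: int, runtime_ns: int) -> int:
    """Runtime/period as a fixed-point bandwidth (truncating division)."""
    return (runtime_ns << BW_SHIFT) // period_ns


def from_ratio(period_ns: int, bw: int) -> int:
    """Hypothetical inverse: bandwidth back to a runtime in ns (truncating)."""
    return (bw * period_ns) >> BW_SHIFT


period_ns = 100_000 * NSEC_PER_USEC
runtime_ns = 10_000 * NSEC_PER_USEC

bw = to_ratio(period_ns, runtime_ns)
back_ns = from_ratio(period_ns, bw)
truncated_us = back_ns // NSEC_PER_USEC
rounded_us = (back_ns + NSEC_PER_USEC // 2) // NSEC_PER_USEC

print(bw)                    # 104857
print(runtime_ns - back_ns)  # 58 (ns lost in the round trip)
print(truncated_us)          # 9999, the value seen in cpu.rt.internal
print(rounded_us)            # 10000, with round-to-nearest on display
```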

Thanks,
Yuri

Re: [RFC PATCH v6 00/25] Hierarchical Constant Bandwidth Server

Posted by Juri Lelli 3 days, 15 hours ago

On 09/06/26 18:23, Yuri Andriaccio wrote:
> Hi Juri,
> 
> Thanks for looking into this.
> 
> > I started playing with the new interface and ended up with the following
> >
> > bash-5.3# cat cpu.rt.max  (root)
> > 10000 100000
> > bash-5.3# cat g1/cpu.rt.max
> > 10000 100000
> > bash-5.3# cat g1/cpu.rt.internal
> > 9999 100000
> >
> > which looks odd to me, as nothing is running on g1 yet and no children
> > groups either. Maybe a rounding error of some kind?
> 
> You are right. I should have mentioned that it is just a rounding error that
> occurs when converting from a bandwidth value to a runtime value. This
> happens because the tg_rt_internal_bandwidth() function truncates the value
> when transforming the runtime from nanoseconds to micros. Rounding could be
> used here to report a more accurate value.
> 
> This same issue is probably found in the from_ratio() function, which has a
> similar truncation issue when converting from bandwidth to runtime, but
> since it is working in the nanoseconds range it might not be that big of a
> problem. The value from from_ratio() is used for the setup of the dl_servers
> even when the children bw is zero, so maybe it is possible to add a special
> case?
> 
> Anyways, as it is right now, the cpu.rt.internal may have only a +1/-1us
> error in reporting the actual used values, while the error for the runtime
> value used internally to setup the dl_servers is in the range of tens of
> nanoseconds.

Not a huge problem per se, but it will raise some eyebrows (and generate
questions) if we leave things as is, I fear.

I wonder if, instead of converting to bandwidth ratios and back (losing
precision in both directions), we can compute children's runtime sum directly
in nanoseconds. For children with different periods, we can maybe normalize
(128-bit intermediate?). Parent's internal runtime is then a simple exact
subtraction: parent_runtime - children_runtime_sum. This should reduce
precision loss from double conversions. Also, as you suggest as well, apply
rounding when displaying to user.
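
A rough sketch of this idea, under stated assumptions: the helper names are
hypothetical, Python's unbounded integers stand in for the 128-bit
intermediate, and rounding each child's scaled runtime up is a choice made here
so that the parent's remainder is never overestimated.

```python
NSEC_PER_USEC = 1000


def div_round_up(n: int, d: int) -> int:
    return (n + d - 1) // d


def div_round_closest(n: int, d: int) -> int:
    return (n + d // 2) // d


def internal_runtime_ns(parent_runtime_ns, parent_period_ns, children):
    """parent_runtime - children_runtime_sum, all in nanoseconds.

    `children` holds (runtime_ns, period_ns) pairs. Each child's runtime is
    normalized to the parent's period before summing; the product
    runtime * parent_period is the wide intermediate value.
    """
    children_sum = sum(div_round_up(r * parent_period_ns, p) for r, p in children)
    return parent_runtime_ns - children_sum


def show_us(runtime_ns: int) -> int:
    """Round to the nearest microsecond when displaying to the user."""
    return div_round_closest(runtime_ns, NSEC_PER_USEC)


# No children: the parent's runtime is reported back exactly.
print(show_us(internal_runtime_ns(10_000_000, 100_000_000, [])))   # 10000

# One child with a different period (1ms every 30ms), normalized to 100ms.
left = internal_runtime_ns(10_000_000, 100_000_000, [(1_000_000, 30_000_000)])
print(left, show_us(left))   # 6666666 6667
```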