Hello,
This is v3 of the Hierarchical Constant Bandwidth Server patchset, aiming at replacing
the current RT_GROUP_SCHED mechanism with something more robust and
theoretically sound. The patchset has been presented at OSPM25
(https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
be found at https://lwn.net/Articles/1021332/ . Links to the previous versions
of this patchset are at the bottom of this message; version 1 in particular
describes in more detail what this patchset is all about and how it is
implemented.
This v3 further reworks some of the patches as suggested by Juri Lelli. Most of
the work is refactoring, but the following also changed:
- The first patch, which removed the fair-servers' bandwidth accounting, has
been dropped, as it was deemed wrong. The last version of this dropped patch,
kept for historical reference, is here:
https://lore.kernel.org/all/20250903114448.664452-1-yurand2000@gmail.com/
- A left-over check which prevented execution of some of the wakeup_preempt
code has been removed.
- The cgroup pull code was erroneously comparing cgroup tasks with non-cgroup
tasks; this has been fixed.
- The allocation/deallocation code for rt cgroups has been checked and reworked
to make sure that resources are managed correctly in all the code paths.
- The signatures of some cgroup-migration-related functions were changed to
match their non-group counterparts more closely.
- Descriptions and documentation were added where necessary, in particular for
preemption rules in wakeup_preempt.
For this v3 version we've also polished the testing system we are using and made
it public for testers to run on their own machines. The source code can be found
at https://github.com/Yurand2000/HCBS-Test-Suite , along with a README that
explains how to use it. A description of the tools and instructions for running
them are also reported later in this message.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Summary of the patches:
1-4) Preparation patches, so that the RT classes' code can be used both
for normal and cgroup scheduling.
5-15) Implementation of HCBS, without migration and with only a one-level hierarchy.
The old RT_GROUP_SCHED code is removed.
16-17) Remove cgroups v1 in favour of v2.
18) Add support for deeper hierarchies.
19-24) Add support for task migration.
Updates from v2:
- Rebase to latest tip/master.
- Remove fair-servers' bw reclaiming.
- Fix a check which prevented execution of wakeup_preempt code.
- Fix a priority check in group_pull_rt_task between tasks of different groups.
- Rework allocation/deallocation code for rt-cgroups.
- Update signatures for some group related migration functions.
- Add documentation for wakeup_preempt preemption rules.
Updates from v1:
- Rebase to latest tip/master.
- Add migration code.
- Split big patches for more readability.
- Refactor code to use guarded locks where applicable.
- Remove unnecessary patches from v1 which have been addressed differently by
mainline updates.
- Remove unnecessary checks and general code cleanup.
Notes:
Task migration support needs some extra work to reduce its invasiveness,
especially in patches 21-22.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing v3:
The HCBS mechanism has been evaluated on several synthetic tests which are
designed to stress the HCBS scheduler and verify that non-interference and
mathematical schedulability guarantees are really enforced by the scheduling
algorithm.
The test suite currently runs different categories of tests:
- Constraints, which assert that hard constraints, such as schedulability
conditions, are respected.
- Regression, to check that HCBS does not break anything that already exists.
- Stress, to repeatedly invoke the scheduler through all the exposed
interfaces, with the goal of detecting bugs and, more importantly, race
conditions.
- Time, simple benchmarks to assert that the dl_servers work correctly, i.e.
they allocate the correct amount of bandwidth, and that the migration code
allows the cgroup's allocated bandwidth to be fully utilized.
- Taskset: given a set of (generated) periodic tasks and their bandwidth
requirements, schedulability analyses are performed to decide whether or not a
given hardware configuration can run the taskset. In particular, for each
taskset, an HCBS cgroup configuration along with the number of necessary CPUs
is generated. These are mathematically guaranteed to be schedulable.
The next step of this test suite is to configure cgroups as computed and to
run the taskset, to verify that the HCBS implementation works as intended and
that the scheduling overheads are within reasonable bounds.
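To illustrate the kind of admission reasoning involved (the analyser actually
used and the exact analyses it performs are not described here), below is a
minimal, purely utilization-based sketch in C; the task parameters, CPU count
and 90% per-CPU reservation are made-up example values:

#include <stdio.h>

/*
 * Hypothetical example: three periodic tasks (WCET, period, in us) running
 * inside an HCBS-style reservation of 90000us every 100000us per CPU (90%).
 */
struct task {
	double wcet_us;
	double period_us;
};

int main(void)
{
	struct task ts[] = {
		{ 2000.0, 10000.0 },	/* U = 0.20 */
		{ 5000.0, 20000.0 },	/* U = 0.25 */
		{ 3000.0, 30000.0 },	/* U = 0.10 */
	};
	int ncpus = 2;
	double runtime_us = 90000.0, period_us = 100000.0;
	double reserved = ncpus * (runtime_us / period_us);
	double total_u = 0.0;
	size_t i;

	for (i = 0; i < sizeof(ts) / sizeof(ts[0]); i++)
		total_u += ts[i].wcet_us / ts[i].period_us;

	/* Necessary (not sufficient) condition: the taskset's total
	 * utilization must fit in the bandwidth reserved on the CPUs. */
	printf("U = %.2f, reserved = %.2f -> %s\n", total_u, reserved,
	       total_u <= reserved ? "may be schedulable" : "not schedulable");

	return 0;
}

A bound like this is only a necessary condition on multiple CPUs; the generated
configurations mentioned above rely on proper schedulability analyses to
actually guarantee the tasksets.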
The source code can be found at https://github.com/Yurand2000/HCBS-Test-Suite .
The README file should answer most, if not all, questions, but here is a brief
outline of the pipeline to run these tests:
- Get the HCBS patch up and running. Any kernel/distro should work effortlessly.
- Get, compile and _install_ the tests.
- Download the additional taskset files and extract them into the _install_
folder. You can find them here:
https://github.com/Yurand2000/HCBS-Test-Suite/releases/tag/250926
- Run the `run_tests.sh full` script, to run the whole test suite.
Expect a total runtime of ~3 hours. The script will automatically mount the
cgroup and debug filesystems (if not already mounted) and will move all the
already running SCHED_FIFO/SCHED_RR tasks into the root cgroup, so that the
cgroups' CPU controller can be mounted. It will additionally try to reserve all
the available rt-bandwidth for cgroups (i.e. 90%) to run the later tests, so
make sure that there are no running SCHED_DEADLINE tasks if the script fails to
set up.
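For reference, the rt-bandwidth mentioned above is governed by the standard
sched_rt_runtime_us/sched_rt_period_us sysctls; the small standalone sketch
below (not part of the test suite) reads them and prints the currently
configured fraction, which can help diagnose a failed setup:

#include <stdio.h>

/* Read a single integer from a sysctl file; returns -2 on error. */
static long read_sysctl(const char *path)
{
	FILE *f = fopen(path, "r");
	long val = -2;

	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -2;
		fclose(f);
	}
	return val;
}

int main(void)
{
	long runtime = read_sysctl("/proc/sys/kernel/sched_rt_runtime_us");
	long period  = read_sysctl("/proc/sys/kernel/sched_rt_period_us");

	if (runtime == -2 || period <= 0) {
		fprintf(stderr, "could not read RT bandwidth sysctls\n");
		return 1;
	}
	if (runtime < 0) {	/* -1 means RT throttling is disabled */
		printf("RT throttling disabled\n");
		return 0;
	}
	printf("global RT bandwidth: %ld/%ld us (%.1f%%)\n",
	       runtime, period, 100.0 * runtime / period);
	return 0;
}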
Some tests need a minimum number of CPU cores, up to a maximum of eight. If
your machine has fewer CPUs, those tests will simply be skipped.
Notes:
The tasksets' minimal requirements were computed using closed-source software,
which is why the tasksets are supplied separately. An open-source analyser is
being written to cover this step in the future and also allow for more
customization by testers.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Future Work:
While we wait for more comments, and expect stuff to break, we will work on
completing the currently partial/untested implementation of HCBS with different
runtimes per CPU (instead of having the same runtime allocated on all CPUs), to
include it in a future RFC.
Future patches:
- HCBS with different runtimes per CPU.
- Capacity-aware bandwidth reservation.
- Enabling/disabling dl_servers when a CPU goes online/offline.
Have a nice day,
Yuri
v1: https://lore.kernel.org/all/20250605071412.139240-1-yurand2000@gmail.com/
v2: https://lore.kernel.org/all/20250731105543.40832-1-yurand2000@gmail.com/
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Yuri Andriaccio (6):
sched/rt: Disable RT_GROUP_SCHED
sched/rt: Add rt-cgroups' dl-servers operations.
sched/rt: Update task event callbacks for HCBS scheduling
sched/rt: Allow zeroing the runtime of the root control group
sched/rt: Remove support for cgroups-v1
sched/core: Execute enqueued balance callbacks when migrating task
between cgroups
luca abeni (18):
sched/deadline: Do not access dl_se->rq directly
sched/deadline: Distinct between dl_rq and my_q
sched/rt: Pass an rt_rq instead of an rq where needed
sched/rt: Move some functions from rt.c to sched.h
sched/rt: Introduce HCBS specific structs in task_group
sched/core: Initialize root_task_group
sched/deadline: Add dl_init_tg
sched/rt: Add {alloc/free}_rt_sched_group
sched/deadline: Account rt-cgroups bandwidth in deadline tasks
schedulability tests.
sched/rt: Update rt-cgroup schedulability checks
sched/rt: Remove old RT_GROUP_SCHED data structures
sched/core: Cgroup v2 support
sched/deadline: Allow deeper hierarchies of RT cgroups
sched/rt: Add rt-cgroup migration
sched/rt: Add HCBS migration related checks and function calls
sched/deadline: Make rt-cgroup's servers pull tasks on timer
replenishment
sched/deadline: Fix HCBS migrations on server stop
sched/core: Execute enqueued balance callbacks when changing allowed
CPUs
include/linux/sched.h | 10 +-
kernel/sched/autogroup.c | 4 +-
kernel/sched/core.c | 65 +-
kernel/sched/deadline.c | 251 +++-
kernel/sched/debug.c | 6 -
kernel/sched/fair.c | 6 +-
kernel/sched/rt.c | 3069 +++++++++++++++++++-------------------
kernel/sched/sched.h | 150 +-
kernel/sched/syscalls.c | 6 +-
9 files changed, 1850 insertions(+), 1717 deletions(-)
base-commit: cec1e6e5d1ab33403b809f79cd20d6aff124ccfe
--
2.51.0
Hello!

On 29/09/25 11:21, Yuri Andriaccio wrote:
> Hello,
>
> This is the v3 for Hierarchical Constant Bandwidth Server, aiming at replacing
> the current RT_GROUP_SCHED mechanism with something more robust and
> theoretically sound. The patchset has been presented at OSPM25
> (https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
> be found at https://lwn.net/Articles/1021332/ . You can find the previous
> versions of this patchset at the bottom of the page, in particular version 1
> which talks in more detail what this patchset is all about and how it is
> implemented.
>
> This v3 version further reworks some of the patches as suggested by Juri Lelli.
> While most of the work is refactorings, the following were also changed:
> - The first patch which removed fair-servers' bandwidth accounting has been
> removed, as it was deemed wrong. You can find the last version of this removed
> patch, just for history reasons, here:
> https://lore.kernel.org/all/20250903114448.664452-1-yurand2000@gmail.com/

Peter wasn't indeed happy with that patch, but I am not sure we finished
that discussion. Both myself and Luca had further objections to what Peter
said, but no further replies after (which can very well be a sign that he
is still adamant in saying no go away :). Peter?

https://lore.kernel.org/lkml/aLk9BNnFYZ3bhVAE@jlelli-thinkpadt14gen4.remote.csb/
https://lore.kernel.org/lkml/20250904091217.78de3dde@luca64/

...

> For this v3 version we've also polished the testing system we are using and made
> it public for testers to run on their own machines. The source code can be found
> at https://github.com/Yurand2000/HCBS-Test-Suite , along with a README that
> explains how to use it. Nonetheless I've reported a description of the tools and
> instruction later in the page.

Thanks for this. Quite cool.

I tried to run the tests, but it looks like the migration set brought my
qemu 8 CPUs system to a halt (had to kill the thing). I will try again.

I have been working a bit on trying to come up with a testing framework
for SCHED_DEADLINE, which I unfortunately haven't posted yet, mostly
because I was waiting for the discussion about the patch mentioned above
to settle (which would require adaptation in the tests that check for
bandwidth limits). You can find the idea here [1]. It's thought to be an
addition to kselftests.

I believe your test suite can be extended to cover the tests I implemented
and more, so I am not super attached to my attempt, but it would be good
(I think) to converge to something that we can maintain for the future, so
maybe have a plan to possibly merge the suites. What do you think?

1 - https://github.com/jlelli/linux/commits/experimental/deadline-selftests-scripts/

...

> Testing v3:
>
> The HCBS mechanism has been evaluated on several syntetic tests which are
> designed to stress the HCBS scheduler and verify that non-interference and
> mathematical schedulability guarantees are really enforced by the scheduling
> algorithm.

...

> The tasksets minimal requirements were computed using a closed-source software,
> explaining why the tasksets are supplied separately. A open-source analyser is
> being written to update this step in the future and also allow for more
> customization for the testers.

On this (generation of random tasksets and corresponding schedulability
checks) I also started working on a different tool, rt-audit.

https://github.com/jlelli/rt-audit

It's very simple, the tool, the random generation and the checks. But I
honestly like the idea that it's based on rt-app. Please take a look if
you have a chance and tell me what you think. Again, I feel it would be
good to converge towards something open that we can maintain.

I will be trying to find time to continue testing and reviewing this RFC
of course.

Thanks,
Juri
On 02/10/25 10:00, Juri Lelli wrote:
> Hello!
>
> On 29/09/25 11:21, Yuri Andriaccio wrote:
> > Hello,
> >
> > This is the v3 for Hierarchical Constant Bandwidth Server, aiming at replacing
> > the current RT_GROUP_SCHED mechanism with something more robust and
> > theoretically sound. The patchset has been presented at OSPM25
> > (https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
> > be found at https://lwn.net/Articles/1021332/ . You can find the previous
> > versions of this patchset at the bottom of the page, in particular version 1
> > which talks in more detail what this patchset is all about and how it is
> > implemented.
> >
> > This v3 version further reworks some of the patches as suggested by Juri Lelli.
> > While most of the work is refactorings, the following were also changed:
> > - The first patch which removed fair-servers' bandwidth accounting has been
> > removed, as it was deemed wrong. You can find the last version of this removed
> > patch, just for history reasons, here:
> > https://lore.kernel.org/all/20250903114448.664452-1-yurand2000@gmail.com/
>
> Peter wasn't indeed happy with that patch, but I am not sure we finished
> that discussion. Both myself and Luca had further objections to what
> Peter said, but not further replies after (which can very well be a sign
> that he is still adamnt in saying no go away :). Peter?
>
> https://lore.kernel.org/lkml/aLk9BNnFYZ3bhVAE@jlelli-thinkpadt14gen4.remote.csb/
> https://lore.kernel.org/lkml/20250904091217.78de3dde@luca64/

I had a quick chat with Peter on IRC about this. We now seem to agree that
a third option would be to move to explicitly account dl-server(s),
correspondingly moving from a 95% to 100% limit. That would also make our
life easier in the future with additional dl-servers (e.g. scx-server).

What do you think?
Hi Juri,
On Mon, 20 Oct 2025 11:40:22 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
[...]
> > > - The first patch which removed fair-servers' bandwidth
> > > accounting has been removed, as it was deemed wrong. You can find
> > > the last version of this removed patch, just for history reasons,
> > > here:
> > > https://lore.kernel.org/all/20250903114448.664452-1-yurand2000@gmail.com/
> > >
> >
> > Peter wasn't indeed happy with that patch, but I am not sure we
> > finished that discussion. Both myself and Luca had further
> > objections to what Peter said, but not further replies after (which
> > can very well be a sign that he is still adamnt in saying no go
> > away :). Peter?
> >
> > https://lore.kernel.org/lkml/aLk9BNnFYZ3bhVAE@jlelli-thinkpadt14gen4.remote.csb/
> > https://lore.kernel.org/lkml/20250904091217.78de3dde@luca64/
>
> I had a quick chat with Peter on IRC about this. We now seem to agree
> that a third option would be to move to explicitly account
> dl-server(s), correspondingly moving from a 95% to 100% limit. That
> would also make our life easier in the future with additional
> dl-servers (e.g. scx-server).
>
> What do you think?
This looks like another good solution, thanks!
So, if I understand well with this approach
/proc/sys/kernel/sched_rt_{runtime, period}_us would be set to 100% as
a default, right?
It is often useful to know what is the maximum CPU utilization that can
be guaranteed to real-time tasks... With this approach, it would be
100% - <dl_server utilization>, but this can change when scx servers are
added... What about making this information available to userspace
programs? (maybe /proc/sys/kernel/sched_rt_{runtime, period}_us could
provide such information? Or is it better to add a new interface?)
Thanks,
Luca
On 24/10/25 10:02, luca abeni wrote:
> Hi Juri,
>
> On Mon, 20 Oct 2025 11:40:22 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> [...]
> > > > - The first patch which removed fair-servers' bandwidth
> > > > accounting has been removed, as it was deemed wrong. You can find
> > > > the last version of this removed patch, just for history reasons,
> > > > here:
> > > > https://lore.kernel.org/all/20250903114448.664452-1-yurand2000@gmail.com/
> > > >
> > >
> > > Peter wasn't indeed happy with that patch, but I am not sure we
> > > finished that discussion. Both myself and Luca had further
> > > objections to what Peter said, but not further replies after (which
> > > can very well be a sign that he is still adamnt in saying no go
> > > away :). Peter?
> > >
> > > https://lore.kernel.org/lkml/aLk9BNnFYZ3bhVAE@jlelli-thinkpadt14gen4.remote.csb/
> > > https://lore.kernel.org/lkml/20250904091217.78de3dde@luca64/
> >
> > I had a quick chat with Peter on IRC about this. We now seem to agree
> > that a third option would be to move to explicitly account
> > dl-server(s), correspondingly moving from a 95% to 100% limit. That
> > would also make our life easier in the future with additional
> > dl-servers (e.g. scx-server).
> >
> > What do you think?
>
> This looks like another good solution, thanks!
>
> So, if I understand well with this approach
> /proc/sys/kernel/sched_rt_{runtime, period}_us would be set to 100% as
> a default, right?
>
> It is often useful to know what is the maximum CPU utilization that can
> be guaranteed to real-time tasks... With this approach, it would be
> 100% - <dl_server utilization>, but this can change when scx servers are
> added... What about making this information available to userspace
> programs? (maybe /proc/sys/kernel/sched_rt_{runtime, period}_us could
> provide such information? Or is it better to add a new interface?)
Not sure. If we set it to 100% by default (as you suggest, which makes
sense to me) I wonder what would be a usecase/need to set it to less
than 100% later on. We have the debug interface for tweaking dl-servers
and sched_rt_ interface doesn't distinguish between DEADLINE and RT
anyway (so no way to leave some "bandwidth" around for RT tasks).
Maybe it's an interface we want to start deprecating and we can come up
with something better and/or more useful? Peter?
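To put numbers on the arithmetic discussed above: if the sched_rt_ limit moves
to 100% and dl-servers are accounted explicitly, the utilization still
guaranteeable to real-time tasks is that limit minus the sum of the dl-server
utilizations. A minimal sketch of that computation follows; the 5% fair-server
figure and the zero scx-server figure are assumed example values, not
authoritative defaults:

#include <stdio.h>

int main(void)
{
	/* Assumed example values, not authoritative defaults. */
	double rt_limit      = 1.00;	/* proposed sched_rt_ limit: 100% */
	double fair_server   = 0.05;	/* fair dl-server reservation per CPU */
	double other_servers = 0.00;	/* e.g. a future scx-server */

	double guaranteed_rt = rt_limit - fair_server - other_servers;

	printf("utilization guaranteeable to RT tasks per CPU: %.0f%%\n",
	       100.0 * guaranteed_rt);
	return 0;
}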
On 02/10/25 10:00, Juri Lelli wrote:

...

> I will be trying to find time to continue testing and reviewing this RFC
> of course.

I provided comments up to the last part of the series that implements
migration. I will now stop on purpose and wait for replies/new versions
of the patches I reviewed. I believe we could try to leave migration
"for later" while we solidify the basics. What do you think?

Thanks,
Juri
Hi,

> I provided comments up to the last part of the series that implements
> migration. I will now stop on purpose and wait for replies/new versions of the
> patches I reviewed. I believe we could try to leave migration "for later"
> while we solidify the basics. What do you think?

Thanks a lot for your comments. It also makes sense to us to solidify the
basics before moving on to the migration code and later updates. We have sent
the migration related patches just to show the end goal of this patchset, give
some intuition of possible use cases, and demonstrate the provided guarantees.

We will work on your comments and submit a v4 patchset soon.

Have a nice day,
Yuri