Hello,
This is v3 of the Hierarchical Constant Bandwidth Server patchset, aiming at replacing
the current RT_GROUP_SCHED mechanism with something more robust and
theoretically sound. The patchset has been presented at OSPM25
(https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
be found at https://lwn.net/Articles/1021332/ . Links to the previous versions
of this patchset are at the bottom of this message; version 1 in particular
describes in more detail what this patchset is all about and how it is
implemented.
This v3 further reworks some of the patches as suggested by Juri Lelli. Most of
the work is refactoring, but the following also changed:
- The first patch, which removed the fair-servers' bandwidth accounting, has
been dropped, as it was deemed wrong. The last version of this dropped patch,
kept for historical reference, is here:
https://lore.kernel.org/all/20250903114448.664452-1-yurand2000@gmail.com/
- A left-over check which prevented execution of some of the wakeup_preempt
code has been removed.
- The cgroup pull code was erroneously comparing cgroup tasks with non-cgroup
tasks; this has been fixed.
- The allocation/deallocation code for rt cgroups has been checked and reworked
to make sure that resources are managed correctly in all the code paths.
- The signatures of some cgroup-migration-related functions were changed to
match their non-group counterparts more closely.
- Descriptions and documentation were added where necessary, in particular for
preemption rules in wakeup_preempt.
For this v3 version we've also polished the testing system we are using and made
it public for testers to run on their own machines. The source code can be found
at https://github.com/Yurand2000/HCBS-Test-Suite , along with a README that
explains how to use it. A description of the tools and instructions for running
them are also reported later in this message.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Summary of the patches:
1-4) Preparation patches, so that the RT classes' code can be used both
for normal and cgroup scheduling.
5-15) Implementation of HCBS, without migration and with only a one-level hierarchy.
The old RT_GROUP_SCHED code is removed.
16-17) Remove cgroups v1 in favour of v2.
18) Add support for deeper hierarchies.
19-24) Add support for task migration.
Updates from v2:
- Rebase to latest tip/master.
- Remove fair-servers' bw reclaiming.
- Fix a check which prevented execution of wakeup_preempt code.
- Fix a priority check in group_pull_rt_task between tasks of different groups.
- Rework allocation/deallocation code for rt-cgroups.
- Update signatures for some group related migration functions.
- Add documentation for wakeup_preempt preemption rules.
Updates from v1:
- Rebase to latest tip/master.
- Add migration code.
- Split big patches for more readability.
- Refactor code to use guarded locks where applicable.
- Remove unnecessary patches from v1 which have been addressed differently by
mainline updates.
- Remove unnecessary checks and general code cleanup.
Notes:
Task migration support needs some extra work to reduce its invasiveness,
especially in patches 21-22.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing v3:
The HCBS mechanism has been evaluated on several synthetic tests which are
designed to stress the HCBS scheduler and verify that non-interference and
mathematical schedulability guarantees are really enforced by the scheduling
algorithm.
The test suite currently runs different categories of tests:
- Constraints, which assert that hard constraints, such as schedulability
conditions, are respected.
- Regression, to check that HCBS does not break anything that already exists.
- Stress, to repeatedly invoke the scheduler through all the exposed
interfaces, with the goal of detecting bugs and, more importantly, race
conditions.
- Time, simple benchmarks to assert that the dl_servers work correctly, i.e.
they allocate the correct amount of bandwidth, and that the migration code
allows the cgroup's allocated bandwidth to be fully utilized.
- Taskset: given a set of (generated) periodic tasks and their bandwidth
requirements, schedulability analyses are performed to decide whether or not a
given hardware configuration can run the taskset. In particular, for each
taskset, an HCBS cgroup configuration along with the number of necessary CPUs
is generated. These are mathematically guaranteed to be schedulable.
The next step of this test suite is to configure cgroups as computed and to
run the taskset, to verify that the HCBS implementation works as intended and
that the scheduling overheads are within reasonable bounds.
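To illustrate the kind of admission reasoning involved (the analyser actually
used and the exact analyses it performs are not described here), below is a
minimal, purely utilization-based sketch in C; the task parameters, CPU count
and 90% per-CPU reservation are made-up example values:

#include <stdio.h>

/*
 * Hypothetical example: three periodic tasks (WCET, period, in us) running
 * inside an HCBS-style reservation of 90000us every 100000us per CPU (90%).
 */
struct task {
	double wcet_us;
	double period_us;
};

int main(void)
{
	struct task ts[] = {
		{ 2000.0, 10000.0 },	/* U = 0.20 */
		{ 5000.0, 20000.0 },	/* U = 0.25 */
		{ 3000.0, 30000.0 },	/* U = 0.10 */
	};
	int ncpus = 2;
	double runtime_us = 90000.0, period_us = 100000.0;
	double reserved = ncpus * (runtime_us / period_us);
	double total_u = 0.0;
	size_t i;

	for (i = 0; i < sizeof(ts) / sizeof(ts[0]); i++)
		total_u += ts[i].wcet_us / ts[i].period_us;

	/* Necessary (not sufficient) condition: the taskset's total
	 * utilization must fit in the bandwidth reserved on the CPUs. */
	printf("U = %.2f, reserved = %.2f -> %s\n", total_u, reserved,
	       total_u <= reserved ? "may be schedulable" : "not schedulable");

	return 0;
}

A bound like this is only a necessary condition on multiple CPUs; the generated
configurations mentioned above rely on proper schedulability analyses to
actually guarantee the tasksets.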
The source code can be found at https://github.com/Yurand2000/HCBS-Test-Suite .
The README file should answer most, if not all, questions, but here is a brief
outline of the pipeline to run these tests:
- Get the HCBS patch up and running. Any kernel/distro should work effortlessly.
- Get, compile and _install_ the tests.
- Download the additional taskset files and extract them into the _install_
folder. You can find them here:
https://github.com/Yurand2000/HCBS-Test-Suite/releases/tag/250926
- Run the `run_tests.sh full` script, to run the whole test suite.
Expect a total runtime of ~3 hours. The script will automatically mount the
cgroup and debug filesystems (if not already mounted) and will move all the
already running SCHED_FIFO/SCHED_RR tasks into the root cgroup, so that the
cgroups' CPU controller can be mounted. It will additionally try to reserve all
the available rt-bandwidth for cgroups (i.e. 90%) to run the later tests, so
make sure that there are no running SCHED_DEADLINE tasks if the script fails to
set up.
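For reference, the rt-bandwidth mentioned above is governed by the standard
sched_rt_runtime_us/sched_rt_period_us sysctls; the small standalone sketch
below (not part of the test suite) reads them and prints the currently
configured fraction, which can help diagnose a failed setup:

#include <stdio.h>

/* Read a single integer from a sysctl file; returns -2 on error. */
static long read_sysctl(const char *path)
{
	FILE *f = fopen(path, "r");
	long val = -2;

	if (f) {
		if (fscanf(f, "%ld", &val) != 1)
			val = -2;
		fclose(f);
	}
	return val;
}

int main(void)
{
	long runtime = read_sysctl("/proc/sys/kernel/sched_rt_runtime_us");
	long period  = read_sysctl("/proc/sys/kernel/sched_rt_period_us");

	if (runtime == -2 || period <= 0) {
		fprintf(stderr, "could not read RT bandwidth sysctls\n");
		return 1;
	}
	if (runtime < 0) {	/* -1 means RT throttling is disabled */
		printf("RT throttling disabled\n");
		return 0;
	}
	printf("global RT bandwidth: %ld/%ld us (%.1f%%)\n",
	       runtime, period, 100.0 * runtime / period);
	return 0;
}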
Some tests need a minimum number of CPU cores, up to a maximum of eight. If
your machine has fewer CPUs, those tests will simply be skipped.
Notes:
The tasksets' minimal requirements were computed using closed-source software,
which is why the tasksets are supplied separately. An open-source analyser is
being written to cover this step in the future and also allow for more
customization by testers.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Future Work:
While we wait for more comments, and expect stuff to break, we will work on
completing the currently partial/untested implementation of HCBS with different
runtimes per CPU (instead of having the same runtime allocated on all CPUs), to
include it in a future RFC.
Future patches:
- HCBS with different runtimes per CPU.
- Capacity-aware bandwidth reservation.
- Enabling/disabling dl_servers when a CPU goes online/offline.
Have a nice day,
Yuri
v1: https://lore.kernel.org/all/20250605071412.139240-1-yurand2000@gmail.com/
v2: https://lore.kernel.org/all/20250731105543.40832-1-yurand2000@gmail.com/
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Yuri Andriaccio (6):
sched/rt: Disable RT_GROUP_SCHED
sched/rt: Add rt-cgroups' dl-servers operations.
sched/rt: Update task event callbacks for HCBS scheduling
sched/rt: Allow zeroing the runtime of the root control group
sched/rt: Remove support for cgroups-v1
sched/core: Execute enqueued balance callbacks when migrating task
between cgroups
luca abeni (18):
sched/deadline: Do not access dl_se->rq directly
sched/deadline: Distinct between dl_rq and my_q
sched/rt: Pass an rt_rq instead of an rq where needed
sched/rt: Move some functions from rt.c to sched.h
sched/rt: Introduce HCBS specific structs in task_group
sched/core: Initialize root_task_group
sched/deadline: Add dl_init_tg
sched/rt: Add {alloc/free}_rt_sched_group
sched/deadline: Account rt-cgroups bandwidth in deadline tasks
schedulability tests.
sched/rt: Update rt-cgroup schedulability checks
sched/rt: Remove old RT_GROUP_SCHED data structures
sched/core: Cgroup v2 support
sched/deadline: Allow deeper hierarchies of RT cgroups
sched/rt: Add rt-cgroup migration
sched/rt: Add HCBS migration related checks and function calls
sched/deadline: Make rt-cgroup's servers pull tasks on timer
replenishment
sched/deadline: Fix HCBS migrations on server stop
sched/core: Execute enqueued balance callbacks when changing allowed
CPUs
include/linux/sched.h | 10 +-
kernel/sched/autogroup.c | 4 +-
kernel/sched/core.c | 65 +-
kernel/sched/deadline.c | 251 +++-
kernel/sched/debug.c | 6 -
kernel/sched/fair.c | 6 +-
kernel/sched/rt.c | 3069 +++++++++++++++++++-------------------
kernel/sched/sched.h | 150 +-
kernel/sched/syscalls.c | 6 +-
9 files changed, 1850 insertions(+), 1717 deletions(-)
base-commit: cec1e6e5d1ab33403b809f79cd20d6aff124ccfe
--
2.51.0
Hello!

On 29/09/25 11:21, Yuri Andriaccio wrote:
> Hello,
>
> This is the v3 for Hierarchical Constant Bandwidth Server, aiming at replacing
> the current RT_GROUP_SCHED mechanism with something more robust and
> theoretically sound. The patchset has been presented at OSPM25
> (https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
> be found at https://lwn.net/Articles/1021332/ . You can find the previous
> versions of this patchset at the bottom of the page, in particular version 1
> which talks in more detail what this patchset is all about and how it is
> implemented.
>
> This v3 version further reworks some of the patches as suggested by Juri Lelli.
> While most of the work is refactorings, the following were also changed:
> - The first patch which removed fair-servers' bandwidth accounting has been
> removed, as it was deemed wrong. You can find the last version of this removed
> patch, just for history reasons, here:
> https://lore.kernel.org/all/20250903114448.664452-1-yurand2000@gmail.com/

Peter wasn't indeed happy with that patch, but I am not sure we finished
that discussion. Both myself and Luca had further objections to what Peter
said, but no further replies after (which can very well be a sign that he
is still adamant in saying no go away :). Peter?

https://lore.kernel.org/lkml/aLk9BNnFYZ3bhVAE@jlelli-thinkpadt14gen4.remote.csb/
https://lore.kernel.org/lkml/20250904091217.78de3dde@luca64/

...

> For this v3 version we've also polished the testing system we are using and made
> it public for testers to run on their own machines. The source code can be found
> at https://github.com/Yurand2000/HCBS-Test-Suite , along with a README that
> explains how to use it. Nonetheless I've reported a description of the tools and
> instruction later in the page.

Thanks for this. Quite cool.

I tried to run the tests, but it looks like the migration set brought my
qemu 8 CPUs system to a halt (had to kill the thing). I will try again.

I have been working a bit on trying to come up with a testing framework
for SCHED_DEADLINE, which I unfortunately haven't posted yet, mostly
because I was waiting for the discussion about the patch mentioned above
to settle (which would require adaptation in the tests that check for
bandwidth limits). You can find the idea here [1]. It's thought to be an
addition to kselftests.

I believe your test suite can be extended to cover the tests I implemented
and more, so I am not super attached to my attempt, but it would be good
(I think) to converge to something that we can maintain for the future, so
maybe have a plan to possibly merge the suites. What do you think?

1 - https://github.com/jlelli/linux/commits/experimental/deadline-selftests-scripts/

...

> Testing v3:
>
> The HCBS mechanism has been evaluated on several syntetic tests which are
> designed to stress the HCBS scheduler and verify that non-interference and
> mathematical schedulability guarantees are really enforced by the scheduling
> algorithm.

...

> The tasksets minimal requirements were computed using a closed-source software,
> explaining why the tasksets are supplied separately. A open-source analyser is
> being written to update this step in the future and also allow for more
> customization for the testers.

On this (generation of random tasksets and corresponding schedulability
checks) I also started working on a different tool, rt-audit.

https://github.com/jlelli/rt-audit

It's very simple, the tool, the random generation and the checks. But I
honestly like the idea that it's based on rt-app. Please take a look if
you have a chance and tell me what you think. Again, I feel it would be
good to converge towards something open that we can maintain.

I will be trying to find time to continue testing and reviewing this RFC
of course.

Thanks,
Juri
On 02/10/25 10:00, Juri Lelli wrote:
> Hello!
>
> On 29/09/25 11:21, Yuri Andriaccio wrote:
> > Hello,
> >
> > This is the v3 for Hierarchical Constant Bandwidth Server, aiming at replacing
> > the current RT_GROUP_SCHED mechanism with something more robust and
> > theoretically sound. The patchset has been presented at OSPM25
> > (https://retis.sssup.it/ospm-summit/), and a summary of its inner workings can
> > be found at https://lwn.net/Articles/1021332/ . You can find the previous
> > versions of this patchset at the bottom of the page, in particular version 1
> > which talks in more detail what this patchset is all about and how it is
> > implemented.
> >
> > This v3 version further reworks some of the patches as suggested by Juri Lelli.
> > While most of the work is refactorings, the following were also changed:
> > - The first patch which removed fair-servers' bandwidth accounting has been
> > removed, as it was deemed wrong. You can find the last version of this removed
> > patch, just for history reasons, here:
> > https://lore.kernel.org/all/20250903114448.664452-1-yurand2000@gmail.com/
>
> Peter wasn't indeed happy with that patch, but I am not sure we finished
> that discussion. Both myself and Luca had further objections to what
> Peter said, but not further replies after (which can very well be a sign
> that he is still adamnt in saying no go away :). Peter?
>
> https://lore.kernel.org/lkml/aLk9BNnFYZ3bhVAE@jlelli-thinkpadt14gen4.remote.csb/
> https://lore.kernel.org/lkml/20250904091217.78de3dde@luca64/

I had a quick chat with Peter on IRC about this. We now seem to agree that
a third option would be to move to explicitly account dl-server(s),
correspondingly moving from a 95% to 100% limit. That would also make our
life easier in the future with additional dl-servers (e.g. scx-server).

What do you think?
Hi Juri,
On Mon, 20 Oct 2025 11:40:22 +0200
Juri Lelli <juri.lelli@redhat.com> wrote:
[...]
> > > - The first patch which removed fair-servers' bandwidth
> > > accounting has been removed, as it was deemed wrong. You can find
> > > the last version of this removed patch, just for history reasons,
> > > here:
> > > https://lore.kernel.org/all/20250903114448.664452-1-yurand2000@gmail.com/
> > >
> >
> > Peter wasn't indeed happy with that patch, but I am not sure we
> > finished that discussion. Both myself and Luca had further
> > objections to what Peter said, but not further replies after (which
> > can very well be a sign that he is still adamnt in saying no go
> > away :). Peter?
> >
> > https://lore.kernel.org/lkml/aLk9BNnFYZ3bhVAE@jlelli-thinkpadt14gen4.remote.csb/
> > https://lore.kernel.org/lkml/20250904091217.78de3dde@luca64/
>
> I had a quick chat with Peter on IRC about this. We now seem to agree
> that a third option would be to move to explicitly account
> dl-server(s), correspondingly moving from a 95% to 100% limit. That
> would also make our life easier in the future with additional
> dl-servers (e.g. scx-server).
>
> What do you think?
This looks like another good solution, thanks!
So, if I understand well with this approach
/proc/sys/kernel/sched_rt_{runtime, period}_us would be set to 100% as
a default, right?
It is often useful to know what is the maximum CPU utilization that can
be guaranteed to real-time tasks... With this approach, it would be
100% - <dl_server utilization>, but this can change when scx servers are
added... What about making this information available to userspace
programs? (maybe /proc/sys/kernel/sched_rt_{runtime, period}_us could
provide such information? Or is it better to add a new interface?)
Thanks,
Luca
On 24/10/25 10:02, luca abeni wrote:
> Hi Juri,
>
> On Mon, 20 Oct 2025 11:40:22 +0200
> Juri Lelli <juri.lelli@redhat.com> wrote:
> [...]
> > > > - The first patch which removed fair-servers' bandwidth
> > > > accounting has been removed, as it was deemed wrong. You can find
> > > > the last version of this removed patch, just for history reasons,
> > > > here:
> > > > https://lore.kernel.org/all/20250903114448.664452-1-yurand2000@gmail.com/
> > > >
> > >
> > > Peter wasn't indeed happy with that patch, but I am not sure we
> > > finished that discussion. Both myself and Luca had further
> > > objections to what Peter said, but not further replies after (which
> > > can very well be a sign that he is still adamnt in saying no go
> > > away :). Peter?
> > >
> > > https://lore.kernel.org/lkml/aLk9BNnFYZ3bhVAE@jlelli-thinkpadt14gen4.remote.csb/
> > > https://lore.kernel.org/lkml/20250904091217.78de3dde@luca64/
> >
> > I had a quick chat with Peter on IRC about this. We now seem to agree
> > that a third option would be to move to explicitly account
> > dl-server(s), correspondingly moving from a 95% to 100% limit. That
> > would also make our life easier in the future with additional
> > dl-servers (e.g. scx-server).
> >
> > What do you think?
>
> This looks like another good solution, thanks!
>
> So, if I understand well with this approach
> /proc/sys/kernel/sched_rt_{runtime, period}_us would be set to 100% as
> a default, right?
>
> It is often useful to know what is the maximum CPU utilization that can
> be guaranteed to real-time tasks... With this approach, it would be
> 100% - <dl_server utilization>, but this can change when scx servers are
> added... What about making this information available to userspace
> programs? (maybe /proc/sys/kernel/sched_rt_{runtime, period}_us could
> provide such information? Or is it better to add a new interface?)
Not sure. If we set it to 100% by default (as you suggest, which makes
sense to me) I wonder what would be a usecase/need to set it to less
than 100% later on. We have the debug interface for tweaking dl-servers
and sched_rt_ interface doesn't distinguish between DEADLINE and RT
anyway (so no way to leave some "bandwidth" around for RT tasks).
Maybe it's an interface we want to start deprecating and we can come up
with something better and/or more useful? Peter?
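To put numbers on the arithmetic discussed above: if the sched_rt_ limit moves
to 100% and dl-servers are accounted explicitly, the utilization still
guaranteeable to real-time tasks is that limit minus the sum of the dl-server
utilizations. A minimal sketch of that computation follows; the 5% fair-server
figure and the zero scx-server figure are assumed example values, not
authoritative defaults:

#include <stdio.h>

int main(void)
{
	/* Assumed example values, not authoritative defaults. */
	double rt_limit      = 1.00;	/* proposed sched_rt_ limit: 100% */
	double fair_server   = 0.05;	/* fair dl-server reservation per CPU */
	double other_servers = 0.00;	/* e.g. a future scx-server */

	double guaranteed_rt = rt_limit - fair_server - other_servers;

	printf("utilization guaranteeable to RT tasks per CPU: %.0f%%\n",
	       100.0 * guaranteed_rt);
	return 0;
}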
On 02/10/25 10:00, Juri Lelli wrote:

...

> I will be trying to find time to continue testing and reviewing this RFC
> of course.

I provided comments up to the last part of the series that implements
migration. I will now stop on purpose and wait for replies/new versions
of the patches I reviewed. I believe we could try to leave migration
"for later" while we solidify the basics. What do you think?

Thanks,
Juri
Hi,

> I provided comments up to the last part of the series that implements
> migration. I will now stop on purpose and wait for replies/new versions of the
> patches I reviewed. I believe we could try to leave migration "for later"
> while we solidify the basics. What do you think?

Thanks a lot for your comments. It also makes sense to us to solidify the
basics before moving on to the migration code and later updates. We have sent
the migration related patches just to show the end goal of this patchset, give
some intuition of possible use cases, and demonstrate the provided guarantees.

We will work on your comments and submit a v4 patchset soon.

Have a nice day,
Yuri