Hi Ingo, Peter, Juri, Vincent,
This patch introduces a new scheduler feature, RT_SUPPRESS_FAIR_SERVER,
designed to ensure strict NOHZ_FULL isolation for SCHED_FIFO workloads,
particularly in the presence of resident CFS tasks.
In strictly partitioned, latency-critical environments (such as High
Frequency Trading platforms) administrators frequently employ fully
adaptive-tick CPUs to execute pinned SCHED_FIFO workloads. The fundamental
requirement is "zero OS noise"; specifically, the scheduler clock-tick must
remain suppressed ("offloaded"), given that standard SCHED_FIFO semantics
dictate no forced preemption between tasks of identical priority.
However, the extant "Fair Server" (Deadline Server) architecture
compromises this isolation guarantee. At present, should a background
SCHED_OTHER task be enqueued, the scheduler initiates the Fair Server
(dl_server_start). As the Fair Server functions as a SCHED_DEADLINE entity,
its activation increments rq->dl.dl_nr_running.
This condition compels sched_can_stop_tick() to return false, thereby
restarting the periodic tick to enforce the server's runtime.
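For reference, the check in question lives in sched_can_stop_tick(); the
fragment below paraphrases the relevant portion (trimmed, and field names
may differ slightly across kernel versions), rather than quoting it
verbatim:

	/* Paraphrase of kernel/sched/core.c:sched_can_stop_tick() */
	if (rq->dl.dl_nr_running)
		return false;	/* any DL entity, the fair server included, keeps the tick */

	/*
	 * Absent DL and SCHED_RR activity, SCHED_FIFO tasks alone permit the
	 * tick to stop: there is no forced preemption between FIFO tasks of
	 * equal priority.
	 */
	if (rq->rt.rt_nr_running - rq->rt.rr_nr_running)
		return true;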
To address this, the patch introduces a new scheduler feature control,
RT_SUPPRESS_FAIR_SERVER.
When engaged, this modification amends enqueue_task_fair() to forego the
invocation of dl_server_start() if, and only if, the following conditions
are met:
1. A Real-Time task (SCHED_FIFO/SCHED_RR) is currently in execution
2. RT bandwidth enforcement (rt_bandwidth_enabled()) is inactive
By precluding the server's initiation, rq->dl.dl_nr_running is maintained
at zero. This permits the tick logic to defer to the standard SCHED_FIFO
protocol, thereby ensuring the tick remains suppressed.
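To sketch the shape of that change (an illustrative fragment only, not the
actual hunk; the real patch follows in the series):

	/*
	 * Illustrative sketch, not the patch itself: skip starting the fair
	 * server while an RT task owns this CPU and RT bandwidth enforcement
	 * is off, so rq->dl.dl_nr_running stays at zero and the tick may
	 * remain stopped.
	 */
	if (!(sched_feat(RT_SUPPRESS_FAIR_SERVER) &&
	      rt_task(rq->curr) && !rt_bandwidth_enabled()))
		dl_server_start(&rq->fair_server);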
Considerations: This serves as a precision instrument for specialised
contexts. It explicitly prioritises determinism over fairness. Whilst
enabled, queued CFS tasks shall endure total starvation until such time as
the RT task voluntarily yields. I believe this is acceptable for
partitioned architectures where housekeeping duties are allocated to
alternative cores; however, I have guarded this capability within
CONFIG_NO_HZ_FULL and a default-disabled feature flag to obviate the risk
of inadvertent starvation on general-purpose systems.
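For completeness, the kernel/sched/features.h side would be a default-off
entry along these lines (again a sketch rather than the literal hunk):

#ifdef CONFIG_NO_HZ_FULL
/*
 * Do not start the fair deadline server while an RT task is running and
 * RT bandwidth enforcement is off; trades CFS fairness for an undisturbed
 * adaptive tick on isolated CPUs. Disabled by default.
 */
SCHED_FEAT(RT_SUPPRESS_FAIR_SERVER, false)
#endif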
I welcome your thoughts on this approach.
Aaron Tomlin (1):
sched/fair: Introduce RT_SUPPRESS_FAIR_SERVER to optimise NOHZ_FULL
isolation
kernel/sched/fair.c | 19 ++++++++++++++++++-
kernel/sched/features.h | 9 +++++++++
2 files changed, 27 insertions(+), 1 deletion(-)
--
2.51.0
On 1/6/26 9:12 AM, Aaron Tomlin wrote:
> Hi Ingo, Peter, Juri, Vincent,
>
> This patch introduces a new scheduler feature, RT_SUPPRESS_FAIR_SERVER,
> designed to ensure strict NOHZ_FULL isolation for SCHED_FIFO workloads,
> particularly in the presence of resident CFS tasks.
>
> In strictly partitioned, latency-critical environments (such as High
> Frequency Trading platforms) administrators frequently employ fully
> adaptive-tick CPUs to execute pinned SCHED_FIFO workloads. The fundamental
> requirement is "zero OS noise"; specifically, the scheduler clock-tick must
> remain suppressed ("offloaded"), given that standard SCHED_FIFO semantics
> dictate no forced preemption between tasks of identical priority.
If all your SCHED_FIFO is pinned and their scheduling decisions
are managed in userspace, using isolcpus would offer you better
isolations compared to nohz_full.
>
> However, the extant "Fair Server" (Deadline Server) architecture
> compromises this isolation guarantee. At present, should a background
> SCHED_OTHER task be enqueued, the scheduler initiates the Fair Server
> (dl_server_start). As the Fair Server functions as a SCHED_DEADLINE entity,
> its activation increments rq->dl.dl_nr_running.
>
There is runtime allocated to fair server. If you make them 0 on CPUs of
interest, wouldn't that work?
/sys/kernel/debug/sched/fair_server/<cpu>/runtime
> This condition compels sched_can_stop_tick() to return false, thereby
> restarting the periodic tick to enforce the server's runtime.
> To address this, the patch introduces a new scheduler feature control,
> RT_SUPPRESS_FAIR_SERVER.
>
> When engaged, this modification amends enqueue_task_fair() to forego the
> invocation of dl_server_start() if, and only if, the following conditions
> are met:
>
> 1. A Real-Time task (SCHED_FIFO/SCHED_RR) is currently in execution
> 2. RT bandwidth enforcement (rt_bandwidth_enabled()) is inactive
>
> By precluding the server's initiation, rq->dl.dl_nr_running is maintained
> at zero. This permits the tick logic to defer to the standard SCHED_FIFO
> protocol, thereby ensuring the tick remains suppressed.
>
> Considerations: This serves as a precision instrument for specialised
> contexts. It explicitly prioritises determinism over fairness. Whilst
> enabled, queued CFS tasks shall endure total starvation until such time as
> the RT task voluntarily yields. I believe this is acceptable for
> partitioned architectures where housekeeping duties are allocated to
> alternative cores; however, I have guarded this capability within
> CONFIG_NO_HZ_FULL and a default-disabled feature flag to obviate the risk
> of inadvertent starvation on general-purpose systems.
>
> I welcome your thoughts on this approach.
>
>
> Aaron Tomlin (1):
> sched/fair: Introduce RT_SUPPRESS_FAIR_SERVER to optimise NOHZ_FULL
> isolation
>
> kernel/sched/fair.c | 19 ++++++++++++++++++-
> kernel/sched/features.h | 9 +++++++++
> 2 files changed, 27 insertions(+), 1 deletion(-)
>
On 06/01/26 14:37, Shrikanth Hegde wrote:
> On 1/6/26 9:12 AM, Aaron Tomlin wrote:
>> Hi Ingo, Peter, Juri, Vincent,
>>
>> This patch introduces a new scheduler feature, RT_SUPPRESS_FAIR_SERVER,
>> designed to ensure strict NOHZ_FULL isolation for SCHED_FIFO workloads,
>> particularly in the presence of resident CFS tasks.
>>
>> In strictly partitioned, latency-critical environments (such as High
>> Frequency Trading platforms) administrators frequently employ fully
>> adaptive-tick CPUs to execute pinned SCHED_FIFO workloads. The fundamental
>> requirement is "zero OS noise"; specifically, the scheduler clock-tick must
>> remain suppressed ("offloaded"), given that standard SCHED_FIFO semantics
>> dictate no forced preemption between tasks of identical priority.
>
> If all your SCHED_FIFO is pinned and their scheduling decisions
> are managed in userspace, using isolcpus would offer you better
> isolations compared to nohz_full.
>
Right, that's the part I don't get; why not use CPU isolation / cpusets to
isolate the CPUs running those NOHZ_FULL applications? Regardless of the
deadline server, if CFS tasks get scheduled on the same CPU as your
latency-sensitive tasks then something's not right.
On Tue, 6 Jan 2026 at 16:38, Valentin Schneider <vschneid@redhat.com> wrote:
> On 06/01/26 14:37, Shrikanth Hegde wrote:
> > On 1/6/26 9:12 AM, Aaron Tomlin wrote:
> >> Hi Ingo, Peter, Juri, Vincent,
> >>
> >> This patch introduces a new scheduler feature, RT_SUPPRESS_FAIR_SERVER,
> >> designed to ensure strict NOHZ_FULL isolation for SCHED_FIFO workloads,
> >> particularly in the presence of resident CFS tasks.
> >>
> >> In strictly partitioned, latency-critical environments (such as High
> >> Frequency Trading platforms) administrators frequently employ fully
> >> adaptive-tick CPUs to execute pinned SCHED_FIFO workloads. The fundamental
> >> requirement is "zero OS noise"; specifically, the scheduler clock-tick must
> >> remain suppressed ("offloaded"), given that standard SCHED_FIFO semantics
> >> dictate no forced preemption between tasks of identical priority.
> >
> > If all your SCHED_FIFO is pinned and their scheduling decisions
> > are managed in userspace, using isolcpus would offer you better
> > isolations compared to nohz_full.
> >
>
> Right, that's the part I don't get; why not use CPU isolation / cpusets to
> isolate the CPUs running those NOHZ_FULL applications? Regardless of the
> deadline server, if CFS tasks get scheduled on the same CPU as your
> latency-sensitive tasks then something's not right.
Some kernel workers and threaded interrupt handlers can be local/pinned, right?
For example this is usually (was often?) visible with DPDK
applications like FlexRAN/OpenRAN, etc.
And Aaron has mentioned high speed trading before.
On Tue, Jan 06, 2026 at 05:38:17PM +0100, Daniel Vacek wrote:
> On Tue, 6 Jan 2026 at 16:38, Valentin Schneider <vschneid@redhat.com> wrote:
> > On 06/01/26 14:37, Shrikanth Hegde wrote:
> > > On 1/6/26 9:12 AM, Aaron Tomlin wrote:
> > >> Hi Ingo, Peter, Juri, Vincent,
> > >>
> > >> This patch introduces a new scheduler feature, RT_SUPPRESS_FAIR_SERVER,
> > >> designed to ensure strict NOHZ_FULL isolation for SCHED_FIFO workloads,
> > >> particularly in the presence of resident CFS tasks.
> > >>
> > >> In strictly partitioned, latency-critical environments (such as High
> > >> Frequency Trading platforms) administrators frequently employ fully
> > >> adaptive-tick CPUs to execute pinned SCHED_FIFO workloads. The fundamental
> > >> requirement is "zero OS noise"; specifically, the scheduler clock-tick must
> > >> remain suppressed ("offloaded"), given that standard SCHED_FIFO semantics
> > >> dictate no forced preemption between tasks of identical priority.
> > >
> > > If all your SCHED_FIFO is pinned and their scheduling decisions
> > > are managed in userspace, using isolcpus would offer you better
> > > isolations compared to nohz_full.
> > >
> >
> > Right, that's the part I don't get; why not use CPU isolation / cpusets to
> > isolate the CPUs running those NOHZ_FULL applications? Regardless of the
> > deadline server, if CFS tasks get scheduled on the same CPU as your
> > latency-sensitive tasks then something's not right.
>
> Some kernel workers and threaded interrupt handlers can be local/pinned, right?
>
> For example this is usually (was often?) visible with DPDK
> applications like FlexRAN/OpenRAN, etc.
> And Aaron has mentioned high speed trading before.
Hi Valentin, Daniel,
I must offer my apologies for the confusion; I neglected to mention in the
cover letter that isolcpus=domain is indeed deployed in this environment.
Consequently, standard load-balancing is effectively disabled. You are
quite right that standard CFS tasks should not appear on these cores; any
SCHED_NORMAL entities that do appear are not the result of leakage or
misconfiguration, but are rather unavoidable CPU-specific kthreads or
explicit migrations initiated by user-space.
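
For concreteness, the partitioning in question is of roughly the following
shape on the kernel command line (the CPU list here is purely
illustrative):

    isolcpus=domain,2-15 nohz_full=2-15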
Kind regards,
--
Aaron Tomlin
On 07/01/26 15:25, Aaron Tomlin wrote:
> On Tue, Jan 06, 2026 at 05:38:17PM +0100, Daniel Vacek wrote:
>> Some kernel workers and threaded interrupt handlers can be local/pinned, right?
>>
>> For example this is usually (was often?) visible with DPDK
>> applications like FlexRAN/OpenRAN, etc.
>> And Aaron has mentioned high speed trading before.
>
> Hi Valentin, Daniel,
>
> I must offer my apologies for the confusion; I neglected to mention in the
> cover letter that isolcpus=domain is indeed deployed in this environment.
>
> Consequently, standard load-balancing is effectively disabled. You are
> quite right that standard CFS tasks should not appear on these cores; any
> SCHED_NORMAL entities that do appear are not the result of leakage or
> misconfiguration, but are rather unavoidable CPU-specific kthreads or
> explicit migrations initiated by user-space.

Gotcha. Do you have any specific examples for these per-CPU kthreads? We
should have features to prevent most of these (e.g. workqueue cpumasks),
and if not then that's something we could look into.

As for userspace messing things up... Well, not much we can do here, other
than preventing that via e.g. cpusets so only your latency-sensitive tasks
are allowed to be migrated on the isolated CPUs.

> Kind regards,
> --
> Aaron Tomlin
On Fri, Jan 09, 2026 at 10:21:07AM +0100, Valentin Schneider wrote:
> Gotcha. Do you have any specific examples for these per-CPU kthreads? We
> should have features to prevent most of these (e.g. workqueue cpumasks),
> and if not then that's something we could look into.
>
> As for userspace messing things up... Well, not much we can do here, other
> than preventing that via e.g. cpusets so only your latency-sensitive tasks
> are allowed to be migrated on the isolated CPUs.

Hi Valentin,

To your point regarding specific examples of these per-CPU kthreads, I do
not have any illustrative cases to hand at the moment. However, I shall
attempt to reproduce the scenario to identify which specific threads are
eluding our current isolation boundaries.

I certainly concur with your final observation regarding userspace
interference; there is, indeed, little to be done beyond enforcing strict
partitioning via cpusets to ensure only latency-sensitive tasks are
permitted to migrate to isolated CPUs.

That being said, the suggestion made by Peter - namely, to prevent the
enqueue on the isolated CPU - is a particularly compelling one. Please see
[1].

[1]: https://lore.kernel.org/lkml/zmjr43kk2m52huk2vvetvwefil7waletzuijiu5y34v3n4slgi@3wdtd3xckx7m/

Kind regards,
--
Aaron Tomlin
On Tue, Jan 06, 2026 at 02:37:49PM +0530, Shrikanth Hegde wrote:
> If all your SCHED_FIFO is pinned and their scheduling decisions
> are managed in userspace, using isolcpus would offer you better
> isolations compared to nohz_full.

Hi Shrikanth,

You are entirely correct; isolcpus=domain (or isolcpus= without flags as
per housekeeping_isolcpus_setup()) indeed offers superior isolation by
removing the CPU from the scheduler load-balancing domains.

I must apologise for the omission in my previous correspondence. I
neglected to mention that our specific configuration utilises isolcpus= in
conjunction with nohz_full=.

> > However, the extant "Fair Server" (Deadline Server) architecture
> > compromises this isolation guarantee. At present, should a background
> > SCHED_OTHER task be enqueued, the scheduler initiates the Fair Server
> > (dl_server_start). As the Fair Server functions as a SCHED_DEADLINE entity,
> > its activation increments rq->dl.dl_nr_running.
>
> There is runtime allocated to fair server. If you make them 0 on CPUs of
> interest, wouldn't that work?
>
> /sys/kernel/debug/sched/fair_server/<cpu>/runtime

Yes, you are quite right; setting the fair server runtime to 0 (via
/sys/kernel/debug/sched/fair_server/[cpu]/runtime) does indeed achieve the
desired effect. In my testing, the SCHED_FIFO task on the fully
adaptive-tick CPU remains uninterrupted by the restored clock-tick when
this configuration is applied. Thank you.

However, I believe it would be beneficial if this scheduling feature were
available as an automatic kernel detection mechanism. While the manual
runtime adjustment works, having the kernel automatically detect the
condition - where an RT task is running and bandwidth enforcement is
disabled - would provide a more seamless and robust solution for
partitioned systems without requiring external intervention.

I may consider an improved version of the patch that includes a "Fair
server disabled" warning much like in sched_fair_server_write().

Kind regards,
--
Aaron Tomlin
Hello!

On 06/01/26 09:49, Aaron Tomlin wrote:
> Yes, you are quite right; setting the fair server runtime to 0 (via
> /sys/kernel/debug/sched/fair_server/[cpu]/runtime) does indeed achieve the
> desired effect. In my testing, the SCHED_FIFO task on the fully
> adaptive-tick CPU remains uninterrupted by the restored clock-tick when
> this configuration is applied. Thank you.
>
> However, I believe it would be beneficial if this scheduling feature were
> available as an automatic kernel detection mechanism. While the manual
> runtime adjustment works, having the kernel automatically detect the
> condition - where an RT task is running and bandwidth enforcement is
> disabled - would provide a more seamless and robust solution for
> partitioned systems without requiring external intervention.
> I may consider an improved version of the patch that includes a "Fair
> server disabled" warning much like in sched_fair_server_write().

I am not sure either we need/want the automatic mechanism, as we already
have the fair_server interface. I kind of think that if any (kthread
included) CFS task is enqueued on an "isolated" CPU the problem might
reside in sub-optimal isolation (usually a config issue or a kernel issue
that might need solving - e.g. a for_each_cpu loop that needs changing).
Starving such tasks might anyway end in a system crash of sort.

Thanks,
Juri
On Wed, Jan 07, 2026 at 10:48:12AM +0100, Juri Lelli wrote:
> I am not sure either we need/want the automatic mechanism, as we already
> have the fair_server interface. I kind of think that if any (kthread
> included) CFS task is enqueued on an "isolated" CPU the problem might
> reside in sub-optimal isolation (usually a config issue or a kernel issue
> that might need solving - e.g. a for_each_cpu loop that needs changing).
> Starving such tasks might anyway end in a system crash of sort.

We must not starve fair tasks -- this can severely affect the system
health.

Specifically per-cpu kthreads getting starved can cause complete system
lockup when other CPUs go wait for completion and such.

We must not disable the fair server, ever. Doing so means you get to
keep the pieces.

The only sane way is to ensure these tasks do not get queued in the
first place.
On Wed, Jan 07, 2026 at 11:26:59AM +0100, Peter Zijlstra wrote:
> We must not starve fair tasks -- this can severely affect the system
> health.
>
> Specifically per-cpu kthreads getting starved can cause complete system
> lockup when other CPUs go wait for completion and such.
>
> We must not disable the fair server, ever. Doing do means you get to
> keep the pieces.
>
> The only sane way is to ensure these tasks do not get queued in the
> first place.
Hi Peter,
To your point, in an effort to steer CFS (SCHED_NORMAL) tasks away from
isolated, RT-busy CPUs, I would be interested in your thoughts on the
following approach. By redirecting these "leaked" CFS tasks to housekeeping
CPUs prior to enqueueing, we ensure that rq->cfs.h_nr_queued remains at
zero on the isolated core. This prevents the activation of the Fair Server
and preserves the silence of the adaptive-tick mode.
While a race condition exists - specifically, an RT task could wake up on the
target CPU after our check returns false - this is likely acceptable. Should
an RT task wake up later, it will preempt the CFS task regardless;
consequently, the next time the CFS task sleeps and wakes, the logic will
intercept and redirect it, I think.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..3db7a590a24d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8526,6 +8526,32 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	/* SD_flags and WF_flags share the first nibble */
 	int sd_flag = wake_flags & 0xF;
 
+	/*
+	 * When RT_SUPPRESS_FAIR_SERVER is enabled, we proactively steer CFS tasks
+	 * away from isolated CPUs that are currently executing Real-Time tasks.
+	 *
+	 * Enqueuing a CFS task on such a CPU would trigger dl_server_start(),
+	 * which in turn restarts the tick to enforce bandwidth control. By
+	 * redirecting the task to a housekeeping CPU during the selection
+	 * phase, we preserve strict isolation and silence on the target CPU.
+	 */
+#if defined(CONFIG_NO_HZ_FULL)
+	if (sched_feat(RT_SUPPRESS_FAIR_SERVER) && !rt_bandwidth_enabled() &&
+	    housekeeping_enabled(HK_TYPE_KERNEL_NOISE)) {
+		struct rq *target_rq = cpu_rq(prev_cpu);
+		/*
+		 * Use READ_ONCE() to safely load the remote CPU's current task
+		 * pointer without holding the rq lock.
+		 */
+		struct task_struct *curr = READ_ONCE(target_rq->curr);
+
+		/* If the target CPU is isolated and busy with RT, redirect */
+		if (rt_task(curr) &&
+		    !housekeeping_test_cpu(prev_cpu, HK_TYPE_KERNEL_NOISE)) {
+			return housekeeping_any_cpu(HK_TYPE_KERNEL_NOISE);
+		}
+	}
+#endif
 	/*
 	 * required for stable ->cpus_allowed
 	 */
--
Aaron Tomlin
On Wed, Jan 07, 2026 at 11:26:59AM +0100, Peter Zijlstra wrote:
> We must not starve fair tasks -- this can severely affect the system
> health.
>
> Specifically per-cpu kthreads getting starved can cause complete system
> lockup when other CPUs go wait for completion and such.
>
> We must not disable the fair server, ever. Doing so means you get to
> keep the pieces.
>
> The only sane way is to ensure these tasks do not get queued in the
> first place.

Hi Shrikanth, Valentin, Juri, Daniel, Peter,

I fully appreciate your concerns regarding system health and the critical
nature of per-CPU kthreads. I agree that under standard operation,
disabling the Fair Server presents a significant risk of system lockup.

Your suggestion to ensure such tasks are prevented from being queued in the
first instance is an interesting proposition and certainly merits further
consideration - I will look into it.

However, I would respectfully submit that the kernel currently affords
users the capability to set the fair server runtime to zero on a per-CPU
basis via /sys/kernel/debug/sched/fair_server/. This establishes a
precedent wherein the user is permitted to assume full responsibility for
the scheduler's behaviour on specific cores. If my understanding is
correct, once a user manually sets that runtime to zero, CPU-specific
kthreads operating as SCHED_NORMAL already lose the Fair Server's
protection while a real-time task is executing. The risk you describe is,
therefore, already present for those who utilise the debug interface, I
think.

The rationale behind introducing RT_SUPPRESS_FAIR_SERVER is to formalise
this behaviour for a specific, highly educated class of user (e.g. HFT or
HPC operators) who explicitly prioritise absolute determinism over general
system stability for a period of time - we still maintain the ability to
terminate/interrupt a real-time task via a signal (e.g. SIGINT). As this
scheduling feature is disabled by default, the user must actively opt in,
thereby signalling their willingness to "sacrifice" safety guarantees and
accept the potential consequences - or "keep the pieces," as it were.

I believe this approach provides a necessary tool for extremely
latency-sensitive partitions without compromising the safety of the
general-purpose kernel.

Kind regards,
--
Aaron Tomlin