[RFC PATCH 0/1] sched/fair: Feature to suppress Fair Server for NOHZ_FULL isolation

Posted by Aaron Tomlin 1 month ago
Hi Ingo, Peter, Juri, Vincent,

This patch introduces a new scheduler feature, RT_SUPPRESS_FAIR_SERVER,
designed to ensure strict NOHZ_FULL isolation for SCHED_FIFO workloads,
particularly in the presence of resident CFS tasks.

In strictly partitioned, latency-critical environments (such as High
Frequency Trading platforms), administrators frequently employ fully
adaptive-tick CPUs to execute pinned SCHED_FIFO workloads. The fundamental
requirement is "zero OS noise"; specifically, the scheduler clock-tick must
remain suppressed ("offloaded"), given that standard SCHED_FIFO semantics
dictate no forced preemption between tasks of identical priority.

However, the extant "Fair Server" (Deadline Server) architecture
compromises this isolation guarantee. At present, should a background
SCHED_OTHER task be enqueued, the scheduler initiates the Fair Server
(dl_server_start). As the Fair Server functions as a SCHED_DEADLINE entity,
its activation increments rq->dl.dl_nr_running.

This condition compels sched_can_stop_tick() to return false, thereby
restarting the periodic tick to enforce the server's runtime.
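
For reference, the check in question lives in sched_can_stop_tick() in
kernel/sched/core.c; paraphrased (exact details vary by kernel version),
the relevant excerpt is:

bool sched_can_stop_tick(struct rq *rq)
{
	/*
	 * Deadline tasks, even a single one, need the tick. The Fair
	 * Server is a SCHED_DEADLINE entity, so once dl_server_start()
	 * has run, this test keeps the periodic tick alive.
	 */
	if (rq->dl.dl_nr_running)
		return false;

	/* ... more than one SCHED_RR task also needs the tick ... */

	/*
	 * FIFO tasks alone can do without the tick: there is no forced
	 * preemption between tasks of equal priority.
	 */
	if (rq->rt.rt_nr_running - rq->rt.rr_nr_running)
		return true;

	/* ... */
}
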
To address this, the patch introduces a new scheduler feature control,
RT_SUPPRESS_FAIR_SERVER.

When engaged, this modification amends enqueue_task_fair() to forego the
invocation of dl_server_start() if, and only if, the following conditions
are met:

	1. A Real-Time task (SCHED_FIFO/SCHED_RR) is currently in execution
	2. RT bandwidth enforcement (rt_bandwidth_enabled()) is inactive

By precluding the server's initiation, rq->dl.dl_nr_running is maintained
at zero. This permits the tick logic to defer to the standard SCHED_FIFO
protocol, thereby ensuring the tick remains suppressed.
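
In sketch form, the guarded call site in enqueue_task_fair() would look
roughly like the following (illustrative only; the actual hunk in the
patch may differ in detail):

	/* The CFS runqueue just went from empty to non-empty. */
	if (!rq_h_nr_queued && rq->cfs.h_nr_queued) {
		bool start_server = true;

#ifdef CONFIG_NO_HZ_FULL
		/*
		 * Leave the Fair Server off while an RT task owns the
		 * CPU and RT bandwidth enforcement is disabled, so that
		 * rq->dl.dl_nr_running stays zero and
		 * sched_can_stop_tick() can keep the tick stopped.
		 */
		if (sched_feat(RT_SUPPRESS_FAIR_SERVER) &&
		    rt_task(rq->curr) && !rt_bandwidth_enabled())
			start_server = false;
#endif
		if (start_server)
			dl_server_start(&rq->fair_server);
	}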

Considerations: This serves as a precision instrument for specialised
contexts. It explicitly prioritises determinism over fairness. Whilst
enabled, queued CFS tasks shall endure total starvation until such time as
the RT task voluntarily yields. I believe this is acceptable for
partitioned architectures where housekeeping duties are allocated to
alternative cores; however, I have guarded this capability within
CONFIG_NO_HZ_FULL and a default-disabled feature flag to obviate the risk
of inadvertent starvation on general-purpose systems.
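
The feature flag itself would be declared along these lines in
kernel/sched/features.h (again a sketch; the exact wording may differ):

#ifdef CONFIG_NO_HZ_FULL
/*
 * Do not start the fair deadline server while an RT task is running and
 * RT bandwidth enforcement is disabled. Default off: enabling this can
 * starve CFS tasks on the affected CPU.
 */
SCHED_FEAT(RT_SUPPRESS_FAIR_SERVER, false)
#endif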

I welcome your thoughts on this approach.


Aaron Tomlin (1):
  sched/fair: Introduce RT_SUPPRESS_FAIR_SERVER to optimise NOHZ_FULL
    isolation

 kernel/sched/fair.c     | 19 ++++++++++++++++++-
 kernel/sched/features.h |  9 +++++++++
 2 files changed, 27 insertions(+), 1 deletion(-)

-- 
2.51.0
Re: [RFC PATCH 0/1] sched/fair: Feature to suppress Fair Server for NOHZ_FULL isolation
Posted by Shrikanth Hegde 1 month ago

On 1/6/26 9:12 AM, Aaron Tomlin wrote:
> Hi Ingo, Peter, Juri, Vincent,
> 
> This patch introduces a new scheduler feature, RT_SUPPRESS_FAIR_SERVER,
> designed to ensure strict NOHZ_FULL isolation for SCHED_FIFO workloads,
> particularly in the presence of resident CFS tasks.
> 
> In strictly partitioned, latency-critical environments (such as High
> Frequency Trading platforms) administrators frequently employ fully
> adaptive-tick CPUs to execute pinned SCHED_FIFO workloads. The fundamental
> requirement is "zero OS noise"; specifically, the scheduler clock-tick must
> remain suppressed ("offloaded"), given that standard SCHED_FIFO semantics
> dictate no forced preemption between tasks of identical priority.

If all your SCHED_FIFO is pinned and their scheduling decisions
are managed in userspace, using isolcpus would offer you better
isolations compared to nohz_full.

> 
> However, the extant "Fair Server" (Deadline Server) architecture
> compromises this isolation guarantee. At present, should a background
> SCHED_OTHER task be enqueued, the scheduler initiates the Fair Server
> (dl_server_start). As the Fair Server functions as a SCHED_DEADLINE entity,
> its activation increments rq->dl.dl_nr_running.
> 

There is runtime allocated to fair server. If you make them 0 on CPUs of
interest, wouldn't that work?

/sys/kernel/debug/sched/fair_server/<cpu>/runtime

> This condition compels sched_can_stop_tick() to return false, thereby
> restarting the periodic tick to enforce the server's runtime.
> To address this, the patch introduces a new scheduler feature control,
> RT_SUPPRESS_FAIR_SERVER.
> 
> When engaged, this modification amends enqueue_task_fair() to forego the
> invocation of dl_server_start() if, and only if, the following conditions
> are met:
> 
> 	1. A Real-Time task (SCHED_FIFO/SCHED_RR) is currently in execution
> 	2. RT bandwidth enforcement (rt_bandwidth_enabled()) is inactive
> 
> By precluding the server's initiation, rq->dl.dl_nr_running is maintained
> at zero. This permits the tick logic to defer to the standard SCHED_FIFO
> protocol, thereby ensuring the tick remains suppressed.
> 
> Considerations: This serves as a precision instrument for specialised
> contexts. It explicitly prioritises determinism over fairness. Whilst
> enabled, queued CFS tasks shall endure total starvation until such time as
> the RT task voluntarily yields. I believe this is acceptable for
> partitioned architectures where housekeeping duties are allocated to
> alternative cores; however, I have guarded this capability within
> CONFIG_NO_HZ_FULL and a default-disabled feature flag to obviate the risk
> of inadvertent starvation on general-purpose systems.
> 
> I welcome your thoughts on this approach.
> 
> 
> Aaron Tomlin (1):
>    sched/fair: Introduce RT_SUPPRESS_FAIR_SERVER to optimise NOHZ_FULL
>      isolation
> 
>   kernel/sched/fair.c     | 19 ++++++++++++++++++-
>   kernel/sched/features.h |  9 +++++++++
>   2 files changed, 27 insertions(+), 1 deletion(-)
>
Re: [RFC PATCH 0/1] sched/fair: Feature to suppress Fair Server for NOHZ_FULL isolation
Posted by Valentin Schneider 1 month ago
On 06/01/26 14:37, Shrikanth Hegde wrote:
> On 1/6/26 9:12 AM, Aaron Tomlin wrote:
>> Hi Ingo, Peter, Juri, Vincent,
>>
>> This patch introduces a new scheduler feature, RT_SUPPRESS_FAIR_SERVER,
>> designed to ensure strict NOHZ_FULL isolation for SCHED_FIFO workloads,
>> particularly in the presence of resident CFS tasks.
>>
>> In strictly partitioned, latency-critical environments (such as High
>> Frequency Trading platforms) administrators frequently employ fully
>> adaptive-tick CPUs to execute pinned SCHED_FIFO workloads. The fundamental
>> requirement is "zero OS noise"; specifically, the scheduler clock-tick must
>> remain suppressed ("offloaded"), given that standard SCHED_FIFO semantics
>> dictate no forced preemption between tasks of identical priority.
>
> If all your SCHED_FIFO is pinned and their scheduling decisions
> are managed in userspace, using isolcpus would offer you better
> isolations compared to nohz_full.
>

Right, that's the part I don't get; why not use CPU isolation / cpusets to
isolate the CPUs running those NOHZ_FULL applications? Regardless of the
deadline server, if CFS tasks get scheduled on the same CPU as your
latency-sensitive tasks then something's not right.
Re: [RFC PATCH 0/1] sched/fair: Feature to suppress Fair Server for NOHZ_FULL isolation
Posted by Daniel Vacek 1 month ago
On Tue, 6 Jan 2026 at 16:38, Valentin Schneider <vschneid@redhat.com> wrote:
> On 06/01/26 14:37, Shrikanth Hegde wrote:
> > On 1/6/26 9:12 AM, Aaron Tomlin wrote:
> >> Hi Ingo, Peter, Juri, Vincent,
> >>
> >> This patch introduces a new scheduler feature, RT_SUPPRESS_FAIR_SERVER,
> >> designed to ensure strict NOHZ_FULL isolation for SCHED_FIFO workloads,
> >> particularly in the presence of resident CFS tasks.
> >>
> >> In strictly partitioned, latency-critical environments (such as High
> >> Frequency Trading platforms) administrators frequently employ fully
> >> adaptive-tick CPUs to execute pinned SCHED_FIFO workloads. The fundamental
> >> requirement is "zero OS noise"; specifically, the scheduler clock-tick must
> >> remain suppressed ("offloaded"), given that standard SCHED_FIFO semantics
> >> dictate no forced preemption between tasks of identical priority.
> >
> > If all your SCHED_FIFO is pinned and their scheduling decisions
> > are managed in userspace, using isolcpus would offer you better
> > isolations compared to nohz_full.
> >
>
> Right, that's the part I don't get; why not use CPU isolation / cpusets to
> isolate the CPUs running those NOHZ_FULL applications? Regardless of the
> deadline server, if CFS tasks get scheduled on the same CPU as your
> latency-sensitive tasks then something's not right.

Some kernel workers and threaded interrupt handlers can be local/pinned, right?

For example this is usually (was often?) visible with DPDK
applications like FlexRAN/OpenRAN, etc.
And Aaron has mentioned high speed trading before.
Re: [RFC PATCH 0/1] sched/fair: Feature to suppress Fair Server for NOHZ_FULL isolation
Posted by Aaron Tomlin 1 month ago
On Tue, Jan 06, 2026 at 05:38:17PM +0100, Daniel Vacek wrote:
> On Tue, 6 Jan 2026 at 16:38, Valentin Schneider <vschneid@redhat.com> wrote:
> > On 06/01/26 14:37, Shrikanth Hegde wrote:
> > > On 1/6/26 9:12 AM, Aaron Tomlin wrote:
> > >> Hi Ingo, Peter, Juri, Vincent,
> > >>
> > >> This patch introduces a new scheduler feature, RT_SUPPRESS_FAIR_SERVER,
> > >> designed to ensure strict NOHZ_FULL isolation for SCHED_FIFO workloads,
> > >> particularly in the presence of resident CFS tasks.
> > >>
> > >> In strictly partitioned, latency-critical environments (such as High
> > >> Frequency Trading platforms) administrators frequently employ fully
> > >> adaptive-tick CPUs to execute pinned SCHED_FIFO workloads. The fundamental
> > >> requirement is "zero OS noise"; specifically, the scheduler clock-tick must
> > >> remain suppressed ("offloaded"), given that standard SCHED_FIFO semantics
> > >> dictate no forced preemption between tasks of identical priority.
> > >
> > > If all your SCHED_FIFO is pinned and their scheduling decisions
> > > are managed in userspace, using isolcpus would offer you better
> > > isolations compared to nohz_full.
> > >
> >
> > Right, that's the part I don't get; why not use CPU isolation / cpusets to
> > isolate the CPUs running those NOHZ_FULL applications? Regardless of the
> > deadline server, if CFS tasks get scheduled on the same CPU as your
> > latency-sensitive tasks then something's not right.
> 
> Some kernel workers and threaded interrupt handlers can be local/pinned, right?
> 
> For example this is usually (was often?) visible with DPDK
> applications like FlexRAN/OpenRAN, etc.
> And Aaron has mentioned high speed trading before.

Hi Valentin, Daniel,

I must offer my apologies for the confusion; I neglected to mention in the
cover letter that isolcpus=domain is indeed deployed in this environment.

Consequently, standard load-balancing is effectively disabled. You are
quite right that standard CFS tasks should not appear on these cores; any
SCHED_NORMAL entities that do appear are not the result of leakage or
misconfiguration, but are rather unavoidable CPU-specific kthreads or
explicit migrations initiated by user-space.


Kind regards,
-- 
Aaron Tomlin
Re: [RFC PATCH 0/1] sched/fair: Feature to suppress Fair Server for NOHZ_FULL isolation
Posted by Valentin Schneider 4 weeks, 1 day ago
On 07/01/26 15:25, Aaron Tomlin wrote:
> On Tue, Jan 06, 2026 at 05:38:17PM +0100, Daniel Vacek wrote:
>> On Tue, 6 Jan 2026 at 16:38, Valentin Schneider <vschneid@redhat.com> wrote:
>> >
>> > Right, that's the part I don't get; why not use CPU isolation / cpusets to
>> > isolate the CPUs running those NOHZ_FULL applications? Regardless of the
>> > deadline server, if CFS tasks get scheduled on the same CPU as your
>> > latency-sensitive tasks then something's not right.
>>
>> Some kernel workers and threaded interrupt handlers can be local/pinned, right?
>>
>> For example this is usually (was often?) visible with DPDK
>> applications like FlexRAN/OpenRAN, etc.
>> And Aaron has mentioned high speed trading before.
>
> Hi Valentin, Daniel,
>
> I must offer my apologies for the confusion; I neglected to mention in the
> cover letter that isolcpus=domain is indeed deployed in this environment.
>
> Consequently, standard load-balancing is effectively disabled. You are
> quite right that standard CFS tasks should not appear on these cores; any
> SCHED_NORMAL entities that do appear are not the result of leakage or
> misconfiguration, but are rather unavoidable CPU-specific kthreads or
> explicit migrations initiated by user-space.
>

Gotcha. Do you have any specific examples for these per-CPU kthreads? We
should have features to prevent most of these (e.g. workqueue cpumasks),
and if not then that's something we could look into.

As for userspace messing things up... Well, not much we can do here, other
than preventing that via e.g. cpusets so only your latency-sensitive tasks
are allowed to be migrated on the isolated CPUs.

>
> Kind regards,
> --
> Aaron Tomlin
Re: [RFC PATCH 0/1] sched/fair: Feature to suppress Fair Server for NOHZ_FULL isolation
Posted by Aaron Tomlin 3 weeks, 4 days ago
On Fri, Jan 09, 2026 at 10:21:07AM +0100, Valentin Schneider wrote:
> Gotcha. Do you have any specific examples for these per-CPU kthreads? We
> should have features to prevent most of these (e.g. workqueue cpumasks),
> and if not then that's something we could look into.
> 
> As for userspace messing things up... Well, not much we can do here, other
> than preventing that via e.g. cpusets so only your latency-sensitive tasks
> are allowed to be migrated on the isolated CPUs.

Hi Valentin,

To your point regarding specific examples of these per-CPU kthreads, I do
not have any illustrative cases to hand at the moment. However, I shall
attempt to reproduce the scenario to identify which specific threads are
eluding our current isolation boundaries.

I certainly concur with your final observation regarding userspace
interference; there is, indeed, little to be done beyond enforcing strict
partitioning via cpusets to ensure only latency-sensitive tasks are
permitted to migrate to isolated CPUs.

That being said, the suggestion made by Peter - namely, to prevent the
enqueue on the isolated CPU in the first place - is a particularly
compelling one.

Please see [1].


[1]: https://lore.kernel.org/lkml/zmjr43kk2m52huk2vvetvwefil7waletzuijiu5y34v3n4slgi@3wdtd3xckx7m/

Kind regards,
-- 
Aaron Tomlin
Re: [RFC PATCH 0/1] sched/fair: Feature to suppress Fair Server for NOHZ_FULL isolation
Posted by Aaron Tomlin 1 month ago
On Tue, Jan 06, 2026 at 02:37:49PM +0530, Shrikanth Hegde wrote:
> If all your SCHED_FIFO is pinned and their scheduling decisions
> are managed in userspace, using isolcpus would offer you better
> isolations compared to nohz_full.

Hi Shrikanth,

You are entirely correct; isolcpus=domain (or isolcpus= without flags as
per housekeeping_isolcpus_setup()) indeed offers superior isolation by
removing the CPU from the scheduler load-balancing domains.

I must apologise for the omission in my previous correspondence. I
neglected to mention that our specific configuration utilises isolcpus= in
conjunction with nohz_full=.

> > However, the extant "Fair Server" (Deadline Server) architecture
> > compromises this isolation guarantee. At present, should a background
> > SCHED_OTHER task be enqueued, the scheduler initiates the Fair Server
> > (dl_server_start). As the Fair Server functions as a SCHED_DEADLINE entity,
> > its activation increments rq->dl.dl_nr_running.
> > 
> 
> There is runtime allocated to fair server. If you make them 0 on CPUs of
> interest, wouldn't that work?
> 
> /sys/kernel/debug/sched/fair_server/<cpu>/runtime

Yes, you are quite right; setting the fair server runtime to 0 (via
/sys/kernel/debug/sched/fair_server/[cpu]/runtime) does indeed achieve the
desired effect. In my testing, the SCHED_FIFO task on the fully
adaptive-tick CPU remains uninterrupted by the restored clock-tick when
this configuration is applied. Thank you.

However, I believe it would be beneficial if this scheduling feature were
available as an automatic kernel detection mechanism. While the manual
runtime adjustment works, having the kernel automatically detect the
condition - where an RT task is running and bandwidth enforcement is
disabled - would provide a more seamless and robust solution for
partitioned systems without requiring external intervention.
I may consider an improved version of the patch that includes a "Fair
server disabled" warning much like in sched_fair_server_write().


Kind regards,
-- 
Aaron Tomlin
Re: [RFC PATCH 0/1] sched/fair: Feature to suppress Fair Server for NOHZ_FULL isolation
Posted by Juri Lelli 1 month ago
Hello!

On 06/01/26 09:49, Aaron Tomlin wrote:
> On Tue, Jan 06, 2026 at 02:37:49PM +0530, Shrikanth Hegde wrote:
> > If all your SCHED_FIFO is pinned and their scheduling decisions
> > are managed in userspace, using isolcpus would offer you better
> > isolations compared to nohz_full.
> 
> Hi Shrikanth,
> 
> You are entirely correct; isolcpus=domain (or isolcpus= without flags as
> per housekeeping_isolcpus_setup()) indeed offers superior isolation by
> removing the CPU from the scheduler load-balancing domains.
> 
> I must apologise for the omission in my previous correspondence. I
> neglected to mention that our specific configuration utilises isolcpus= in
> conjunction with nohz_full=.
> 
> > > However, the extant "Fair Server" (Deadline Server) architecture
> > > compromises this isolation guarantee. At present, should a background
> > > SCHED_OTHER task be enqueued, the scheduler initiates the Fair Server
> > > (dl_server_start). As the Fair Server functions as a SCHED_DEADLINE entity,
> > > its activation increments rq->dl.dl_nr_running.
> > > 
> > 
> > There is runtime allocated to fair server. If you make them 0 on CPUs of
> > interest, wouldn't that work?
> > 
> > /sys/kernel/debug/sched/fair_server/<cpu>/runtime
> 
> Yes, you are quite right; setting the fair server runtime to 0 (via
> /sys/kernel/debug/sched/fair_server/[cpu]/runtime) does indeed achieve the
> desired effect. In my testing, the SCHED_FIFO task on the fully
> adaptive-tick CPU remains uninterrupted by the restored clock-tick when
> this configuration is applied. Thank you.
> 
> However, I believe it would be beneficial if this scheduling feature were
> available as an automatic kernel detection mechanism. While the manual
> runtime adjustment works, having the kernel automatically detect the
> condition - where an RT task is running and bandwidth enforcement is
> disabled - would provide a more seamless and robust solution for
> partitioned systems without requiring external intervention.
> I may consider an improved version of the patch that includes a "Fair
> server disabled" warning much like in sched_fair_server_write().

I am not sure either we need/want the automatic mechanism, as we already
have the fair_server interface. I kind of think that if any (kthread
included) CFS task is enqueued on an "isolated" CPU the problem might
reside in sub-optimal isolation (usually a config issue or a kernel
issue that might need solving - e.g. a for_each_cpu loop that needs
changing). Starving such tasks might anyway end in a system crash of
sort.

Thanks,
Juri
Re: [RFC PATCH 0/1] sched/fair: Feature to suppress Fair Server for NOHZ_FULL isolation
Posted by Peter Zijlstra 1 month ago
On Wed, Jan 07, 2026 at 10:48:12AM +0100, Juri Lelli wrote:
> Hello!
> 
> On 06/01/26 09:49, Aaron Tomlin wrote:
> > On Tue, Jan 06, 2026 at 02:37:49PM +0530, Shrikanth Hegde wrote:
> > > If all your SCHED_FIFO is pinned and their scheduling decisions
> > > are managed in userspace, using isolcpus would offer you better
> > > isolations compared to nohz_full.
> > 
> > Hi Shrikanth,
> > 
> > You are entirely correct; isolcpus=domain (or isolcpus= without flags as
> > per housekeeping_isolcpus_setup()) indeed offers superior isolation by
> > removing the CPU from the scheduler load-balancing domains.
> > 
> > I must apologise for the omission in my previous correspondence. I
> > neglected to mention that our specific configuration utilises isolcpus= in
> > conjunction with nohz_full=.
> > 
> > > > However, the extant "Fair Server" (Deadline Server) architecture
> > > > compromises this isolation guarantee. At present, should a background
> > > > SCHED_OTHER task be enqueued, the scheduler initiates the Fair Server
> > > > (dl_server_start). As the Fair Server functions as a SCHED_DEADLINE entity,
> > > > its activation increments rq->dl.dl_nr_running.
> > > > 
> > > 
> > > There is runtime allocated to fair server. If you make them 0 on CPUs of
> > > interest, wouldn't that work?
> > > 
> > > /sys/kernel/debug/sched/fair_server/<cpu>/runtime
> > 
> > Yes, you are quite right; setting the fair server runtime to 0 (via
> > /sys/kernel/debug/sched/fair_server/[cpu]/runtime) does indeed achieve the
> > desired effect. In my testing, the SCHED_FIFO task on the fully
> > adaptive-tick CPU remains uninterrupted by the restored clock-tick when
> > this configuration is applied. Thank you.
> > 
> > However, I believe it would be beneficial if this scheduling feature were
> > available as an automatic kernel detection mechanism. While the manual
> > runtime adjustment works, having the kernel automatically detect the
> > condition - where an RT task is running and bandwidth enforcement is
> > disabled - would provide a more seamless and robust solution for
> > partitioned systems without requiring external intervention.
> > I may consider an improved version of the patch that includes a "Fair
> > server disabled" warning much like in sched_fair_server_write().
> 
> I am not sure either we need/want the automatic mechanism, as we already
> have the fair_server interface. I kind of think that if any (kthread
> included) CFS task is enqueued on an "isolated" CPU the problem might
> reside in sub-optimal isolation (usually a config issue or a kernel
> issue that might need solving - e.g. a for_each_cpu loop that needs
> changing). Starving such tasks might anyway end in a system crash of
> sort.

We must not starve fair tasks -- this can severely affect the system
health.

Specifically per-cpu kthreads getting starved can cause complete system
lockup when other CPUs go wait for completion and such.

We must not disable the fair server, ever. Doing so means you get to
keep the pieces.

The only sane way is to ensure these tasks do not get queued in the
first place.
Re: [RFC PATCH 0/1] sched/fair: Feature to suppress Fair Server for NOHZ_FULL isolation
Posted by Aaron Tomlin 3 weeks, 4 days ago
On Wed, Jan 07, 2026 at 11:26:59AM +0100, Peter Zijlstra wrote:
> We must not starve fair tasks -- this can severely affect the system
> health.
> 
> Specifically per-cpu kthreads getting starved can cause complete system
> lockup when other CPUs go wait for completion and such.
> 
> We must not disable the fair server, ever. Doing so means you get to
> keep the pieces.
> 
> The only sane way is to ensure these tasks do not get queued in the
> first place.

Hi Peter,

To your point, in an effort to steer CFS (SCHED_NORMAL) tasks away from
isolated, RT-busy CPUs, I would be interested in your thoughts on the
following approach. By redirecting these "leaked" CFS tasks to housekeeping
CPUs prior to enqueueing, we ensure that rq->cfs.h_nr_queued remains at
zero on the isolated core. This prevents the activation of the Fair Server
and preserves the silence of the adaptive-tick mode.

While a race condition exists - specifically, an RT task could wake up on the
target CPU after our check returns false - this is likely acceptable. Should
an RT task wake up later, it will preempt the CFS task regardless;
consequently, the next time the CFS task sleeps and wakes, the logic will
intercept and redirect it, I think.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da46c3164537..3db7a590a24d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8526,6 +8526,32 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	/* SD_flags and WF_flags share the first nibble */
 	int sd_flag = wake_flags & 0xF;
 
+	/*
+	 * When RT_SUPPRESS_FAIR_SERVER is enabled, we proactively steer CFS tasks
+	 * away from isolated CPUs that are currently executing Real-Time tasks.
+	 *
+	 * Enqueuing a CFS task on such a CPU would trigger dl_server_start(),
+	 * which in turn restarts the tick to enforce bandwidth control. By
+	 * redirecting the task to a housekeeping CPU during the selection
+	 * phase, we preserve strict isolation and silence on the target CPU.
+	 */
+#if defined(CONFIG_NO_HZ_FULL)
+	if (sched_feat(RT_SUPPRESS_FAIR_SERVER) && !rt_bandwidth_enabled()
+			&& housekeeping_enabled(HK_TYPE_KERNEL_NOISE)) {
+		struct rq *target_rq = cpu_rq(prev_cpu);
+		/*
+		 * Use READ_ONCE() to safely load the remote CPU's current task
+		 * pointer without holding the rq lock.
+		 */
+		struct task_struct *curr = READ_ONCE(target_rq->curr);
+
+		/* If the target CPU is isolated and busy with RT, redirect */
+		if (rt_task(curr) &&
+			!housekeeping_test_cpu(prev_cpu, HK_TYPE_KERNEL_NOISE)) {
+			return housekeeping_any_cpu(HK_TYPE_KERNEL_NOISE);
+		}
+	}
+#endif
 	/*
 	 * required for stable ->cpus_allowed
 	 */


-- 
Aaron Tomlin
Re: [RFC PATCH 0/1] sched/fair: Feature to suppress Fair Server for NOHZ_FULL isolation
Posted by Aaron Tomlin 1 month ago
On Wed, Jan 07, 2026 at 11:26:59AM +0100, Peter Zijlstra wrote:
> On Wed, Jan 07, 2026 at 10:48:12AM +0100, Juri Lelli wrote:
> > Hello!
> > 
> > On 06/01/26 09:49, Aaron Tomlin wrote:
> > > On Tue, Jan 06, 2026 at 02:37:49PM +0530, Shrikanth Hegde wrote:
> > > > If all your SCHED_FIFO is pinned and their scheduling decisions
> > > > are managed in userspace, using isolcpus would offer you better
> > > > isolations compared to nohz_full.
> > > 
> > > Hi Shrikanth,
> > > 
> > > You are entirely correct; isolcpus=domain (or isolcpus= without flags as
> > > per housekeeping_isolcpus_setup()) indeed offers superior isolation by
> > > removing the CPU from the scheduler load-balancing domains.
> > > 
> > > I must apologise for the omission in my previous correspondence. I
> > > neglected to mention that our specific configuration utilises isolcpus= in
> > > conjunction with nohz_full=.
> > > 
> > > > > However, the extant "Fair Server" (Deadline Server) architecture
> > > > > compromises this isolation guarantee. At present, should a background
> > > > > SCHED_OTHER task be enqueued, the scheduler initiates the Fair Server
> > > > > (dl_server_start). As the Fair Server functions as a SCHED_DEADLINE entity,
> > > > > its activation increments rq->dl.dl_nr_running.
> > > > > 
> > > > 
> > > > There is runtime allocated to fair server. If you make them 0 on CPUs of
> > > > interest, wouldn't that work?
> > > > 
> > > > /sys/kernel/debug/sched/fair_server/<cpu>/runtime
> > > 
> > > Yes, you are quite right; setting the fair server runtime to 0 (via
> > > /sys/kernel/debug/sched/fair_server/[cpu]/runtime) does indeed achieve the
> > > desired effect. In my testing, the SCHED_FIFO task on the fully
> > > adaptive-tick CPU remains uninterrupted by the restored clock-tick when
> > > this configuration is applied. Thank you.
> > > 
> > > However, I believe it would be beneficial if this scheduling feature were
> > > available as an automatic kernel detection mechanism. While the manual
> > > runtime adjustment works, having the kernel automatically detect the
> > > condition - where an RT task is running and bandwidth enforcement is
> > > disabled - would provide a more seamless and robust solution for
> > > partitioned systems without requiring external intervention.
> > > I may consider an improved version of the patch that includes a "Fair
> > > server disabled" warning much like in sched_fair_server_write().
> > 
> > I am not sure either we need/want the automatic mechanism, as we already
> > have the fair_server interface. I kind of think that if any (kthread
> > included) CFS task is enqueued on an "isolated" CPU the problem might
> > reside in sub-optimal isolation (usually a config issue or a kernel
> > issue that might need solving - e.g. a for_each_cpu loop that needs
> > changing). Starving such tasks might anyway end in a system crash of
> > sort.
> 
> We must not starve fair tasks -- this can severely affect the system
> health.
> 
> Specifically per-cpu kthreads getting starved can cause complete system
> lockup when other CPUs go wait for completion and such.
> 
> We must not disable the fair server, ever. Doing so means you get to
> keep the pieces.
> 
> The only sane way is to ensure these tasks do not get queued in the
> first place.

Hi Shrikanth, Valentin, Juri, Daniel, Peter, 

I fully appreciate your concerns regarding system health and the critical
nature of per-CPU kthreads. I agree that under standard operation,
disabling the Fair Server presents a significant risk of system lockup.
Your suggestion to ensure such tasks are prevented from being queued in the
first instance is an interesting proposition and certainly merits further
consideration - I will look into it.

However, I would respectfully submit that the kernel currently affords
users the capability to manually set the Fair Server runtime to zero on a
per-CPU basis via /sys/kernel/debug/sched/fair_server/. This establishes a
precedent wherein the user is permitted to assume full responsibility for
the scheduler's behaviour on specific cores.

If my understanding is correct, should a user manually set the Fair Server
runtime to zero, CPU-specific kthreads running as SCHED_NORMAL already
lose the protection of the Fair Server mechanism while a real-time task is
executing, and can be starved. The risk you describe is, therefore, already
present for those who utilise the debug interface, I think.

The rationale behind introducing RT_SUPPRESS_FAIR_SERVER is to formalise
this behaviour for a specific, highly educated class of user (e.g., HFT or
HPC operators) who explicitly prioritise absolute determinism over general
system stability for a period of time; we still maintain the ability to
terminate or interrupt a real-time task via a signal (e.g., SIGINT). As this
scheduling feature is disabled by default, the user must actively opt-in,
thereby signalling their willingness to "sacrifice" safety guarantees and
accept the potential consequences - or "keep the pieces," as it were.

I believe this approach provides a necessary tool for extreme
latency-sensitive partitions without compromising the safety of the
general-purpose kernel.


Kind regards,
-- 
Aaron Tomlin