[PATCH v2] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT

Posted by Steven Rostedt 1 week, 2 days ago
From: Steven Rostedt <rostedt@goodmis.org>

RT migration is done aggressively. When a CPU schedules out a high
priority RT task for a lower priority task, it will look to see if there
are any RT tasks waiting to run on another CPU that are of higher
priority than the task this CPU is about to run. If it finds one, it will
pull that task over to the CPU and allow it to run there instead.

Normally, this pulling is done by looking at the RT overloaded mask (rto)
which contains all the CPUs in the scheduler domain with RT tasks that are
waiting to run due to a higher priority RT task currently running on their
CPU. The CPU that is about to schedule a lower priority task will grab the
rq lock of the overloaded CPU and move the RT task from that CPU's runqueue
to the local one and schedule the higher priority RT task.

This caused issues when a lot of CPUs would schedule a lower priority task
at the same time. They would all try to grab the same runqueue lock of
the CPU with the overloaded RT tasks. Only the first CPU that got in will
get that task. All the others would wait until they got the runqueue lock
and see there's nothing to pull and do nothing. On systems with lots of
CPUs, this caused a large latency (up to 500us) which is beyond what
PREEMPT_RT is to allow.

The solution to that was to create an RT_PUSH_IPI logic. When any CPU
wanted to pull a task, instead of grabbing the runqueue lock of the
overloaded CPU, it would start by sending an IPI to the overloaded CPU,
and that IPI handler would have the CPU with the waiting RT task do a push
instead. Then that handler would send an IPI to the next CPU with
overloaded RT tasks, and so on. Note, after the first CPU starts this
process, if another CPU wanted to do a pull, it would see that the process
has already begun and would only increment a counter to have the IPIs
continue again.

The RT_PUSH_IPI solved the latency problem with PREEMPT_RT but could cause
a new issue with non PREEMPT_RT. Namely, softirqs run in a threaded
context on PREEMPT_RT but they can run in an interrupt context in non-RT.

If an IPI lands on a CPU that has just woken up multiple RT tasks and the
current CPU is running a non RT or a low priority RT task, instead of
doing a push, it would simply do a schedule on that CPU. But if a softirq
was also executing on this CPU, the schedule would need to wait until the
softirq finished. Until then, the CPU would still be considered overloaded
as there are RT tasks still waiting to run on it.

A live lock occurred on a workload that was doing heavy networking traffic
on a large machine where the softirqs would run 500us out of 750us. And it
would also be waking up RT tasks, causing the RT pull logic to be
constantly executed.

When a softirq triggered on a CPU with RT tasks queued but not yet
running, the other CPUs would see this CPU as being overloaded and
send an IPI over to it. The CPU would notice that the waiting RT tasks are
of higher priority than the currently running task and simply schedule
that CPU instead. But because the softirq was executing, before it could
schedule, it would receive another IPI to do the same. The number of IPIs
would slow down the currently running softirq so much that before it could
return to task context, it would execute another softirq, never
allowing the CPU to schedule. This live locked that CPU.

As RT_PUSH_IPI was created to help PREEMPT_RT, make it default off if
PREEMPT_RT is not enabled.

Cc: stable@vger.kernel.org
Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
Reported-by: Tejun Heo <tj@kernel.org>
Closes: https://lore.kernel.org/all/20260506235716.2530720-1-tj@kernel.org/
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
Changes since v1: https://patch.msgid.link/20260515103110.51a598dc@gandalf.local.home

- Indent #else and #endif to match #ifdef

 kernel/sched/features.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 84c4fe3abd74..8f0dee8fc475 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -110,8 +110,16 @@ SCHED_FEAT(WARN_DOUBLE_CLOCK, false)
  * rq lock and possibly create a large contention, sending an
  * IPI to that CPU and let that CPU push the RT task to where
  * it should go may be a better scenario.
+ *
+ * This is best for PREEMPT_RT, but for non-RT it can cause issues
+ * when preemption is disabled for long periods of time. Have
+ * it only default enabled for PREEMPT_RT.
  */
+# ifdef CONFIG_PREEMPT_RT
 SCHED_FEAT(RT_PUSH_IPI, true)
+# else
+SCHED_FEAT(RT_PUSH_IPI, false)
+# endif
 #endif
 
 SCHED_FEAT(RT_RUNTIME_SHARE, false)
-- 
2.53.0
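
A minimal userspace sketch of how the conditional default in the hunk above
could resolve at compile time. It is a simplified model, not the kernel's
actual features.h machinery: SCHED_FEATURES(), the FEAT_* enum,
sched_feat_defaults() and sched_feat_enabled() are hypothetical names used
only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Same shape as the patch: the default depends on CONFIG_PREEMPT_RT. */
#ifdef CONFIG_PREEMPT_RT
# define RT_PUSH_IPI_DEFAULT true
#else
# define RT_PUSH_IPI_DEFAULT false
#endif

/* Hypothetical two-entry stand-in for the kernel's feature list. */
#define SCHED_FEATURES(F)			\
	F(RT_PUSH_IPI, RT_PUSH_IPI_DEFAULT)	\
	F(RT_RUNTIME_SHARE, false)

/* One enum value per feature, used as a bit index. */
#define F_ENUM(name, enabled) FEAT_##name,
enum sched_feat { SCHED_FEATURES(F_ENUM) FEAT_NR };

/* OR together the bit of every feature whose default is true. */
#define F_MASK(name, enabled) | ((unsigned long)(enabled) << FEAT_##name)
static unsigned long sched_feat_defaults(void)
{
	return 0UL SCHED_FEATURES(F_MASK);
}

/* Test one feature bit in a mask built by sched_feat_defaults(). */
static bool sched_feat_enabled(unsigned long mask, enum sched_feat f)
{
	assert(f < FEAT_NR);
	return mask & (1UL << f);
}
```

Built without -DCONFIG_PREEMPT_RT, the RT_PUSH_IPI bit is clear in the
default mask; building with that define sets it.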
Re: [PATCH v2] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT
Posted by Valentin Schneider 6 days, 14 hours ago
On 15/05/26 10:37, Steven Rostedt wrote:
> A live lock occurred on a workload that was doing heavy networking traffic
> on a large machine where the softirqs would run 500us out of 750us. And it
> would also be waking up RT tasks, causing the RT pull logic to be
> constantly executed.
>
> When a softirq triggered on a CPU with RT tasks queued but not running
> yet, and the other CPUs would see this CPU as being overloaded, they would
> send an IPI over to it. The CPU would notice that the waiting RT tasks are
> of higher priority than the currently running task and simply schedule
> that CPU instead. But because the softirq was executing, before it could
> schedule, it would receive another IPI to do the same. The amount of IPIs
> would slow down the currently running softirq so much that before it could
> return back to task context, it would execute another softirq never
> allowing the CPU to schedule. This live locked that CPU.
>

I got a bit confused here, please correct me if I didn't get this right:

Per handle_softirqs(), we can't restart the softirq handling loop if
need_resched() is true (which would be the case here, per what we'd have
done in push_rt_task(@pull=true)).
I thought this meant softirqs couldn't be the issue, however this is only
valid within the scope of a single handle_softirqs() invocation.

AIUI here we're being hammered by IRQs, thus under this sort of pattern:

<IRQ>
  __irq_exit_rcu()
    invoke_softirq()
      handle_softirqs()
        // handle NET_RX_SOFTIRQ
        // need_resched() is true so don't loop
</IRQ>

// Barely any progress made here towards actually executing __schedule()

<IRQ>
  __irq_exit_rcu()
    invoke_softirq()
      handle_softirqs()
        // handle NET_RX_SOFTIRQ and wake up some tasks
        // need_resched() is true so don't loop
</IRQ>

& repeat ad nauseam. Did I get this right?
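
A minimal userspace model of the pattern sketched above, assuming the
restart check in handle_softirqs() amounts to "work pending, time budget
left, need_resched() clear, restart budget left". struct cpu_model,
softirq_may_restart() and irq_exit_model() are hypothetical names, and the
time budget is simply treated as never expiring.

```c
#include <assert.h>
#include <stdbool.h>

struct cpu_model {
	bool need_resched;		/* a higher priority task is waiting */
	unsigned int softirq_passes;	/* total softirq passes run so far */
};

/* Loop again only if all four conditions of the assumed check hold. */
static bool softirq_may_restart(bool pending, bool time_left,
				bool need_resched, int restarts_left)
{
	return pending && time_left && !need_resched && restarts_left > 0;
}

/*
 * One IRQ exit: run one pass per pending batch while the restart check
 * allows it. Returns the number of passes run in this invocation.
 */
static unsigned int irq_exit_model(struct cpu_model *cpu,
				   unsigned int pending_batches)
{
	int restarts_left = 10;	/* assumed budget, like MAX_SOFTIRQ_RESTART */
	unsigned int passes = 0;

	assert(cpu);
	if (!pending_batches)
		return 0;

	do {
		passes++;
		pending_batches--;
		restarts_left--;
	} while (softirq_may_restart(pending_batches > 0, true,
				     cpu->need_resched, restarts_left));

	cpu->softirq_passes += passes;
	return passes;
}
```

With need_resched set, each call runs exactly one pass, but nothing stops
the next IRQ exit from running another one, which is the "only valid within
a single invocation" point above.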
Re: [PATCH v2] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT
Posted by Shrikanth Hegde 4 days, 16 hours ago

On 5/18/26 2:17 PM, Valentin Schneider wrote:
> On 15/05/26 10:37, Steven Rostedt wrote:

> I got a bit confused here, please correct me if I didn't get this right:
> 
> Per handle_softirqs(), we can't restart the softirq handling loop if
> need_resched() is true (which would be the case here, per what we'd have
> done it push_rt_task(@pull=true)).
> I thought this meant softirqs couldn't be the issue, however this is only
> valid within the scope of a single handle_softirqs() invocation.
> 
> AIUI here we're being hammered by IRQs, thus under this sort of pattern:
> 
> <IRQ>
>    __irq_exit_rcu()
>      invoke_softirq()
>        handle_softirqs()
>          // handle NET_RX_SOFTIRQ
>          // need_resched() is true so don't loop
> </IRQ>
> 

Wouldn't IRQ exit call schedule if need_resched is set, via irqentry_exit_cond_resched()?

> // Barely any progress made here towards actually executing __schedule()
> 
> <IRQ>
>    __irq_exit_rcu()
>      invoke_softirq()
>        handle_softirqs()
>          // handle NET_RX_SOFTIRQ and wake up some tasks
>          // need_resched() is true so don't loop
> </IRQ>
> 
> & repeat ad nauseam. Did I get this right?
>
Re: [PATCH v2] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT
Posted by Valentin Schneider 3 days, 13 hours ago
On 20/05/26 12:03, Shrikanth Hegde wrote:
> On 5/18/26 2:17 PM, Valentin Schneider wrote:
>> On 15/05/26 10:37, Steven Rostedt wrote:
>
>> I got a bit confused here, please correct me if I didn't get this right:
>>
>> Per handle_softirqs(), we can't restart the softirq handling loop if
>> need_resched() is true (which would be the case here, per what we'd have
>> done it push_rt_task(@pull=true)).
>> I thought this meant softirqs couldn't be the issue, however this is only
>> valid within the scope of a single handle_softirqs() invocation.
>>
>> AIUI here we're being hammered by IRQs, thus under this sort of pattern:
>>
>> <IRQ>
>>    __irq_exit_rcu()
>>      invoke_softirq()
>>        handle_softirqs()
>>          // handle NET_RX_SOFTIRQ
>>          // need_resched() is true so don't loop
>> </IRQ>
>>
>
> Wouldn't IRQ exit call schedule if need_resched is set? via irqentry_exit_cond_resched ?
>

So as pointed out further down the thread it may not because of
PREEMPT_NONE.

Also, schedule is entered with preemption disabled and IRQs enabled, which
I think means that in a stupidly network-busy scenario with !PREEMPT_RT,
IRQ+invoke_softirq() can significantly slow down the top half of
__schedule() itself.
Re: [PATCH v2] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT
Posted by Steven Rostedt 4 days, 6 hours ago
On Wed, 20 May 2026 12:03:37 +0530
Shrikanth Hegde <sshegde@linux.ibm.com> wrote:

> On 5/18/26 2:17 PM, Valentin Schneider wrote:
> > On 15/05/26 10:37, Steven Rostedt wrote:  
> 
> > I got a bit confused here, please correct me if I didn't get this right:
> > 
> > Per handle_softirqs(), we can't restart the softirq handling loop if
> > need_resched() is true (which would be the case here, per what we'd have
> > done it push_rt_task(@pull=true)).
> > I thought this meant softirqs couldn't be the issue, however this is only
> > valid within the scope of a single handle_softirqs() invocation.
> > 
> > AIUI here we're being hammered by IRQs, thus under this sort of pattern:
> > 
> > <IRQ>
> >    __irq_exit_rcu()
> >      invoke_softirq()
> >        handle_softirqs()
> >          // handle NET_RX_SOFTIRQ
> >          // need_resched() is true so don't loop
> > </IRQ>
> >   
> 
> Wouldn't IRQ exit call schedule if need_resched is set? via irqentry_exit_cond_resched ?

I haven't looked at the networking code, but does NAPI just continue
looping while there are packets to process? Or is there a check if
need_resched() is set, that it will exit out early?

Yeah, I'm still wondering why this causes a live lock.

-- Steve


> 
> > // Barely any progress made here towards actually executing __schedule()
> > 
> > <IRQ>
> >    __irq_exit_rcu()
> >      invoke_softirq()
> >        handle_softirqs()
> >          // handle NET_RX_SOFTIRQ and wake up some tasks
> >          // need_resched() is true so don't loop
> > </IRQ>
> > 
> > & repeat ad nauseam. Did I get this right?
> >
Re: [PATCH v2] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT
Posted by Shrikanth Hegde 4 days, 6 hours ago

On 5/20/26 10:22 PM, Steven Rostedt wrote:
> On Wed, 20 May 2026 12:03:37 +0530
> Shrikanth Hegde <sshegde@linux.ibm.com> wrote:
> 
>> On 5/18/26 2:17 PM, Valentin Schneider wrote:
>>> On 15/05/26 10:37, Steven Rostedt wrote:
>>
>>> I got a bit confused here, please correct me if I didn't get this right:
>>>
>>> Per handle_softirqs(), we can't restart the softirq handling loop if
>>> need_resched() is true (which would be the case here, per what we'd have
>>> done it push_rt_task(@pull=true)).
>>> I thought this meant softirqs couldn't be the issue, however this is only
>>> valid within the scope of a single handle_softirqs() invocation.
>>>
>>> AIUI here we're being hammered by IRQs, thus under this sort of pattern:
>>>
>>> <IRQ>
>>>     __irq_exit_rcu()
>>>       invoke_softirq()
>>>         handle_softirqs()
>>>           // handle NET_RX_SOFTIRQ
>>>           // need_resched() is true so don't loop
>>> </IRQ>
>>>    
>>
>> Wouldn't IRQ exit call schedule if need_resched is set? via irqentry_exit_cond_resched ?
> 
> I haven't looked at the networking code, but does NAPI just continue
> looping while there are packets to process? Or is there a check if
> need_resched() is set, that it will exit out early?
> 
> Yeah, I'm still wondering why this causes a live lock.
> 
> -- Steve
> 
> 

By any chance, is it running with preempt=none/voluntary?
If so, it might never call schedule until it goes back to user space.

>>
>>> // Barely any progress made here towards actually executing __schedule()
>>>
>>> <IRQ>
>>>     __irq_exit_rcu()
>>>       invoke_softirq()
>>>         handle_softirqs()
>>>           // handle NET_RX_SOFTIRQ and wake up some tasks
>>>           // need_resched() is true so don't loop
>>> </IRQ>
>>>
>>> & repeat ad nauseam. Did I get this right?
>>>    
>
Re: [PATCH v2] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT
Posted by Tejun Heo 4 days, 5 hours ago
On Wed, May 20, 2026 at 10:34:35PM +0530, Shrikanth Hegde wrote:
> By any chance it is running with preempt=none/voluntary?
> If so it might never call schedule until it goes back to user space.

I think this might be it. In the production dump, the locked up cpus were
often running btrfs compression / decompression. The kernel is PREEMPT_NONE
and while that path has resched_curr()'s, with high enough irq frequency, it
wouldn't be that difficult to catch the cpu enough times before it reaches
the resched point.

Thanks.

-- 
tejun
Re: [PATCH v2] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT
Posted by Steven Rostedt 4 days, 4 hours ago
On Wed, 20 May 2026 08:09:04 -1000
Tejun Heo <tj@kernel.org> wrote:

> On Wed, May 20, 2026 at 10:34:35PM +0530, Shrikanth Hegde wrote:
> > By any chance it is running with preempt=none/voluntary?
> > If so it might never call schedule until it goes back to user space.  
> 
> I think this might be it. In the production dump, the locked up cpus were
> often running btrfs compression / decompression. The kernel is PREEMPT_NONE
> and while that path has resched_curr()'s, with high enough irq frequency, it
> wouldn't be that difficult to catch the cpu enough times before it reaches
> the resched point.

But it should still be making forward progress. The IPI handler is very short.

Are the other CPUs running very short lived RT tasks that constantly
trigger the IPI push logic? I mean it would need to run a lot of RT
tasks that keep going to sleep to trigger an IPI storm.

-- Steve
Re: [PATCH v2] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT
Posted by Tejun Heo 4 days, 3 hours ago
Hello,

On Wed, May 20, 2026 at 03:02:44PM -0400, Steven Rostedt wrote:
> But it should still be making forward progress. The IPI handler is very short.
> 
> Are the other CPUs running very short lived RT tasks that constantly
> trigger the IPI push logic? I mean it would need to run a lot of RT
> tasks that keep going to sleep to cause a IPI storm to trigger.

I don't know for sure. At a lower load level, there seem to be a bit more than
1000 mpi3 irqs. The stalls were happening when load level was pushed up due
to maintenance going through the region. Let's be aggressive and say that
the rate was four times and each IRQ triggers the irq thread to be woken up
and go back to sleep. That'd be a transition out of RT every 250us across
the system. That's a lot but I'm not sure that's enough to stall for tens of
seconds. So, I'm not sure.

Thanks.

-- 
tejun
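
The estimate above can be written out as a small helper. This assumes the
"a bit more than 1000 irqs" figure is a per-second, system-wide rate, which
the 250us result implies but the message does not state. interval_us() is
a hypothetical name.

```c
#include <assert.h>

/* Mean gap between events in microseconds for a given per-second rate. */
static unsigned long interval_us(unsigned long events_per_sec)
{
	assert(events_per_sec > 0);
	return 1000000UL / events_per_sec;
}
```

With the base rate of 1000 and the "four times" multiplier,
interval_us(1000 * 4) gives 250, matching the 250us figure.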
Re: [PATCH v2] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT
Posted by Tejun Heo 1 week, 2 days ago
On Fri, May 15, 2026 at 10:37:40AM -0400, Steven Rostedt wrote:
...
> A live lock occurred on a workload that was doing heavy networking traffic
> on a large machine where the softirqs would run 500us out of 750us. And it
> would also be waking up RT tasks, causing the RT pull logic to be
> constantly executed.

nit: 500/750us is the synthetic repro. The actual case is 30-40% cpu util,
so substantially lower.

> Reported-by: Tejun Heo <tj@kernel.org>
> Closes: https://lore.kernel.org/all/20260506235716.2530720-1-tj@kernel.org/
> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Tested-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun
Re: [PATCH v2] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT
Posted by Steven Rostedt 1 week, 2 days ago
On Fri, 15 May 2026 07:17:44 -1000
Tejun Heo <tj@kernel.org> wrote:

> On Fri, May 15, 2026 at 10:37:40AM -0400, Steven Rostedt wrote:
> ...
> > A live lock occurred on a workload that was doing heavy networking traffic
> > on a large machine where the softirqs would run 500us out of 750us. And it
> > would also be waking up RT tasks, causing the RT pull logic to be
> > constantly executed.  
> 
> nit: 500/750us is the synthetic repro. The actual case is 30-40% cpu util,
> so substantially lower.

Hmm, but I'm guessing that it still was long enough to slow down the
softirq to trigger it again and prevent scheduling?

-- Steve


> 
> > Reported-by: Tejun Heo <tj@kernel.org>
> > Closes: https://lore.kernel.org/all/20260506235716.2530720-1-tj@kernel.org/
> > Signed-off-by: Steven Rostedt <rostedt@goodmis.org>  
> 
> Tested-by: Tejun Heo <tj@kernel.org>
> 
> Thanks.
>
Re: [PATCH v2] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT
Posted by Tejun Heo 1 week, 2 days ago
Hello, Steven.

On Fri, May 15, 2026 at 02:38:47PM -0400, Steven Rostedt wrote:
> Hmm, but I'm guessing that it still was long enough to slow down the
> softirq to trigger it again and prevent scheduling?

Yes, seems that way. I think the load is lower and each softirq invocation
is relatively short (just going through the packets received in the
meantime) but the frequency is higher.

Thanks.

-- 
tejun
Re: [PATCH v2] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT
Posted by Steven Rostedt 1 week, 2 days ago
On Fri, 15 May 2026 08:47:49 -1000
Tejun Heo <tj@kernel.org> wrote:

> Hello, Steven.
> 
> On Fri, May 15, 2026 at 02:38:47PM -0400, Steven Rostedt wrote:
> > Hmm, but I'm guessing that it still was long enough to slow down the
> > softirq to trigger it again and prevent scheduling?  
> 
> Yes, seems that way. I think the load is lower and each softirq invocation
> is relatively short (just going through the packets received in the
> meantime) but the frequency is higher.

I just want to make sure that my analysis is correct. Since only one CPU
can initiate the RT_PUSH_IPI, for this to be a problem, other CPUs
need to be constantly running RT tasks for short periods of time so that
when the RT task schedules off the CPU, the CPU then initiates the "pull".
And it's not that it happens all at once. It's more serialized where the RT
tasks are scheduling off at different times to constantly feed the
RT_PUSH_IPI logic without seeing that it's already in place.

It would require this to happen enough times to keep the overloaded CPU
from finishing the softirq until a time when the softirq is scheduled
again. And it would maintain this abuse for long enough to trigger the
watchdog.

Does that fit the scenario of your environment?

-- Steve
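
A simplified, single-threaded model of the "only one CPU can initiate"
behaviour described above: the first requester starts the IPI chain, later
requesters only bump a counter, and the chain walks the overloaded CPUs
again if the counter moved. struct rto_model, request_push() and
run_chain() are hypothetical names; the real code uses atomics and per-CPU
irq work, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>

struct rto_model {
	bool loop_running;	/* an IPI chain is already in flight */
	unsigned int loop_next;	/* bumped by every CPU that wants a pull */
	unsigned int loop;	/* loop_next value the chain last acted on */
	unsigned int ipis_sent;	/* total IPIs sent by the chain */
};

/* A CPU wants to pull. Returns true if this call started a new chain. */
static bool request_push(struct rto_model *rd)
{
	assert(rd);
	rd->loop_next++;
	if (rd->loop_running)
		return false;	/* chain in flight: the bump is enough */
	rd->loop_running = true;
	return true;
}

/*
 * Walk the chain: one IPI per overloaded CPU, then walk again if new
 * requests arrived meanwhile. late_requests simulates requests that
 * arrive during the first walk. Returns the number of walks.
 */
static unsigned int run_chain(struct rto_model *rd,
			      unsigned int nr_overloaded,
			      unsigned int late_requests)
{
	unsigned int walks = 0;

	assert(rd && rd->loop_running);
	do {
		rd->loop = rd->loop_next;
		rd->ipis_sent += nr_overloaded;
		walks++;
		while (late_requests) {
			request_push(rd);
			late_requests--;
		}
	} while (rd->loop != rd->loop_next);

	rd->loop_running = false;
	return walks;
}
```

In this model, requests that arrive one after another keep extending the
chain, so the overloaded CPU keeps receiving IPIs.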
Re: [PATCH v2] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT
Posted by Tejun Heo 1 week, 2 days ago
Hello,

On Fri, May 15, 2026 at 02:56:08PM -0400, Steven Rostedt wrote:
> I just want to make sure that my analysis is correct. Since only one CPU
> can initiate the RT_PUSH_IPI. That for this to be a problem, other CPUs
> need to be constantly running RT tasks for short periods of time so that
> when the RT task schedules off the CPU, the CPU then initiates the "pull".
> And it's not that it happens all at once. It's more serialized where the RT
> tasks are scheduling off at different times to constantly feed the
> RT_PUSH_IPI logic without seeing that it's already in place.
> 
> It would require this to happen enough times to keep the overloaded CPU
> from finishing the softirq until a time when the softirq is scheduled
> again. And it would maintain this abuse for long enough to trigger the
> watchdog.
> 
> Does that fit the scenario of your environment?

Yeah, I think that's coming from the FIFO threaded irq handling for mpi3mr.
We tried two mitigations: one dropping the irq threads to SCHED_OTHER, the
other setting NO_RT_PUSH_IPI. Both worked. While the former is not
conclusive in itself, it is in line with the theory at least.

Thanks.

-- 
tejun
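
For reference, a hedged sketch of how the NO_RT_PUSH_IPI mitigation
mentioned above could be applied from userspace. It assumes the scheduler
features file is exposed at /sys/kernel/debug/sched/features (older kernels
used /sys/kernel/debug/sched_features) and accepts "NAME" or "NO_NAME"
tokens. sched_feat_token() and sched_feat_write() are hypothetical helpers,
not an existing API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Assumed debugfs location; it depends on the kernel version. */
#define SCHED_FEATURES_PATH "/sys/kernel/debug/sched/features"

/*
 * Build the token: "NAME" to enable, "NO_NAME" to disable.
 * Returns 0 on success, -1 if buf is too small.
 */
static int sched_feat_token(char *buf, size_t len, const char *name,
			    bool enable)
{
	int n;

	assert(buf && name);
	n = snprintf(buf, len, "%s%s", enable ? "" : "NO_", name);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write the token to path. The real file needs root and mounted debugfs. */
static int sched_feat_write(const char *path, const char *name, bool enable)
{
	char token[64];
	FILE *f;
	int ret = 0;

	if (sched_feat_token(token, sizeof(token), name, enable))
		return -1;
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(token, f) == EOF)
		ret = -1;
	if (fclose(f) == EOF)
		ret = -1;
	return ret;
}
```

A call such as sched_feat_write(SCHED_FEATURES_PATH, "RT_PUSH_IPI", false)
would write NO_RT_PUSH_IPI. The setting does not persist across reboots.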
[tip: sched/core] sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT
Posted by tip-bot2 for Steven Rostedt 4 days, 14 hours ago
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     dd29c017aed628076e915fe4cdfb5392fd4c5cab
Gitweb:        https://git.kernel.org/tip/dd29c017aed628076e915fe4cdfb5392fd4c5cab
Author:        Steven Rostedt <rostedt@goodmis.org>
AuthorDate:    Fri, 15 May 2026 10:37:40 -04:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 19 May 2026 12:17:39 +02:00

sched/rt: Have RT_PUSH_IPI be default off for non PREEMPT_RT

RT migration is done aggressively. When a CPU schedules out a high
priority RT task for a lower priority task, it will look to see if there
are any RT tasks waiting to run on another CPU that are of higher
priority than the task this CPU is about to run. If it finds one, it will
pull that task over to the CPU and allow it to run there instead.

Normally, this pulling is done by looking at the RT overloaded mask (rto)
which contains all the CPUs in the scheduler domain with RT tasks that are
waiting to run due to a higher priority RT task currently running on their
CPU. The CPU that is about to schedule a lower priority task will grab the
rq lock of the overloaded CPU and move the RT task from that CPU's runqueue
to the local one and schedule the higher priority RT task.

This caused issues when a lot of CPUs would schedule a lower priority task
at the same time. They would all try to grab the same runqueue lock of
the CPU with the overloaded RT tasks. Only the first CPU that got in will
get that task. All the others would wait until they got the runqueue lock
and see there's nothing to pull and do nothing. On systems with lots of
CPUs, this caused a large latency (up to 500us) which is beyond what
PREEMPT_RT is to allow.

The solution to that was to create an RT_PUSH_IPI logic. When any CPU
wanted to pull a task, instead of grabbing the runqueue lock of the
overloaded CPU, it would start by sending an IPI to the overloaded CPU,
and that IPI handler would have the CPU with the waiting RT task do a push
instead. Then that handler would send an IPI to the next CPU with
overloaded RT tasks, and so on. Note, after the first CPU starts this
process, if another CPU wanted to do a pull, it would see that the process
has already begun and would only increment a counter to have the IPIs
continue again.

The RT_PUSH_IPI solved the latency problem with PREEMPT_RT but could cause
a new issue with non PREEMPT_RT. Namely, softirqs run in a threaded
context on PREEMPT_RT but they can run in an interrupt context in non-RT.

If an IPI lands on a CPU that has just woken up multiple RT tasks and the
current CPU is running a non RT or a low priority RT task, instead of
doing a push, it would simply do a schedule on that CPU. But if a softirq
was also executing on this CPU, the schedule would need to wait until the
softirq finished. Until then, the CPU would still be considered overloaded
as there are RT tasks still waiting to run on it.

A live lock occurred on a workload that was doing heavy networking traffic
on a large machine where the softirqs would run 500us out of 750us. And it
would also be waking up RT tasks, causing the RT pull logic to be
constantly executed.

When a softirq triggered on a CPU with RT tasks queued but not yet
running, the other CPUs would see this CPU as being overloaded and
send an IPI over to it. The CPU would notice that the waiting RT tasks are
of higher priority than the currently running task and simply schedule
that CPU instead. But because the softirq was executing, before it could
schedule, it would receive another IPI to do the same. The number of IPIs
would slow down the currently running softirq so much that before it could
return to task context, it would execute another softirq, never
allowing the CPU to schedule. This live locked that CPU.

As RT_PUSH_IPI was created to help PREEMPT_RT, make it default off if
PREEMPT_RT is not enabled.

Fixes: b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
Closes: https://lore.kernel.org/all/20260506235716.2530720-1-tj@kernel.org/
Reported-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20260515103740.25ccbed8@gandalf.local.home
---
 kernel/sched/features.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 84c4fe3..8f0dee8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -110,8 +110,16 @@ SCHED_FEAT(WARN_DOUBLE_CLOCK, false)
  * rq lock and possibly create a large contention, sending an
  * IPI to that CPU and let that CPU push the RT task to where
  * it should go may be a better scenario.
+ *
+ * This is best for PREEMPT_RT, but for non-RT it can cause issues
+ * when preemption is disabled for long periods of time. Have
+ * it only default enabled for PREEMPT_RT.
  */
+# ifdef CONFIG_PREEMPT_RT
 SCHED_FEAT(RT_PUSH_IPI, true)
+# else
+SCHED_FEAT(RT_PUSH_IPI, false)
+# endif
 #endif
 
 SCHED_FEAT(RT_RUNTIME_SHARE, false)