sched_ext: Skip stack trace capture in NMI context

[RFT] sched_ext: Skip stack trace capture in NMI context

Posted by Joel Fernandes 1 month, 2 weeks ago

stack_trace_save() is not guaranteed to be NMI-safe on all
architectures.

The hardlockup detector calls into sched_ext via the following call
chain when an NMI occurs:

  watchdog_overflow_callback()
    watchdog_hardlockup_check()
      scx_hardlockup()
        stack_trace_save()

Skip stack trace capture when in_nmi() returns true to prevent
potential deadlocks.

Fixes: 582f700e1bdc ("sched_ext: Hook up hardlockup detector")
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
 kernel/sched/ext.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 05f5a49e9649..a96255ca3a08 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4678,7 +4678,8 @@ static bool scx_vexit(struct scx_sched *sch,
 
 	ei->exit_code = exit_code;
 #ifdef CONFIG_STACKTRACE
-	if (kind >= SCX_EXIT_ERROR)
+	/* Skip stack trace capture in NMI context as its unsafe. */
+	if (kind >= SCX_EXIT_ERROR && !in_nmi())
 		ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1);
 #endif
 	vscnprintf(ei->msg, SCX_EXIT_MSG_LEN, fmt, args);
-- 
2.34.1

Re: [RFT] sched_ext: Skip stack trace capture in NMI context

Posted by Tejun Heo 1 month, 2 weeks ago

Hello,

On Mon, Dec 22, 2025 at 07:50:37PM -0500, Joel Fernandes wrote:
> stack_trace_save() is not guaranteed to be NMI-safe on all
> architectures.
> 
> The hardlockup detector calls into sched_ext via the following call
> chain when an NMI occurs:
> 
>   watchdog_overflow_callback()
>     watchdog_hardlockup_check()
>       scx_hardlockup()
>         stack_trace_save()
> 
> Skip stack trace capture when in_nmi() returns true to prevent
> potential deadlocks.
> 
> Fixes: 582f700e1bdc ("sched_ext: Hook up hardlockup detector")
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>

This does work on x86 (right?) and is useful in understanding what the
underlying problem is. It'd be great if there's a config flag we can test
but if not can we specifically exclude archs which are known to not work?

Thanks.

-- 
tejun
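
One way to picture the "exclude only the archs that are known not to work" idea is a per-architecture opt-in that the capture site consults. The following is a minimal userspace sketch, not kernel code: ARCH_NMI_SAFE_STACKTRACE is a hypothetical symbol (no such config option is named in this thread), and in_nmi() and stack_trace_save() are replaced by stand-ins so the logic can be compiled and exercised on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * HYPOTHETICAL: stands in for a per-architecture opt-in. No such kernel
 * config symbol is named in the thread; it would be defined to 1 only on
 * architectures where unwinding from NMI context is known to work.
 */
#ifndef ARCH_NMI_SAFE_STACKTRACE
#define ARCH_NMI_SAFE_STACKTRACE 0
#endif

/* Userspace stand-in for the kernel's in_nmi(). */
static bool fake_in_nmi;

static bool in_nmi(void)
{
	return fake_in_nmi;
}

/* Userspace stand-in for stack_trace_save(): fills dummy entries. */
static unsigned int stack_trace_save(unsigned long *store, unsigned int size,
				     unsigned int skipnr)
{
	unsigned int i;

	(void)skipnr;
	for (i = 0; i < size; i++)
		store[i] = 0x1000 + i;
	return size;
}

/* Capture a backtrace unless in NMI on an arch that has not opted in. */
static unsigned int capture_bt(unsigned long *bt, unsigned int max)
{
	assert(bt != NULL && max > 0);

	if (in_nmi() && !ARCH_NMI_SAFE_STACKTRACE)
		return 0;
	return stack_trace_save(bt, max, 1);
}
```

With the opt-in left at 0, capture_bt() returns 0 entries in the simulated NMI case and a full trace otherwise; building with -DARCH_NMI_SAFE_STACKTRACE=1 keeps the trace in both cases.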

Re: [RFT] sched_ext: Skip stack trace capture in NMI context

Posted by Joel Fernandes 1 month, 2 weeks ago

> On Dec 22, 2025, at 9:44 PM, Tejun Heo <tj@kernel.org> wrote:
> 
> Hello,
> 
>> On Mon, Dec 22, 2025 at 07:50:37PM -0500, Joel Fernandes wrote:
>> stack_trace_save() is not guaranteed to be NMI-safe on all
>> architectures.
>> 
>> The hardlockup detector calls into sched_ext via the following call
>> chain when an NMI occurs:
>> 
>>  watchdog_overflow_callback()
>>    watchdog_hardlockup_check()
>>      scx_hardlockup()
>>        stack_trace_save()
>> 
>> Skip stack trace capture when in_nmi() returns true to prevent
>> potential deadlocks.
>> 
>> Fixes: 582f700e1bdc ("sched_ext: Hook up hardlockup detector")
>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> 
> This does work on x86 (right?) and is useful in understanding what the
> underlying problem is. It'd be great if there's a config flag we can test
> but if not can we specifically exclude archs which are known to not work?

You are right that we will miss out on architectures where this is safe. We should make it more specific. I am wondering if Steven Rostedt has any thoughts here since he is actively working on stack tracing/unwinding and has made similar commits in the past where he restricted stack tracing in an NMI context.

Per my understanding, stack trace unwinding is not safe/valid to do on architectures where the NMI context does not have its own stack. But I could stand corrected, hence I marked this as an RFT.  It is safe to do on 64-bit x86, but not on 32-bit x86 and other same-stack architectures.

If we feel that this is not an issue, then that is fine with me (and sorry for the noise), but I just wanted to raise it anyway just in case. Sooner or later someone running scx on an odd architecture might complain.

Thanks!

 - Joel

> 
> Thanks.
> 
> --
> tejun

Re: [RFT] sched_ext: Skip stack trace capture in NMI context

Posted by Steven Rostedt 1 month, 2 weeks ago

On Tue, 23 Dec 2025 04:34:00 +0000
Joel Fernandes <joelagnelf@nvidia.com> wrote:

> > This does work on x86 (right?) and is useful in understanding what the
> > underlying problem is. It'd be great if there's a config flag we can test
> > but if not can we specifically exclude archs which are known to not work?  
> 
> You are right that we will miss out on architectures where this is safe.
> We should make it more specific. I am wondering if Steven Rostedt has any
> thoughts here since he is actively working on stack tracing/unwinding and
> has made similar commits in the past where he restricted stack tracing in
> an NMI context.

[ Fixes line wrap, ug it's hard to read emails that go across 300 characters! ]

Well, we do kernel stack tracing in NMI context all the time with no issue
(but I mostly work on x86).

> 
> Per my understanding, stack trace unwinding is not safe/valid to do on
> architectures where the NMI context does not have its own stack. But I

Hmm, no, I think it's fine to do it on archs where NMI doesn't have its own
stack. It works on 32bit x86, where the NMI shares the kernel stack.

Which architecture had an issue with a stack trace?

-- Steve


> could stand corrected, hence I marked this as an RFT.  It is safe to do
> on 64-bit x86, but not on 32-bit x86 and other same-stack architectures.
> 
> If we feel that this is not an issue, then that is fine with me (and
> sorry for the noise), but I just wanted to raise it anyway just in case.
> Sooner or later someone running scx on an odd architecture might
> complain.

Re: [RFT] sched_ext: Skip stack trace capture in NMI context

Posted by Joel Fernandes 1 month, 2 weeks ago

On Tue, Dec 23, 2025 at 03:31:36PM -0500, Steven Rostedt wrote:
> On Tue, 23 Dec 2025 04:34:00 +0000
> Joel Fernandes <joelagnelf@nvidia.com> wrote:
> 
> > > This does work on x86 (right?) and is useful in understanding what the
> > > underlying problem is. It'd be great if there's a config flag we can test
> > > but if not can we specifically exclude archs which are known to not work?  
> > 
> > You are right that we will miss out on architectures where this is safe.
> > We should make it more specific. I am wondering if Steven Rostedt has any
> > thoughts here since he is actively working on stack tracing/unwinding and
> > has made similar commits in the past where he restricted stack tracing in
> > an NMI context.
> 
> [ Fixes line wrap, ug it's hard to read emails that go across 300 characters! ]

Sorry about that. Thank you.

> Well, we do kernel stack tracing in NMI context all the time with no issue
> (but I mostly work on x86).
> 
> > 
> > Per my understanding, stack trace unwinding is not safe/valid to do on
> > architectures where the NMI context does not have its own stack. But I
> 
> Hmm, no, I think it's fine to do it on archs where NMI doesn't have its own
> stack. It works on 32bit x86, where the NMI shares the kernel stack.
> 
> Which architecture had an issue with a stack trace?

On 32-bit, what happens if an NMI hits during stack frame setup? Can the
unwinder misbehave if the base pointer has not yet been set up and the NMI
starts using the same stack?

Not sure.

Some documentation suggests IST is required for reliable NMI stack tracing
[1] [2] which 32-bit does not have.
"If an interrupt or other exception is taken while the stack or other unwind
state is in an inconsistent state, it may not be possible to reliably unwind,
and it may not be possible to identify whether such unwinding will be
reliable. See below for examples."

Probably the issue happens to be more of printing garbage than crashing the
kernel, but I am not convinced it is stable. Hmm.

[1] https://www.kernel.org/doc/html/v6.16/arch/x86/kernel-stacks.html
[2] https://docs.kernel.org/livepatch/reliable-stacktrace.html

thanks,

 - Joel


> 
> -- Steve
> 
> 
> > could stand corrected, hence I marked this as an RFT.  It is safe to do
> > on 64-bit x86, but not on 32-bit x86 and other same-stack architectures.
> > 
> > If we feel that this is not an issue, then that is fine with me (and
> > sorry for the noise), but I just wanted to raise it anyway just in case.
> > Sooner or later someone running scx on an odd architecture might
> > complain.
> 
>

Re: [RFT] sched_ext: Skip stack trace capture in NMI context

Posted by Steven Rostedt 1 month, 2 weeks ago

On Tue, 23 Dec 2025 18:58:33 -0500
Joel Fernandes <joelagnelf@nvidia.com> wrote:

> Some documentation suggests IST is required for reliable NMI stack tracing
> [1] [2] which 32-bit does not have.
> "If an interrupt or other exception is taken while the stack or other unwind
> state is in an inconsistent state, it may not be possible to reliably unwind,
> and it may not be possible to identify whether such unwinding will be
> reliable. See below for examples."
> 
> Probably the issue happens to be more of printing garbage than crashing the
> kernel, but I am not convinced it is stable. Hmm.

Correct. It's about reliable stack traces, as live kernel patching requires
that the stack it looks at is reliable before it can modify the code. If
the trace is not reliable, it will just stop at the interrupt handler and
you don't get to see the rest (or you'll see a bunch of functions with "?"
in front of them).

-- Steve
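
The "it will just stop" behaviour can be illustrated with a toy frame-pointer walker. This is a userspace model under stated assumptions, not the kernel's unwinder: the stack is an array of words, each frame stores the caller's frame index followed by a return address, and all names (walk_frames, FP_END, STACK_WORDS) are made up for the example. The point is only that a walker which validates each link ends the trace early on an inconsistent frame instead of crashing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STACK_WORDS 16
#define FP_END 0	/* saved-FP value marking the outermost frame */

struct walk_result {
	unsigned int nr;	/* return addresses recorded */
	bool reliable;		/* false if the walk stopped on a bad frame */
};

/*
 * Frame layout in this model: stack[fp] holds the caller's frame index,
 * stack[fp + 1] holds the return address. Caller frames must sit at
 * higher indices, mimicking a stack that grows down.
 */
static struct walk_result walk_frames(const unsigned long *stack, size_t fp,
				      unsigned long *out, unsigned int max)
{
	struct walk_result res = { 0, true };

	assert(stack != NULL && out != NULL);

	while (res.nr < max) {
		unsigned long next;

		/* Frame must lie fully inside the stack. */
		if (fp >= STACK_WORDS - 1) {
			res.reliable = false;
			break;
		}
		out[res.nr++] = stack[fp + 1];

		next = stack[fp];
		if (next == FP_END)
			break;		/* clean end of the chain */
		if (next <= fp) {
			/* Link points the wrong way: stop, do not follow it. */
			res.reliable = false;
			break;
		}
		fp = next;
	}
	return res;
}
```

A consistent chain yields every return address with reliable set; a chain whose link was clobbered (for example, interrupted mid frame setup) yields the entries found so far with reliable cleared, which corresponds to a truncated trace rather than a fault.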

Re: [RFT] sched_ext: Skip stack trace capture in NMI context

Posted by Joel Fernandes 1 month, 2 weeks ago


> On Dec 24, 2025, at 9:17 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Tue, 23 Dec 2025 18:58:33 -0500
> Joel Fernandes <joelagnelf@nvidia.com> wrote:
> 
>> Some documentation suggests IST is required for reliable NMI stack tracing
>> [1] [2] which 32-bit does not have.
>> "If an interrupt or other exception is taken while the stack or other unwind
>> state is in an inconsistent state, it may not be possible to reliably unwind,
>> and it may not be possible to identify whether such unwinding will be
>> reliable. See below for examples."
>> 
>> Probably the issue happens to be more of printing garbage than crashing the
>> kernel, but I am not convinced it is stable. Hmm.
> 
> Correct. It's about reliable stack traces, as live kernel patching requires
> that the stack it looks at is reliable before it can modify the code. If
> the trace is not reliable, it will just stop at the interrupt handler and
> you don't get to see the rest (or you'll see a bunch of functions with "?"
> in front of them).

Ah, thanks Steve for clarifying!

 - Joel


> 
> -- Steve

Re: [RFT] sched_ext: Skip stack trace capture in NMI context

Posted by Andrea Righi 1 month, 2 weeks ago

On Mon, Dec 22, 2025 at 07:50:37PM -0500, Joel Fernandes wrote:
> stack_trace_save() is not guaranteed to be NMI-safe on all
> architectures.
> 
> The hardlockup detector calls into sched_ext via the following call
> chain when an NMI occurs:
> 
>   watchdog_overflow_callback()
>     watchdog_hardlockup_check()
>       scx_hardlockup()
>         stack_trace_save()
> 
> Skip stack trace capture when in_nmi() returns true to prevent
> potential deadlocks.
> 
> Fixes: 582f700e1bdc ("sched_ext: Hook up hardlockup detector")
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
>  kernel/sched/ext.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 05f5a49e9649..a96255ca3a08 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -4678,7 +4678,8 @@ static bool scx_vexit(struct scx_sched *sch,
>  
>  	ei->exit_code = exit_code;
>  #ifdef CONFIG_STACKTRACE
> -	if (kind >= SCX_EXIT_ERROR)
> +	/* Skip stack trace capture in NMI context as its unsafe. */

nit: s/its/it's/

> +	if (kind >= SCX_EXIT_ERROR && !in_nmi())
>  		ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1);

If stack_trace_save() isn't NMI-safe on certain architectures, shouldn't we
fix this inside stack_trace_save()?

There are probably other places where we call stack_trace_save() without
checking in_nmi(). Making stack_trace_save() handle the NMI case would
solve all of them.

>  #endif
>  	vscnprintf(ei->msg, SCX_EXIT_MSG_LEN, fmt, args);
> -- 
> 2.34.1
> 

Thanks,
-Andrea
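
The alternative raised above, handling the NMI case in one central place rather than at each call site, might look roughly like the sketch below. It is a userspace model only: trace_save() is a hypothetical central entry point, raw_trace_save() stands in for the underlying unwinder, and in_nmi() is faked with a flag. Whether the real stack_trace_save() should behave this way is exactly the open question in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's in_nmi(). */
static bool fake_in_nmi;

static bool in_nmi(void)
{
	return fake_in_nmi;
}

/* Stand-in for the raw unwinder: records one dummy entry per slot. */
static unsigned int raw_trace_save(unsigned long *store, unsigned int size)
{
	unsigned int i;

	for (i = 0; i < size; i++)
		store[i] = 0x2000 + i;
	return size;
}

/*
 * HYPOTHETICAL central entry point: the NMI check lives here, so callers
 * do not each need their own in_nmi() test.
 */
static unsigned int trace_save(unsigned long *store, unsigned int size)
{
	assert(store != NULL);

	if (in_nmi())
		return 0;
	return raw_trace_save(store, size);
}

/* Two unrelated callers; neither repeats the in_nmi() check. */
static unsigned int caller_a(unsigned long *bt)
{
	return trace_save(bt, 4);
}

static unsigned int caller_b(unsigned long *bt)
{
	return trace_save(bt, 2);
}
```

The trade-off in this shape is that every caller loses the trace in NMI context, including on architectures where, per the discussion above, unwinding from NMI works in practice.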