stack_trace_save() is not guaranteed to be NMI-safe on all
architectures.
The hardlockup detector calls into sched_ext via the following call
chain when an NMI occurs:
watchdog_overflow_callback()
watchdog_hardlockup_check()
scx_hardlockup()
stack_trace_save()
Skip stack trace capture when in_nmi() returns true to prevent
potential deadlocks.
Fixes: 582f700e1bdc ("sched_ext: Hook up hardlockup detector")
Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
---
kernel/sched/ext.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index 05f5a49e9649..a96255ca3a08 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -4678,7 +4678,8 @@ static bool scx_vexit(struct scx_sched *sch,
ei->exit_code = exit_code;
#ifdef CONFIG_STACKTRACE
- if (kind >= SCX_EXIT_ERROR)
+ /* Skip stack trace capture in NMI context as its unsafe. */
+ if (kind >= SCX_EXIT_ERROR && !in_nmi())
ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1);
#endif
vscnprintf(ei->msg, SCX_EXIT_MSG_LEN, fmt, args);
--
2.34.1
Hello,
On Mon, Dec 22, 2025 at 07:50:37PM -0500, Joel Fernandes wrote:
> stack_trace_save() is not guaranteed to be NMI-safe on all
> architectures.
>
> The hardlockup detector calls into sched_ext via the following call
> chain when an NMI occurs:
>
> watchdog_overflow_callback()
> watchdog_hardlockup_check()
> scx_hardlockup()
> stack_trace_save()
>
> Skip stack trace capture when in_nmi() returns true to prevent
> potential deadlocks.
>
> Fixes: 582f700e1bdc ("sched_ext: Hook up hardlockup detector")
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
This does work on x86 (right?) and is useful in understanding what the
underlying problem is. It'd be great if there's a config flag we can test
but if not can we specifically exclude archs which are known to not work?
Thanks.
--
tejun
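A minimal sketch of the arch-specific gating suggested above, written as a user-space mock rather than kernel code. The helper name scx_should_save_bt() and the arch_nmi_unsafe parameter are hypothetical; the thread does not identify an existing config flag, so the caller is assumed to supply that information per architecture.

```c
#include <stdbool.h>

/*
 * User-space mock, not kernel code.
 *
 * is_error_exit   stands in for (kind >= SCX_EXIT_ERROR)
 * in_nmi_ctx      stands in for in_nmi()
 * arch_nmi_unsafe hypothetical per-arch knowledge; no such config symbol
 *                 is named anywhere in this thread
 */
static bool scx_should_save_bt(bool is_error_exit, bool in_nmi_ctx,
			       bool arch_nmi_unsafe)
{
	if (!is_error_exit)
		return false;

	/* Only give up the backtrace where NMI unwinding is known to be bad. */
	if (in_nmi_ctx && arch_nmi_unsafe)
		return false;

	return true;
}
```

With this shape, architectures where NMI unwinding works keep their backtrace, and only the flagged ones lose it.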
> On Dec 22, 2025, at 9:44 PM, Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
>> On Mon, Dec 22, 2025 at 07:50:37PM -0500, Joel Fernandes wrote:
>> stack_trace_save() is not guaranteed to be NMI-safe on all
>> architectures.
>>
>> The hardlockup detector calls into sched_ext via the following call
>> chain when an NMI occurs:
>>
>> watchdog_overflow_callback()
>> watchdog_hardlockup_check()
>> scx_hardlockup()
>> stack_trace_save()
>>
>> Skip stack trace capture when in_nmi() returns true to prevent
>> potential deadlocks.
>>
>> Fixes: 582f700e1bdc ("sched_ext: Hook up hardlockup detector")
>> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
>
> This does work on x86 (right?) and is useful in understanding what the
> underlying problem is. It'd be great if there's a config flag we can test
> but if not can we specifically exclude archs which are known to not work?
You are right that we will miss out on architectures where this is safe. We should make it more specific. I am wondering if Steven Rostedt has any thoughts here since he is actively working on stack tracing/unwinding and has made similar commits in the past where he restricted stack tracing in an NMI context.
Per my understanding, stack trace unwinding is not safe/valid to do on architectures where the NMI context does not have its own stack. But I could stand corrected, hence I marked this as an RFT. It is safe to do on 64-bit x86, but not on 32-bit x86 and other same-stack architectures.
If we feel that this is not an issue, then that is fine with me (and sorry for the noise), but I just wanted to raise it anyway just in case. Sooner or later someone running scx on an odd architecture might complain.
Thanks!
- Joel
>
> Thanks.
>
> --
> tejun
On Tue, 23 Dec 2025 04:34:00 +0000
Joel Fernandes <joelagnelf@nvidia.com> wrote:

> > This does work on x86 (right?) and is useful in understanding what the
> > underlying problem is. It'd be great if there's a config flag we can test
> > but if not can we specifically exclude archs which are known to not work?
>
> You are right that we will miss out on architectures where this is safe.
> We should make it more specific. I am wondering if Steven Rostedt has any
> thoughts here since he is actively working on stack tracing/unwinding and
> has made similar commits in the past where he restricted stack tracing in
> an NMI context.

[ Fixes line wrap, ug it's hard to read emails that go across 300 characters! ]

Well, we do kernel stack tracing in NMI context all the time with no issue
(but I mostly work on x86).

>
> Per my understanding, stack trace unwinding is not safe/valid to do on
> architectures where the NMI context does not have its own stack. But I

Hmm, no, I think it's fine to do it on archs where NMI doesn't have its own
stack. It works on 32bit x86, where the NMI shares the kernel stack.

Which architecture had an issue with a stack trace?

-- Steve

> could stand corrected, hence I marked this as an RFT. It is safe to do
> on 64-bit x86, but not on 32-bit x86 and other same-stack architectures.
>
> If we feel that this is not an issue, then that is fine with me (and
> sorry for the noise), but I just wanted to raise it anyway just in case.
> Sooner or later someone running scx on an odd architecture might
> complaint.
On Tue, Dec 23, 2025 at 03:31:36PM -0500, Steven Rostedt wrote:
> On Tue, 23 Dec 2025 04:34:00 +0000
> Joel Fernandes <joelagnelf@nvidia.com> wrote:
>
> > > This does work on x86 (right?) and is useful in understanding what the
> > > underlying problem is. It'd be great if there's a config flag we can test
> > > but if not can we specifically exclude archs which are known to not work?
> >
> > You are right that we will miss out on architectures where this is safe.
> > We should make it more specific. I am wondering if Steven Rostedt has any
> > thoughts here since he is actively working on stack tracing/unwinding and
> > has made similar commits in the past where he restricted stack tracing in
> > an NMI context.
>
> [ Fixes line wrap, ug it's hard to read emails that go across 300 characters! ]

Sorry about that. Thank you.

> Well, we do kernel stack tracing in NMI context all the time with no issue
> (but I mostly work on x86).
>
> >
> > Per my understanding, stack trace unwinding is not safe/valid to do on
> > architectures where the NMI context does not have its own stack. But I
>
> Hmm, no, I think it's fine to do it on archs where NMI doesn't have its own
> stack. It works on 32bit x86, where the NMI shares the kernel stack.
>
> Which architecture had an issue with a stack trace?

On 32 bit what happens if NMI hits during stack frame setup? Can the unwinder
misbehave if base pointer has not yet been set up and NMI starts using same
stack? Not sure.

Some documentation suggests IST is required for reliable NMI stack tracing
[1] [2] which 32-bit does not have.

“If an interrupt or other exception is taken while the stack or other unwind
state is in an inconsistent state, it may not be possible to reliably unwind,
and it may not be possible to identify whether such unwinding will be
reliable. See below for examples.”

Probably the issue happens to be more of printing garbage than crashing the
kernel, but I am not convinced it is stable. Hmm.

[1] https://www.kernel.org/doc/html/v6.16/arch/x86/kernel-stacks.html
[2] https://docs.kernel.org/livepatch/reliable-stacktrace.html

thanks,

 - Joel

>
> -- Steve
>
> > could stand corrected, hence I marked this as an RFT. It is safe to do
> > on 64-bit x86, but not on 32-bit x86 and other same-stack architectures.
> >
> > If we feel that this is not an issue, then that is fine with me (and
> > sorry for the noise), but I just wanted to raise it anyway just in case.
> > Sooner or later someone running scx on an odd architecture might
> > complaint.
>
On Tue, 23 Dec 2025 18:58:33 -0500
Joel Fernandes <joelagnelf@nvidia.com> wrote:

> Some documentation suggests IST is required for reliable NMI stack tracing
> [1] [2] which 32-bit does not have.
> “If an interrupt or other exception is taken while the stack or other unwind
> state is in an inconsistent state, it may not be possible to reliably unwind,
> and it may not be possible to identify whether such unwinding will be
> reliable. See below for examples.”
>
> Probably the issue happens to be more of printing garbage than crashing the
> kernel, but I am not convinced it is stable. Hmm.

Correct. It's about reliable stack traces, as live kernel patching requires
that the stack it looks at is reliable before it can modify the code. What
happens if it's not reliable, means it will just stop at the interrupt
handler and you don't get to see the rest (or you'll see a bunch of
functions with "?" in front of them).

-- Steve
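The "trace just stops early" behaviour described above can be illustrated with a toy frame-pointer walk. This is a user-space simulation under simplifying assumptions, not the kernel's unwinder: struct mock_frame and mock_unwind() are hypothetical names, and the stack is modelled as an array in which callers sit at higher indices.

```c
#include <stddef.h>

struct mock_frame {
	size_t next_fp;		/* index of the caller's frame (saved frame pointer) */
	unsigned long ret_addr;	/* return address recorded in this frame */
};

/*
 * Walk 'stack' starting at index 'fp'. A saved pointer that does not move
 * toward the callers is treated as inconsistent: the walk ends there and
 * returns the entries collected so far instead of following the bad link.
 */
static unsigned int mock_unwind(const struct mock_frame *stack, size_t nframes,
				size_t fp, unsigned long *out, unsigned int max)
{
	unsigned int n = 0;

	while (fp < nframes && n < max) {
		size_t next = stack[fp].next_fp;

		out[n++] = stack[fp].ret_addr;
		if (next <= fp)		/* bogus link: stop, keep what we have */
			break;
		fp = next;
	}
	return n;
}
```

In this model an interrupted, half-built frame yields a shorter trace rather than a crash, which matches the outcome described in the reply above.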
> On Dec 24, 2025, at 9:17 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 23 Dec 2025 18:58:33 -0500
> Joel Fernandes <joelagnelf@nvidia.com> wrote:
>
>> Some documentation suggests IST is required for reliable NMI stack tracing
>> [1] [2] which 32-bit does not have.
>> “If an interrupt or other exception is taken while the stack or other unwind
>> state is in an inconsistent state, it may not be possible to reliably unwind,
>> and it may not be possible to identify whether such unwinding will be
>> reliable. See below for examples.”
>>
>> Probably the issue happens to be more of printing garbage than crashing the
>> kernel, but I am not convinced it is stable. Hmm.
>
> Correct. It's about reliable stack traces, as live kernel patching requires
> that the stack it looks at is reliable before it can modify the code. What
> happens if it's not reliable, means it will just stop at the interrupt
> handler and you don't get to see the rest (or you'll see a bunch of
> functions with "?" in front of them).

Ah, thanks Steve for clarifying!

 - Joel

>
> -- Steve
On Mon, Dec 22, 2025 at 07:50:37PM -0500, Joel Fernandes wrote:
> stack_trace_save() is not guaranteed to be NMI-safe on all
> architectures.
>
> The hardlockup detector calls into sched_ext via the following call
> chain when an NMI occurs:
>
> watchdog_overflow_callback()
> watchdog_hardlockup_check()
> scx_hardlockup()
> stack_trace_save()
>
> Skip stack trace capture when in_nmi() returns true to prevent
> potential deadlocks.
>
> Fixes: 582f700e1bdc ("sched_ext: Hook up hardlockup detector")
> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
> ---
> kernel/sched/ext.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 05f5a49e9649..a96255ca3a08 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -4678,7 +4678,8 @@ static bool scx_vexit(struct scx_sched *sch,
>
> ei->exit_code = exit_code;
> #ifdef CONFIG_STACKTRACE
> - if (kind >= SCX_EXIT_ERROR)
> + /* Skip stack trace capture in NMI context as its unsafe. */
nit: s/its/it's/
> + if (kind >= SCX_EXIT_ERROR && !in_nmi())
> ei->bt_len = stack_trace_save(ei->bt, SCX_EXIT_BT_LEN, 1);
If stack_trace_save() isn't NMI-safe on certain architectures, shouldn't we
fix this inside stack_trace_save()?
There are probably other places where we call stack_trace_save() without
checking in_nmi(). Making stack_trace_save() handle the NMI case would
solve all of them.
> #endif
> vscnprintf(ei->msg, SCX_EXIT_MSG_LEN, fmt, args);
> --
> 2.34.1
>
Thanks,
-Andrea
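A rough sketch of the alternative raised above, where the NMI check lives in one central place instead of at each call site. This is a user-space mock: mock_in_nmi, mock_arch_stack_walk() and stack_trace_save_guarded() are hypothetical stand-ins, and the real stack_trace_save() is not known to behave this way.

```c
#include <stdbool.h>

/* Mock state standing in for the kernel's in_nmi(); hypothetical. */
static bool mock_in_nmi;

/* Stand-in for the real unwinder: pretends to record up to three frames. */
static unsigned int mock_arch_stack_walk(unsigned long *store, unsigned int size)
{
	unsigned int i;

	for (i = 0; i < size && i < 3; i++)
		store[i] = 0x1000UL + i;
	return i;
}

/*
 * Hypothetical central guard: every caller gets the NMI check for free and
 * sees an empty trace (length 0) when unwinding is skipped.
 */
static unsigned int stack_trace_save_guarded(unsigned long *store,
					     unsigned int size)
{
	if (mock_in_nmi)
		return 0;
	return mock_arch_stack_walk(store, size);
}
```

The trade-off is that an unconditional guard like this would also suppress traces on architectures where NMI unwinding works, which is the concern raised earlier in the thread.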