When a kernel thread hits BUG() outside of an interrupt handler and
panic_on_oops is not set, it exits via make_task_dead(), which is called by die().
In this case, the nmi_nesting value in context_tracking becomes
inconsistent because the proper context tracking exit functions are not reached.
Here’s an example scenario on arm64:
1. A kernel thread hits the BUG() macro outside an interrupt handler,
and panic_on_oops is not set (ct->nmi_nesting == CT_NESTING_IRQ_NONIDLE).
2. The exception handler jumps to el1_dbg() and calls arm64_enter_el1_dbg(),
which invokes ct_nmi_enter(). (ct->nmi_nesting == CT_NESTING_IRQ_NONIDLE + 2)
3. bug_handler() runs, and if the bug type is BUG_TRAP_TYPE_BUG, it calls die().
4. die() then calls make_task_dead(), which terminates the kernel thread and
schedules another task—assume this is the idle_task.
5. The idle_task attempts to enter the idle state by calling ct_idle_enter().
However, since the current ct->nmi_nesting value is CT_NESTING_IRQ_NONIDLE + 2,
ct_kernel_exit() triggers a WARN_ON_ONCE() warning.
Because the kernel thread couldn’t call the appropriate context tracking exit
function in step 3, the ct->nmi_nesting value remains incorrect.
This leads to warnings like the following:
[ 7.133093] ------------[ cut here ]------------
[ 7.133129] WARNING: CPU: 2 PID: 0 at kernel/context_tracking.c:127 ct_kernel
[ 7.134157] Modules linked in:
[ 7.134158] not ok 62 kasan_strings
[ 7.134280]
[ 7.134506] CPU: 2 UID: 0 PID: 0 Comm: swapper/2 Tainted: G B D W N
[ 7.134930] Tainted: [B]=BAD_PAGE, [D]=DIE, [W]=WARN, [N]=TEST
[ 7.135150] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 7.135441] pc : ct_kernel_exit+0xa4/0xb0
[ 7.135656] lr : ct_kernel_exit+0x1c/0xb0
[ 7.135866] sp : ffff8000829bbd90
[ 7.136011] x29: ffff8000829bbd90 x28: ffff80008224ecf0 x27: 0000000000000004
[ 7.136379] x26: ffff80008228ed30 x25: ffff80008228e000 x24: 0000000000000000
[ 7.137016] x23: f3ff000800a52280 x22: 0000000000000000 x21: ffff00087b6c7408
[ 7.137380] x20: ffff80008224b408 x19: 0000000000000005 x18: 0000000000000000
[ 7.137741] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 7.311316] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 7.311673] x11: 0000000000000000 x10: 0000000000000000 x9 : 4000000000000000
[ 7.312031] x8 : 4000000000000002 x7 : 0000000000000000 x6 : 0000000000000000
[ 7.312387] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 7.312740] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff8007f947c000
[ 7.313096] Call trace:
[ 7.313210] ct_kernel_exit+0xa4/0xb0 (P)
[ 7.313445] ct_idle_enter+0x14/0x28
[ 7.313666] default_idle_call+0x2c/0x60
[ 7.313902] do_idle+0xec/0x320
[ 7.314104] cpu_startup_entry+0x40/0x50
[ 7.314331] secondary_start_kernel+0x120/0x1a0
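A minimal way to trigger the above (hypothetical reproducer sketch, not part of
this patch; module and symbol names are made up):

/*
 * Hypothetical reproducer sketch: a kernel thread hits BUG() outside
 * interrupt context. With panic_on_oops=0 on arm64, die() ->
 * make_task_dead() never returns, so the ct_nmi_enter() done on debug
 * exception entry is never balanced by ct_nmi_exit().
 */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/err.h>
#include <linux/bug.h>

static int bug_thread_fn(void *unused)
{
	BUG();	/* BRK -> el1_dbg() -> bug_handler() -> die() */
	return 0;
}

static int __init nmi_nesting_repro_init(void)
{
	struct task_struct *tsk;

	tsk = kthread_run(bug_thread_fn, NULL, "nmi-nesting-repro");
	return PTR_ERR_OR_ZERO(tsk);
}
module_init(nmi_nesting_repro_init);

MODULE_LICENSE("GPL");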
This behavior is specific to the arm64 architecture, where ct_nmi_enter()
is called when handling a debug exception.
In contrast, on other architectures, ct_nmi_enter() is not called when
handling BUG().
For example, when x86 handles X86_TRAP_UD via handle_invalid_op(), it does not
call ct_nmi_enter(), so this issue does not arise there.
(While irqentry_enter() does call ct_nmi_enter() for idle tasks,
that doesn't apply to debug exception handling.)
To address the issue of a corrupted ct->nmi_nesting value,
add a check before calling make_task_dead() in die().
If the current CPU’s ct->nmi_nesting is non-zero and not equal to
CT_NESTING_IRQ_NONIDLE, then ct_nmi_exit() should be called.
Fixes: 2a9b3e6ac69a ("arm64: entry: fix EL1 debug transitions")
Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
arch/arm64/kernel/traps.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index 529cff825531..9cf03b9ce691 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -227,8 +227,14 @@ void die(const char *str, struct pt_regs *regs, long err)
raw_spin_unlock_irqrestore(&die_lock, flags);
- if (ret != NOTIFY_STOP)
+ if (ret != NOTIFY_STOP) {
+#ifdef CONFIG_CONTEXT_TRACKING_IDLE
+ long nmi_nesting = ct_nmi_nesting();
+ if (nmi_nesting && nmi_nesting != CT_NESTING_IRQ_NONIDLE)
+ ct_nmi_exit();
+#endif
make_task_dead(SIGSEGV);
+ }
}
static void arm64_show_signal(int signo, const char *str)
--
[+Ada]
On Fri, May 30, 2025 at 10:27:23AM +0100, Yeoreum Yun wrote:
> When a kernel thread hits BUG() outside of an interrupt handler and
> panic_on_oops is not set, it exits via make_task_dead(), which is called by die().
> In this case, the nmi_nesting value in context_tracking becomes
> inconsistent because the proper context tracking exit functions are not reached.
>
> Here’s an example scenario on arm64:
> 1. A kernel thread hits the BUG() macro outside an interrupt handler,
> and panic_on_oops is not set (ct->nmi_nesting == CT_NESTING_IRQ_NONIDLE).
>
> 2. The exception handler jumps to el1_dbg() and calls arm64_enter_el1_dbg(),
> which invokes ct_nmi_enter(). (ct->nmi_nesting == CT_NESTING_IRQ_NONIDLE + 2)
>
> 3. bug_handler() runs, and if the bug type is BUG_TRAP_TYPE_BUG, it calls die().
>
> 4. die() then calls make_task_dead(), which terminates the kernel thread and
> schedules another task—assume this is the idle_task.
This sounds like there is a genuine imbalance, then, as we're scheduling
in the context of an exception taken from EL1.
> 5. The idle_task attempts to enter the idle state by calling ct_idle_enter().
> However, since the current ct->nmi_nesting value is CT_NESTING_IRQ_NONIDLE + 2,
> ct_kernel_exit() triggers a WARN_ON_ONCE() warning.
>
> Because the kernel thread couldn’t call the appropriate context tracking exit
> function in step 3, the ct->nmi_nesting value remains incorrect.
> This leads to warnings like the following:
>
> [ 7.133093] ------------[ cut here ]------------
> [ 7.133129] WARNING: CPU: 2 PID: 0 at kernel/context_tracking.c:127 ct_kernel
> [ 7.134157] Modules linked in:
> [ 7.134158] not ok 62 kasan_strings
> [ 7.134280]
> [ 7.134506] CPU: 2 UID: 0 PID: 0 Comm: swapper/2 Tainted: G B D W N
> [ 7.134930] Tainted: [B]=BAD_PAGE, [D]=DIE, [W]=WARN, [N]=TEST
> [ 7.135150] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 7.135441] pc : ct_kernel_exit+0xa4/0xb0
> [ 7.135656] lr : ct_kernel_exit+0x1c/0xb0
> [ 7.135866] sp : ffff8000829bbd90
> [ 7.136011] x29: ffff8000829bbd90 x28: ffff80008224ecf0 x27: 0000000000000004
> [ 7.136379] x26: ffff80008228ed30 x25: ffff80008228e000 x24: 0000000000000000
> [ 7.137016] x23: f3ff000800a52280 x22: 0000000000000000 x21: ffff00087b6c7408
> [ 7.137380] x20: ffff80008224b408 x19: 0000000000000005 x18: 0000000000000000
> [ 7.137741] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> [ 7.311316] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> [ 7.311673] x11: 0000000000000000 x10: 0000000000000000 x9 : 4000000000000000
> [ 7.312031] x8 : 4000000000000002 x7 : 0000000000000000 x6 : 0000000000000000
> [ 7.312387] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
> [ 7.312740] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff8007f947c000
> [ 7.313096] Call trace:
> [ 7.313210] ct_kernel_exit+0xa4/0xb0 (P)
> [ 7.313445] ct_idle_enter+0x14/0x28
> [ 7.313666] default_idle_call+0x2c/0x60
> [ 7.313902] do_idle+0xec/0x320
> [ 7.314104] cpu_startup_entry+0x40/0x50
> [ 7.314331] secondary_start_kernel+0x120/0x1a0
>
> This behavior is specific to the arm64 architecture, where ct_nmi_enter()
> is called when handling a debug exception.
> In contrast, on other architectures, ct_nmi_enter() is not called when
> handling BUG().
> (i.e) when handling X86_TRAP_UD via handle_invalid_op(), it doesn't call
> ct_nmi_enter(). so it doesn’t cause any issues
> (While irq_entry_enter() does call ct_nmi_enter() for idle tasks,
> that doesn’t apply to debug exception handling).
It sounds like you're suggesting that we don't update the
context-tracking NMI state for BRK exceptions from EL1, to align
with x86. I think Ada's pending series might make that easier, but then
the patch you propose:
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index 529cff825531..9cf03b9ce691 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -227,8 +227,14 @@ void die(const char *str, struct pt_regs *regs, long err)
>
> raw_spin_unlock_irqrestore(&die_lock, flags);
>
> - if (ret != NOTIFY_STOP)
> + if (ret != NOTIFY_STOP) {
> +#ifdef CONFIG_CONTEXT_TRACKING_IDLE
> + long nmi_nesting = ct_nmi_nesting();
> + if (nmi_nesting && nmi_nesting != CT_NESTING_IRQ_NONIDLE)
> + ct_nmi_exit();
> +#endif
tries to undo the context-tracking when we realise we're going to kill
the task, which feels like a hack.
Will
Hi Will,
> [+Ada]
>
> On Fri, May 30, 2025 at 10:27:23AM +0100, Yeoreum Yun wrote:
> > When a kernel thread hits BUG() outside of an interrupt handler and
> > panic_on_oops is not set, it exits via make_task_dead(), which is called by die().
> > In this case, the nmi_nesting value in context_tracking becomes
> > inconsistent because the proper context tracking exit functions are not reached.
> >
> > Here’s an example scenario on arm64:
> > 1. A kernel thread hits the BUG() macro outside an interrupt handler,
> > and panic_on_oops is not set (ct->nmi_nesting == CT_NESTING_IRQ_NONIDLE).
> >
> > 2. The exception handler jumps to el1_dbg() and calls arm64_enter_el1_dbg(),
> > which invokes ct_nmi_enter(). (ct->nmi_nesting == CT_NESTING_IRQ_NONIDLE + 2)
> >
> > 3. bug_handler() runs, and if the bug type is BUG_TRAP_TYPE_BUG, it calls die().
> >
> > 4. die() then calls make_task_dead(), which terminates the kernel thread and
> > schedules another task—assume this is the idle_task.
>
> This sounds like there is a genuine imbalance, then, as we're scheduling
> in the context of an exception taken from EL1.
TBH, this "scheduling" happens in do_exit(), to kill the task that hit
BUG():
el1_dbg()
-> arm64_enter_el1_dbg()
-> do_debug_exception()
-> die()
-> make_task_dead
-> do_exit()
-> schedule()
// unreachable
-> arm64_exit_el1_dbg()
Because arm64_enter_el1_dbg() always calls ct_nmi_enter(), once
do_debug_exception() decides to call die(), there is no point at which
ct_nmi_exit() gets called.
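To make that unreachable exit concrete, el1_dbg() looks roughly like this
(paraphrased and simplified from arch/arm64/kernel/entry-common.c, details such
as errata workarounds omitted, so it may not match the exact source):

/*
 * Paraphrased sketch of el1_dbg(). The balancing ct_nmi_exit() only runs
 * in arm64_exit_el1_dbg(), i.e. after do_debug_exception() returns; when
 * die() -> make_task_dead() never returns, that line is never reached.
 */
static void noinstr el1_dbg(struct pt_regs *regs, unsigned long esr)
{
	unsigned long far = read_sysreg(far_el1);

	arm64_enter_el1_dbg(regs);		/* ct_nmi_enter() */
	do_debug_exception(far, esr, regs);	/* may call die() and not return */
	arm64_exit_el1_dbg(regs);		/* ct_nmi_exit() -- skipped on die() */
}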
>
> > 5. The idle_task attempts to enter the idle state by calling ct_idle_enter().
> > However, since the current ct->nmi_nesting value is CT_NESTING_IRQ_NONIDLE + 2,
> > ct_kernel_exit() triggers a WARN_ON_ONCE() warning.
> >
> > Because the kernel thread couldn’t call the appropriate context tracking exit
> > function in step 3, the ct->nmi_nesting value remains incorrect.
> > This leads to warnings like the following:
> >
> > [ 7.133093] ------------[ cut here ]------------
> > [ 7.133129] WARNING: CPU: 2 PID: 0 at kernel/context_tracking.c:127 ct_kernel
> > [ 7.134157] Modules linked in:
> > [ 7.134158] not ok 62 kasan_strings
> > [ 7.134280]
> > [ 7.134506] CPU: 2 UID: 0 PID: 0 Comm: swapper/2 Tainted: G B D W N
> > [ 7.134930] Tainted: [B]=BAD_PAGE, [D]=DIE, [W]=WARN, [N]=TEST
> > [ 7.135150] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [ 7.135441] pc : ct_kernel_exit+0xa4/0xb0
> > [ 7.135656] lr : ct_kernel_exit+0x1c/0xb0
> > [ 7.135866] sp : ffff8000829bbd90
> > [ 7.136011] x29: ffff8000829bbd90 x28: ffff80008224ecf0 x27: 0000000000000004
> > [ 7.136379] x26: ffff80008228ed30 x25: ffff80008228e000 x24: 0000000000000000
> > [ 7.137016] x23: f3ff000800a52280 x22: 0000000000000000 x21: ffff00087b6c7408
> > [ 7.137380] x20: ffff80008224b408 x19: 0000000000000005 x18: 0000000000000000
> > [ 7.137741] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> > [ 7.311316] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> > [ 7.311673] x11: 0000000000000000 x10: 0000000000000000 x9 : 4000000000000000
> > [ 7.312031] x8 : 4000000000000002 x7 : 0000000000000000 x6 : 0000000000000000
> > [ 7.312387] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
> > [ 7.312740] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff8007f947c000
> > [ 7.313096] Call trace:
> > [ 7.313210] ct_kernel_exit+0xa4/0xb0 (P)
> > [ 7.313445] ct_idle_enter+0x14/0x28
> > [ 7.313666] default_idle_call+0x2c/0x60
> > [ 7.313902] do_idle+0xec/0x320
> > [ 7.314104] cpu_startup_entry+0x40/0x50
> > [ 7.314331] secondary_start_kernel+0x120/0x1a0
> >
> > This behavior is specific to the arm64 architecture, where ct_nmi_enter()
> > is called when handling a debug exception.
> > In contrast, on other architectures, ct_nmi_enter() is not called when
> > handling BUG().
> > (i.e) when handling X86_TRAP_UD via handle_invalid_op(), it doesn't call
> > ct_nmi_enter(). so it doesn’t cause any issues
> > (While irq_entry_enter() does call ct_nmi_enter() for idle tasks,
> > that doesn’t apply to debug exception handling).
>
> It sounds like you're suggesting that we don't update the
> context-tracking NMI state for BRK exceptions from EL1, to align
> with x86.
If el1_dbg() is never taken from the idle task,
I think arm64_enter_el1_dbg() wouldn't need to call ct_nmi_enter(),
since nmi_nesting would always be >= CT_NESTING_IRQ_NONIDLE and RCU would already be watching this CPU.
But it seems el1_dbg() could be taken between ct_idle_enter() and ct_idle_exit().
This actually seems possible in theory when some idle code hits BUG() --
e.g. a cpuidle driver's enter callback contains a BUG().
However, that case raises another question: what happens if the idle task
is killed? (I think that ends up as a panic()...)
So, if arm64_enter_el1_dbg() doesn't need to call ct_nmi_enter(), then
instead __nmi_enter() should be called only for the idle task.
Am I wrong?
> I think Ada's pending series might make that easier, but then
> the patch you propose:
>
> > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > index 529cff825531..9cf03b9ce691 100644
> > --- a/arch/arm64/kernel/traps.c
> > +++ b/arch/arm64/kernel/traps.c
> > @@ -227,8 +227,14 @@ void die(const char *str, struct pt_regs *regs, long err)
> >
> > raw_spin_unlock_irqrestore(&die_lock, flags);
> >
> > - if (ret != NOTIFY_STOP)
> > + if (ret != NOTIFY_STOP) {
> > +#ifdef CONFIG_CONTEXT_TRACKING_IDLE
> > + long nmi_nesting = ct_nmi_nesting();
> > + if (nmi_nesting && nmi_nesting != CT_NESTING_IRQ_NONIDLE)
> > + ct_nmi_exit();
> > +#endif
>
> tries to undo the context-tracking when we realise we're going to kill
> the task, which feels like a hack.
Even if her patches are applied,
I think this problem still exists as long as arm64_enter_el1_dbg() calls ct_nmi_enter().
I agree it's a hacky way of handling die() for a kernel task in a debug
exception, since a user task would just be killed via a signal.
However, unless arm64_enter_el1_dbg() stops calling ct_nmi_enter(),
in my narrow view this seems the best option...
--
Sincerely,
Yeoreum Yun
On Mon, Jun 02, 2025 at 03:54:19PM +0100, Yeoreum Yun wrote:
> Hi Will,
>
> > [+Ada]
> >
> > On Fri, May 30, 2025 at 10:27:23AM +0100, Yeoreum Yun wrote:
> > > When a kernel thread hits BUG() outside of an interrupt handler and
> > > panic_on_oops is not set, it exits via make_task_dead(), which is called by die().
> > > In this case, the nmi_nesting value in context_tracking becomes
> > > inconsistent because the proper context tracking exit functions are not reached.
> > >
> > > Here’s an example scenario on arm64:
> > > 1. A kernel thread hits the BUG() macro outside an interrupt handler,
> > > and panic_on_oops is not set (ct->nmi_nesting == CT_NESTING_IRQ_NONIDLE).
> > >
> > > 2. The exception handler jumps to el1_dbg() and calls arm64_enter_el1_dbg(),
> > > which invokes ct_nmi_enter(). (ct->nmi_nesting == CT_NESTING_IRQ_NONIDLE + 2)
> > >
> > > 3. bug_handler() runs, and if the bug type is BUG_TRAP_TYPE_BUG, it calls die().
> > >
> > > 4. die() then calls make_task_dead(), which terminates the kernel thread and
> > > schedules another task—assume this is the idle_task.
> >
> > This sounds like there is a genuine imbalance, then, as we're scheduling
> > in the context of an exception taken from EL1.
>
> TBH, this "scheduling" is called in do_exit() to kill BUG()
> happend task:
>
> el1_dbg()
> -> arm64_enter_el1_dbg()
> -> do_debug_exception()
> -> die()
> -> make_task_dead
> -> do_exit()
> -> schedule()
> // unreachable
> -> arm64_exit_el1_dbg()
>
> Because arm64_enter_el1_dbg() always call ct_nmi_enter(),
> If do_debug_exception determined to call die(), there is no point to
> call ct_nmi_exit().
One of the reasons we treat BRK as an NMI is that exception entry for
BRK will leave all DAIF bits set, whereas schedule() should be called
with debug and SError unmasked (but IRQ+FIQ masked). Generally, calling
ct_nmi_enter() prevents preemption (and hence calls to schedule()).
Another is that we may have a BUG() or WARN() in entry code where the
task could be in an inconsistent state, and we need to treat the
exception like an NMI to avoid consuming that inconsistent state.
To handle that properly, we need to:
(a) Figure out what to do with entry code. Last I looked I was under the
impression that x86 either didn't have a problem here, or simply
ignored it.
(b) Handle BUG/WARN traps separately from other BRKs, such that we can
use local_daif_inherit(), and treat this as a special function call
rather than an NMI.
(c) Somehow teach make_task_dead() to handle the case where DAIF.D
and/or DAIF.A are set. Most likely we simply have to panic() here,
as with BUG() in interrupt context.
> > > 5. The idle_task attempts to enter the idle state by calling ct_idle_enter().
> > > However, since the current ct->nmi_nesting value is CT_NESTING_IRQ_NONIDLE + 2,
> > > ct_kernel_exit() triggers a WARN_ON_ONCE() warning.
> > >
> > > Because the kernel thread couldn’t call the appropriate context tracking exit
> > > function in step 3, the ct->nmi_nesting value remains incorrect.
> > > This leads to warnings like the following:
> > >
> > > [ 7.133093] ------------[ cut here ]------------
> > > [ 7.133129] WARNING: CPU: 2 PID: 0 at kernel/context_tracking.c:127 ct_kernel
> > > [ 7.134157] Modules linked in:
> > > [ 7.134158] not ok 62 kasan_strings
> > > [ 7.134280]
> > > [ 7.134506] CPU: 2 UID: 0 PID: 0 Comm: swapper/2 Tainted: G B D W N
> > > [ 7.134930] Tainted: [B]=BAD_PAGE, [D]=DIE, [W]=WARN, [N]=TEST
> > > [ 7.135150] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > [ 7.135441] pc : ct_kernel_exit+0xa4/0xb0
> > > [ 7.135656] lr : ct_kernel_exit+0x1c/0xb0
> > > [ 7.135866] sp : ffff8000829bbd90
> > > [ 7.136011] x29: ffff8000829bbd90 x28: ffff80008224ecf0 x27: 0000000000000004
> > > [ 7.136379] x26: ffff80008228ed30 x25: ffff80008228e000 x24: 0000000000000000
> > > [ 7.137016] x23: f3ff000800a52280 x22: 0000000000000000 x21: ffff00087b6c7408
> > > [ 7.137380] x20: ffff80008224b408 x19: 0000000000000005 x18: 0000000000000000
> > > [ 7.137741] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> > > [ 7.311316] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> > > [ 7.311673] x11: 0000000000000000 x10: 0000000000000000 x9 : 4000000000000000
> > > [ 7.312031] x8 : 4000000000000002 x7 : 0000000000000000 x6 : 0000000000000000
> > > [ 7.312387] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
> > > [ 7.312740] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff8007f947c000
> > > [ 7.313096] Call trace:
> > > [ 7.313210] ct_kernel_exit+0xa4/0xb0 (P)
> > > [ 7.313445] ct_idle_enter+0x14/0x28
> > > [ 7.313666] default_idle_call+0x2c/0x60
> > > [ 7.313902] do_idle+0xec/0x320
> > > [ 7.314104] cpu_startup_entry+0x40/0x50
> > > [ 7.314331] secondary_start_kernel+0x120/0x1a0
> > >
> > > This behavior is specific to the arm64 architecture, where ct_nmi_enter()
> > > is called when handling a debug exception.
> > > In contrast, on other architectures, ct_nmi_enter() is not called when
> > > handling BUG().
> > > (i.e) when handling X86_TRAP_UD via handle_invalid_op(), it doesn't call
> > > ct_nmi_enter(). so it doesn’t cause any issues
> > > (While irq_entry_enter() does call ct_nmi_enter() for idle tasks,
> > > that doesn’t apply to debug exception handling).
> >
> > It sounds like you're suggesting that we don't update the
> > context-tracking NMI state for BRK exceptions from EL1, to align
> > with x86.
>
> If el1_dbg() doesn't be called in idle_task(),
> I think it doesn't need to call ct_nmi_enter() in arm64_enter_el1_debug()
> since its nmi_nesting is always >= CT_NESTING_IRQ_NONIDLE and RCU wathcing this cpu.
>
> But, it seems el1_dbg() could be called ct_idle_enter() and ct_idle_exit().
> actually this situation seems be possible in theory when
> some idle code have BUG() -- i.e) cpuidle driver's enter callback have BUG().
> However, this case triggers another quetions. what happen if idle_task was
> killed (I think it seems panic() case...)
>
> So, If arm64_enter_el1_debug() doesn't need to call the ct_nmi_enter()
> instead, __nmi_enter() should be called only for idle_task().
>
> Am I wrong?
As above, I do not think that this is sufficient.
> > I think Ada's pending series might make that easier, but then
> > the patch you propose:
> >
> > > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > > index 529cff825531..9cf03b9ce691 100644
> > > --- a/arch/arm64/kernel/traps.c
> > > +++ b/arch/arm64/kernel/traps.c
> > > @@ -227,8 +227,14 @@ void die(const char *str, struct pt_regs *regs, long err)
> > >
> > > raw_spin_unlock_irqrestore(&die_lock, flags);
> > >
> > > - if (ret != NOTIFY_STOP)
> > > + if (ret != NOTIFY_STOP) {
> > > +#ifdef CONFIG_CONTEXT_TRACKING_IDLE
> > > + long nmi_nesting = ct_nmi_nesting();
> > > + if (nmi_nesting && nmi_nesting != CT_NESTING_IRQ_NONIDLE)
> > > + ct_nmi_exit();
> > > +#endif
> >
> > tries to undo the context-tracking when we realise we're going to kill
> > the task, which feels like a hack.
>
> Although her patches is applied,
> I think this problem still exist if arm64_enter_el1_dbg() calls ct_nmi_enter().
The idea is that Ada's series will make it *possible* to handle this
correctly.
> I agree it's a hacky way for handling kernel task die() in debug
> exception since in case of user task will be killed via signal.
> However, unless arm64_enter_el1_dbg() calls ct_nmi_enter(),
> In my narrow view, it seems the best...
As-is, I think an extra warning in the case of a BUG() is fine given
the larger functional issues.
I do not think this patch is correct as-is.
Mark.
Hi Mark,
> > Hi Will,
> >
> > > [+Ada]
> > >
> > > On Fri, May 30, 2025 at 10:27:23AM +0100, Yeoreum Yun wrote:
> > > > When a kernel thread hits BUG() outside of an interrupt handler and
> > > > panic_on_oops is not set, it exits via make_task_dead(), which is called by die().
> > > > In this case, the nmi_nesting value in context_tracking becomes
> > > > inconsistent because the proper context tracking exit functions are not reached.
> > > >
> > > > Here’s an example scenario on arm64:
> > > > 1. A kernel thread hits the BUG() macro outside an interrupt handler,
> > > > and panic_on_oops is not set (ct->nmi_nesting == CT_NESTING_IRQ_NONIDLE).
> > > >
> > > > 2. The exception handler jumps to el1_dbg() and calls arm64_enter_el1_dbg(),
> > > > which invokes ct_nmi_enter(). (ct->nmi_nesting == CT_NESTING_IRQ_NONIDLE + 2)
> > > >
> > > > 3. bug_handler() runs, and if the bug type is BUG_TRAP_TYPE_BUG, it calls die().
> > > >
> > > > 4. die() then calls make_task_dead(), which terminates the kernel thread and
> > > > schedules another task—assume this is the idle_task.
> > >
> > > This sounds like there is a genuine imbalance, then, as we're scheduling
> > > in the context of an exception taken from EL1.
> >
> > TBH, this "scheduling" is called in do_exit() to kill BUG()
> > happend task:
> >
> > el1_dbg()
> > -> arm64_enter_el1_dbg()
> > -> do_debug_exception()
> > -> die()
> > -> make_task_dead
> > -> do_exit()
> > -> schedule()
> > // unreachable
> > -> arm64_exit_el1_dbg()
> >
> > Because arm64_enter_el1_dbg() always call ct_nmi_enter(),
> > If do_debug_exception determined to call die(), there is no point to
> > call ct_nmi_exit().
>
> One of the reasons we treak BRK as an NMI is that exception entry for
> BRK will leave all DAIF bits set, whereas schedule() should be called
> with debug and SError unmasked (but IRQ+FIQ masked). Generally, calling
> ct_nmi_enter() prevents preemption (and hence calls to schedule()).
I think ct_nmi_enter() doesn't prevent preemption; rather, it is
debug_exception_enter() that disables preemption.
> Another is that we may have a BUG() or WARN() in entry code where the
> task could be in an inconsistent state, and we need to treat the
> exception like an NMI to avoid consuming that inconsistent state.
So, let's think about an "inconsistent" state like this:
-> el0_enter()
-> enter_from_user_mode()
-> BUG()/WARN() hit before ct_state (context_tracking.state) is updated
-> el1_dbg()
el1_dbg() needs to call ct_nmi_enter() in that case, right?
> To handle that properly, we need to:
>
> (a) Figure out what to do with entry code. Last I looked I was under the
> impression that x86 either didn't have a problem here, or simply
> ignored it.
TBH, in the above case, x86's context_tracking.state seems like it would be broken...
>
> (b) Handle BUG/WARN traps separately from other BRKs, such that we can
> use local_daif_inherit(), and treat this as a special function call
> rather than an NMI.
>
> (c) Somehow teach make_task_dead() to handle the case where DAIF.D
> and/or DAIF.A are set. Most likely we simply have to panic() here,
> as with BUG() in interrupt context.
Right... The DAIF.D and DAIF.A bits should be handled...
>
> > > > 5. The idle_task attempts to enter the idle state by calling ct_idle_enter().
> > > > However, since the current ct->nmi_nesting value is CT_NESTING_IRQ_NONIDLE + 2,
> > > > ct_kernel_exit() triggers a WARN_ON_ONCE() warning.
> > > >
> > > > Because the kernel thread couldn’t call the appropriate context tracking exit
> > > > function in step 3, the ct->nmi_nesting value remains incorrect.
> > > > This leads to warnings like the following:
> > > >
> > > > [ 7.133093] ------------[ cut here ]------------
> > > > [ 7.133129] WARNING: CPU: 2 PID: 0 at kernel/context_tracking.c:127 ct_kernel
> > > > [ 7.134157] Modules linked in:
> > > > [ 7.134158] not ok 62 kasan_strings
> > > > [ 7.134280]
> > > > [ 7.134506] CPU: 2 UID: 0 PID: 0 Comm: swapper/2 Tainted: G B D W N
> > > > [ 7.134930] Tainted: [B]=BAD_PAGE, [D]=DIE, [W]=WARN, [N]=TEST
> > > > [ 7.135150] pstate: 204003c5 (nzCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > [ 7.135441] pc : ct_kernel_exit+0xa4/0xb0
> > > > [ 7.135656] lr : ct_kernel_exit+0x1c/0xb0
> > > > [ 7.135866] sp : ffff8000829bbd90
> > > > [ 7.136011] x29: ffff8000829bbd90 x28: ffff80008224ecf0 x27: 0000000000000004
> > > > [ 7.136379] x26: ffff80008228ed30 x25: ffff80008228e000 x24: 0000000000000000
> > > > [ 7.137016] x23: f3ff000800a52280 x22: 0000000000000000 x21: ffff00087b6c7408
> > > > [ 7.137380] x20: ffff80008224b408 x19: 0000000000000005 x18: 0000000000000000
> > > > [ 7.137741] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
> > > > [ 7.311316] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
> > > > [ 7.311673] x11: 0000000000000000 x10: 0000000000000000 x9 : 4000000000000000
> > > > [ 7.312031] x8 : 4000000000000002 x7 : 0000000000000000 x6 : 0000000000000000
> > > > [ 7.312387] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > [ 7.312740] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff8007f947c000
> > > > [ 7.313096] Call trace:
> > > > [ 7.313210] ct_kernel_exit+0xa4/0xb0 (P)
> > > > [ 7.313445] ct_idle_enter+0x14/0x28
> > > > [ 7.313666] default_idle_call+0x2c/0x60
> > > > [ 7.313902] do_idle+0xec/0x320
> > > > [ 7.314104] cpu_startup_entry+0x40/0x50
> > > > [ 7.314331] secondary_start_kernel+0x120/0x1a0
> > > >
> > > > This behavior is specific to the arm64 architecture, where ct_nmi_enter()
> > > > is called when handling a debug exception.
> > > > In contrast, on other architectures, ct_nmi_enter() is not called when
> > > > handling BUG().
> > > > (i.e) when handling X86_TRAP_UD via handle_invalid_op(), it doesn't call
> > > > ct_nmi_enter(). so it doesn’t cause any issues
> > > > (While irq_entry_enter() does call ct_nmi_enter() for idle tasks,
> > > > that doesn’t apply to debug exception handling).
> > >
> > > It sounds like you're suggesting that we don't update the
> > > context-tracking NMI state for BRK exceptions from EL1, to align
> > > with x86.
> >
> > If el1_dbg() doesn't be called in idle_task(),
> > I think it doesn't need to call ct_nmi_enter() in arm64_enter_el1_debug()
> > since its nmi_nesting is always >= CT_NESTING_IRQ_NONIDLE and RCU wathcing this cpu.
> >
> > But, it seems el1_dbg() could be called ct_idle_enter() and ct_idle_exit().
> > actually this situation seems be possible in theory when
> > some idle code have BUG() -- i.e) cpuidle driver's enter callback have BUG().
> > However, this case triggers another quetions. what happen if idle_task was
> > killed (I think it seems panic() case...)
> >
> > So, If arm64_enter_el1_debug() doesn't need to call the ct_nmi_enter()
> > instead, __nmi_enter() should be called only for idle_task().
> >
> > Am I wrong?
>
> As above, I do not think that this is sufficient.
>
> > > I think Ada's pending series might make that easier, but then
> > > the patch you propose:
> > >
> > > > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > > > index 529cff825531..9cf03b9ce691 100644
> > > > --- a/arch/arm64/kernel/traps.c
> > > > +++ b/arch/arm64/kernel/traps.c
> > > > @@ -227,8 +227,14 @@ void die(const char *str, struct pt_regs *regs, long err)
> > > >
> > > > raw_spin_unlock_irqrestore(&die_lock, flags);
> > > >
> > > > - if (ret != NOTIFY_STOP)
> > > > + if (ret != NOTIFY_STOP) {
> > > > +#ifdef CONFIG_CONTEXT_TRACKING_IDLE
> > > > + long nmi_nesting = ct_nmi_nesting();
> > > > + if (nmi_nesting && nmi_nesting != CT_NESTING_IRQ_NONIDLE)
> > > > + ct_nmi_exit();
> > > > +#endif
> > >
> > > tries to undo the context-tracking when we realise we're going to kill
> > > the task, which feels like a hack.
> >
> > Although her patches is applied,
> > I think this problem still exist if arm64_enter_el1_dbg() calls ct_nmi_enter().
>
> The idea is that Ada's series will make it *possible* to handle this
> correctly.
>
> > I agree it's a hacky way for handling kernel task die() in debug
> > exception since in case of user task will be killed via signal.
> > However, unless arm64_enter_el1_dbg() calls ct_nmi_enter(),
> > In my narrow view, it seems the best...
>
> As-is, I think an extra warning in the case of a BUG() is fine given
> the larger functional issues.
>
> I do not think this patch is correct as-is.
So, what I think is:
1. arm64_enter_el1_dbg() should keep calling ct_nmi_enter() as it does today.
2. In bug_handler(), while handling the BUG type, add the conditional
ct_nmi_exit() call from my patch above (see the sketch after this list).
3. Handle DAIF.D and DAIF.A.
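To make step 2 concrete, something like the following helper is what I have in
mind (illustration only, not a proposed patch; the helper name is made up and
it just reuses the condition from my earlier diff, called from bug_handler()
right before die()):

static void unwind_unbalanced_nmi_nesting(void)
{
#ifdef CONFIG_CONTEXT_TRACKING_IDLE
	long nmi_nesting = ct_nmi_nesting();

	/* Only unwind if an unbalanced ct_nmi_enter() is outstanding. */
	if (nmi_nesting && nmi_nesting != CT_NESTING_IRQ_NONIDLE)
		ct_nmi_exit();
#endif
}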
Is anything missing?
Thanks!
--
Sincerely,
Yeoreum Yun
On Mon, Jun 02, 2025 at 06:50:53PM +0100, Yeoreum Yun wrote:
> > One of the reasons we treat BRK as an NMI is that exception entry for
> > BRK will leave all DAIF bits set, whereas schedule() should be called
> > with debug and SError unmasked (but IRQ+FIQ masked). Generally, calling
> > ct_nmi_enter() prevents preemption (and hence calls to schedule()).
>
> I think ct_nmi_enter() doesn't prevent preemption but
> debug_exception_enter() disables preemption.

Yep, sorry for the confusion there. I had erroneously pattern-matched on
the nmi_nesting values and I had confused that with the similar
manipulation of the preempt count.

> > Another is that we may have a BUG() or WARN() in entry code where the
> > task could be in an inconsistent state, and we need to treat the
> > exception like an NMI to avoid consuming that inconsistent state.
>
> So, let's think the "inconsistent" state like:
> -> el0_enter()
> -> enter_from_user_mode()
> -> before update ct_state (context_tracking.state), call BUG()/WARN()
> -> el1_dbg()
>
> It need to call ct_nmi_enter() in el1_dbg() right?

Yes. The critical things are that RCU may not be watching, and all other
entry accounting may be in an intermediate/inconsistent state, since the
BUG()/WARN() could be anywhere in that C code. Currently that means we
must call ct_nmi_enter().

The other problem to bear in mind is that we don't have a way to
distinguish these BUG()/WARN() cases from others throughout the kernel,
which is why we currently unconditionally treat this as an NMI entry.

> > To handle that properly, we need to:
> >
> > (a) Figure out what to do with entry code. Last I looked I was under the
> > impression that x86 either didn't have a problem here, or simply
> > ignored it.
>
> TBH, in above case, x86 seems context_tracking.state will be broken...

That's certainly possible, that was the impression I had last time I
looked, but I haven't looked at this in detail for a short while, and I
may have missed something.

> > (b) Handle BUG/WARN traps separately from other BRKs, such that we can
> > use local_daif_inherit(), and treat this as a special function call
> > rather than an NMI.
> >
> > (c) Somehow teach make_task_dead() to handle the case where DAIF.D
> > and/or DAIF.A are set. Most likely we simply have to panic() here,
> > as with BUG() in interrupt context.
>
> Right... It should handle for DAIF.D and DAIF.A bits...

Yes.

[...]

> > As-is, I think an extra warning in the case of a BUG() is fine given
> > the larger functional issues.
> >
> > I do not think this patch is correct as-is.
>
> So, what I think:
> 1. arm64_enter_el1_dbg() should ct_nmi_enter() as it is.
> 2. in bug_handler() while handling BUG_TYPE, add above ct_nmi_exit()
> conditional call.
> 3. DAIF.D and DAIF.A handling.

No, that is not safe. In step 2, calling ct_nmi_exit() would undo *all*
of the ct_nmi_enter() logic, and may stop RCU from watching if the
exception was entered from some intermediate/inconsistent state.

If we want to change anything now, it should be the DAIF.DA handling,
but even for that I'm not sure what the best approach is, and that'll
require some changes to core code.

Please leave this as-is for now.

Mark.
Hi Mark,

> On Mon, Jun 02, 2025 at 06:50:53PM +0100, Yeoreum Yun wrote:
> > > One of the reasons we treat BRK as an NMI is that exception entry for
> > > BRK will leave all DAIF bits set, whereas schedule() should be called
> > > with debug and SError unmasked (but IRQ+FIQ masked). Generally, calling
> > > ct_nmi_enter() prevents preemption (and hence calls to schedule()).
> >
> > I think ct_nmi_enter() doesn't prevent preemption but
> > debug_exception_enter() disables preemption.
>
> Yep, sorry for the confusion there. I had erroneously pattern-matched on
> the nmi_nesting values and I had confused that with the similar
> manipulation of the preempt count.
>
> > > Another is that we may have a BUG() or WARN() in entry code where the
> > > task could be in an inconsistent state, and we need to treat the
> > > exception like an NMI to avoid consuming that inconsistent state.
> >
> > So, let's think the "inconsistent" state like:
> > -> el0_enter()
> > -> enter_from_user_mode()
> > -> before update ct_state (context_tracking.state), call BUG()/WARN()
> > -> el1_dbg()
> >
> > It need to call ct_nmi_enter() in el1_dbg() right?
>
> Yes. The critical things are that RCU may not be watching, and all other
> entry accounting may be in an intermediate/inconsistent state, since the
> BUG()/WARN() could be anywhere in that C code. Currently that means we
> must call ct_nmi_enter().
>
> The other problem to bear in mind is that we don't have a way to
> distinguish these BUG()/WARN() cases from others throughout the kernel,
> which is why we currently unconditionally treat this as an NMI entry.
>
> > > To handle that properly, we need to:
> > >
> > > (a) Figure out what to do with entry code. Last I looked I was under the
> > > impression that x86 either didn't have a problem here, or simply
> > > ignored it.
> >
> > TBH, in above case, x86 seems context_tracking.state will be broken...
>
> That's certainly possible, that was the impression I had last time I
> looked, but I haven't looked at this in detail for a short while, and I
> may have missed something.
>
> > > (b) Handle BUG/WARN traps separately from other BRKs, such that we can
> > > use local_daif_inherit(), and treat this as a special function call
> > > rather than an NMI.
> > >
> > > (c) Somehow teach make_task_dead() to handle the case where DAIF.D
> > > and/or DAIF.A are set. Most likely we simply have to panic() here,
> > > as with BUG() in interrupt context.
> >
> > Right... It should handle for DAIF.D and DAIF.A bits...
>
> Yes.
>
> [...]

Thanks for the clarification :D

> > > As-is, I think an extra warning in the case of a BUG() is fine given
> > > the larger functional issues.
> > >
> > > I do not think this patch is correct as-is.
> >
> > So, what I think:
> > 1. arm64_enter_el1_dbg() should ct_nmi_enter() as it is.
> > 2. in bug_handler() while handling BUG_TYPE, add above ct_nmi_exit()
> > conditional call.
> > 3. DAIF.D and DAIF.A handling.
>
> No, that is not safe. In step 2, calling ct_nmi_exit() would undo *all*
> of the ct_nmi_enter() logic, and may stop RCU from watching if the
> exception was entered from some intermediate/inconsistent state.

Yes, if it were called without any condition. But what I meant is the
condition check I posted: if nmi_nesting is CT_NESTING_IRQ_NONIDLE, the
call isn't needed and that CPU can still be watched by RCU.

> If we want to change anything now, it should be the DAIF.DA handling,
> but even for that I'm not sure what the best approach is, and that'll
> require some changes to core code.
>
> Please leave this as-is for now.
>

Not now, but I'll wait for Ada's patches to be merged and then talk with
you again, please.

Thanks for your confirmation again!

--
Sincerely,
Yeoreum Yun
On Tue, Jun 03, 2025 at 12:14:18PM +0100, Yeoreum Yun wrote:
> > On Mon, Jun 02, 2025 at 06:50:53PM +0100, Yeoreum Yun wrote:
> > > So, what I think:
> > > 1. arm64_enter_el1_dbg() should ct_nmi_enter() as it is.
> > > 2. in bug_handler() while handling BUG_TYPE, add above ct_nmi_exit()
> > > conditional call.
> > > 3. DAIF.D and DAIF.A handling.
> >
> > No, that is not safe. In step 2, calling ct_nmi_exit() would undo *all*
> > of the ct_nmi_enter() logic, and may stop RCU from watching if the
> > exception was entered from some intermediate/inconsistent state.
>
> Yes if call ct_nmi_enter() without condition.
> But I imply with the condition check what I posted.
> if CT_NESTING_IRQ_NONIDLE,
> it wouldn't need call and that cpu can be watched by RCU.

I am not keen on conditionally calling ct_nmi_exit(), and would strongly
prefer to avoid that, regardless of where that lives in the flow.

I suspect that it would be better to triage the interrupted context
earlier, and rethink the way entry/exit works, but that's a much larger
bit of work and will take more thinking.

Mark.
Hi Mark,

> On Tue, Jun 03, 2025 at 12:14:18PM +0100, Yeoreum Yun wrote:
> > > On Mon, Jun 02, 2025 at 06:50:53PM +0100, Yeoreum Yun wrote:
> > > > So, what I think:
> > > > 1. arm64_enter_el1_dbg() should ct_nmi_enter() as it is.
> > > > 2. in bug_handler() while handling BUG_TYPE, add above ct_nmi_exit()
> > > > conditional call.
> > > > 3. DAIF.D and DAIF.A handling.
> > >
> > > No, that is not safe. In step 2, calling ct_nmi_exit() would undo *all*
> > > of the ct_nmi_enter() logic, and may stop RCU from watching if the
> > > exception was entered from some intermediate/inconsistent state.
> >
> > Yes if call ct_nmi_enter() without condition.
> > But I imply with the condition check what I posted.
> > if CT_NESTING_IRQ_NONIDLE,
> > it wouldn't need call and that cpu can be watched by RCU.
>
> I am not keen on conditionally calling ct_nmi_exit(), and would strongly
> prefer to avoid that, regardless of where that lives in the flow.
>
> I suspect that it would be better to triage the interrupted context
> earlier, and rethink the way entry/exit works, but that's a much larger
> bit of work and will take more thinking.

Thanks for sharing your thoughts. I'll think about it and raise it again
after Ada's patchset is merged.

--
Sincerely,
Yeoreum Yun