[PATCH] exit: skip IRQ disabled warning during power off

Breno Leitao posted 1 patch 10 months, 1 week ago
kernel/exit.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
[PATCH] exit: skip IRQ disabled warning during power off
Posted by Breno Leitao 10 months, 1 week ago
When the system is shutting down due to pid 1 exiting, which is common
on virtual machines, a warning message is printed.

	WARNING: CPU: 0 PID: 1 at kernel/exit.c:897 do_exit+0x7e3/0xab0

This occurs because do_exit() is called after kernel_power_off(), which
disables interrupts. native_machine_shutdown() expliclty disable
interrupt to avoid receiving the timer interrupt, forcing scheduler load
balance during the power off phase.

This is the simplified code path:

	kernel_power_off()
	  - native_machine_shutdown()
	  	- local_irq_disable()
	do_exit()

Modify the warning condition in do_exit() to only trigger the warning if
the system is not powering off, since it is expected to have the irq
disabled in that case.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/exit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 3485e5fc499e4..97ec4f8bfd98f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -878,7 +878,7 @@ void __noreturn do_exit(long code)
 	struct task_struct *tsk = current;
 	int group_dead;
 
-	WARN_ON(irqs_disabled());
+	WARN_ON(irqs_disabled() && system_state != SYSTEM_POWER_OFF);
 
 	synchronize_group_exit(tsk, code);
 

---
base-commit: bfc17c1658353f22843c7c13e27c2d31950f1887
change-id: 20250403-exit-bf815877a1cd

Best regards,
-- 
Breno Leitao <leitao@debian.org>
Re: [PATCH] exit: skip IRQ disabled warning during power off
Posted by Mateusz Guzik 10 months, 1 week ago
On Thu, Apr 3, 2025 at 8:01 PM Breno Leitao <leitao@debian.org> wrote:
>
> When the system is shutting down due to pid 1 exiting, which is common
> on virtual machines, a warning message is printed.
>
>         WARNING: CPU: 0 PID: 1 at kernel/exit.c:897 do_exit+0x7e3/0xab0
>
> This occurs because do_exit() is called after kernel_power_off(), which
> disables interrupts. native_machine_shutdown() expliclty disable
> interrupt to avoid receiving the timer interrupt, forcing scheduler load
> balance during the power off phase.
>
> This is the simplified code path:
>
>         kernel_power_off()
>           - native_machine_shutdown()
>                 - local_irq_disable()
>         do_exit()
>
> Modify the warning condition in do_exit() to only trigger the warning if
> the system is not powering off, since it is expected to have the irq
> disabled in that case.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  kernel/exit.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 3485e5fc499e4..97ec4f8bfd98f 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -878,7 +878,7 @@ void __noreturn do_exit(long code)
>         struct task_struct *tsk = current;
>         int group_dead;
>
> -       WARN_ON(irqs_disabled());
> +       WARN_ON(irqs_disabled() && system_state != SYSTEM_POWER_OFF);
>
>         synchronize_group_exit(tsk, code);
>
>

Can you share the backtrace?

Note first thing synchronize_group_exit() is going to do is cycle
through an irq-protected lock, so by the time it unlocks irqs are
enabled again.

Preferably whatever the code path which ends up here would sort it out.

If that's not feasible, I think this warrants a comment above the warn.

-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH] exit: skip IRQ disabled warning during power off
Posted by Breno Leitao 10 months, 1 week ago
Hello Mateusz,


On Fri, Apr 04, 2025 at 07:40:45AM +0200, Mateusz Guzik wrote:
> On Thu, Apr 3, 2025 at 8:01 PM Breno Leitao <leitao@debian.org> wrote:
> >
> > When the system is shutting down due to pid 1 exiting, which is common
> > on virtual machines, a warning message is printed.
> >
> >         WARNING: CPU: 0 PID: 1 at kernel/exit.c:897 do_exit+0x7e3/0xab0
> >
> > This occurs because do_exit() is called after kernel_power_off(), which
> > disables interrupts. native_machine_shutdown() expliclty disable
> > interrupt to avoid receiving the timer interrupt, forcing scheduler load
> > balance during the power off phase.
> >
> > This is the simplified code path:
> >
> >         kernel_power_off()
> >           - native_machine_shutdown()
> >                 - local_irq_disable()
> >         do_exit()
> >
> > Modify the warning condition in do_exit() to only trigger the warning if
> > the system is not powering off, since it is expected to have the irq
> > disabled in that case.
> >
> > Signed-off-by: Breno Leitao <leitao@debian.org>
> > ---
> >  kernel/exit.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index 3485e5fc499e4..97ec4f8bfd98f 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -878,7 +878,7 @@ void __noreturn do_exit(long code)
> >         struct task_struct *tsk = current;
> >         int group_dead;
> >
> > -       WARN_ON(irqs_disabled());
> > +       WARN_ON(irqs_disabled() && system_state != SYSTEM_POWER_OFF);
> >
> >         synchronize_group_exit(tsk, code);
> >
> >
> 
> Can you share the backtrace?

Sure. Here is the decoded stack from the the latest net-next
0907e7fb35756  ("Add linux-next specific files for 20250117")

	[  254.466712] ACPI: PM: Preparing to enter system sleep state S5
	[  254.474273] reboot: Power down
	[  254.479332] ------------[ cut here ]------------
	[  254.479934] WARNING: CPU: 0 PID: 1 at kernel/exit.c:881 do_exit (kernel/exit.c:881)
	[  254.480597] Modules linked in: evdev(E) serio_raw(E) button(E) virtio_mmio(E) 9pnet_virtio(E) 9p(E) 9pnet(E) netfs(E)
	[  254.483163] Tainted: [E]=UNSIGNED_MODULE, [N]=TEST
	[  254.483736] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
	[  254.484912] RIP: 0010:do_exit (kernel/exit.c:881)
	[ 254.485348] Code: 00 00 45 31 f6 f7 c3 00 02 00 00 41 0f 94 c6 48 c7 c7 48 8b 5d 87 44 89 f6 31 d2 31 c9 e8 c8 c5 41 00 f7 c3 00 02 00 00 75 02 <0f> 0b 48 c7 c7 78 8b 5d 87 44 89 f6 31 d2 31 c9 e8 ab c5 41 00 48
	All code
	========
	0:   00 00                   add    %al,(%rax)
	2:   45 31 f6                xor    %r14d,%r14d
	5:   f7 c3 00 02 00 00       test   $0x200,%ebx
	b:   41 0f 94 c6             sete   %r14b
	f:   48 c7 c7 48 8b 5d 87    mov    $0xffffffff875d8b48,%rdi
	16:   44 89 f6                mov    %r14d,%esi
	19:   31 d2                   xor    %edx,%edx
	1b:   31 c9                   xor    %ecx,%ecx
	1d:   e8 c8 c5 41 00          call   0x41c5ea
	22:   f7 c3 00 02 00 00       test   $0x200,%ebx
	28:   75 02                   jne    0x2c
	2a:*  0f 0b                   ud2             <-- trapping instruction
	2c:   48 c7 c7 78 8b 5d 87    mov    $0xffffffff875d8b78,%rdi
	33:   44 89 f6                mov    %r14d,%esi
	36:   31 d2                   xor    %edx,%edx
	38:   31 c9                   xor    %ecx,%ecx
	3a:   e8 ab c5 41 00          call   0x41c5ea
	3f:   48                      rex.W

	Code starting with the faulting instruction
	===========================================
	0:   0f 0b                   ud2
	2:   48 c7 c7 78 8b 5d 87    mov    $0xffffffff875d8b78,%rdi
	9:   44 89 f6                mov    %r14d,%esi
	c:   31 d2                   xor    %edx,%edx
	e:   31 c9                   xor    %ecx,%ecx
	10:   e8 ab c5 41 00          call   0x41c5c0
	15:   48                      rex.W
	[  254.487377] RSP: 0018:ffa000000001fb80 EFLAGS: 00010046
	[  254.487947] RAX: 49739800a6a6bf00 RBX: 0000000000000016 RCX: 0000000000000000
	[  254.488735] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff875d8b48
	[  254.489545] RBP: ffa000000001fd10 R08: dffffc0000000000 R09: 1ffffffff13312b6
	[  254.490406] R10: dffffc0000000000 R11: fffffbfff13312b7 R12: 000000004321fedc
	[  254.491330] R13: dffffc0000000000 R14: 0000000000000001 R15: dffffc0000000000
	[  254.492200] FS:  00007f658bad6780(0000) GS:ff110004c6000000(0000) knlGS:0000000000000000
	[  254.493005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	[  254.493667] CR2: 00007f7769811000 CR3: 000000010eebe004 CR4: 0000000000771ef0
	[  254.494555] PKRU: 55555554
	[  254.495006] Call Trace:
	[  254.495305]  <TASK>
	[  254.495609] ? __warn (kernel/panic.c:242 kernel/panic.c:748)
	[  254.496046] ? do_exit (kernel/exit.c:881)
	[  254.496675] ? do_exit (kernel/exit.c:881)
	[  254.497124] ? report_bug (lib/bug.c:? lib/bug.c:219)
	[  254.497564] ? handle_bug (arch/x86/kernel/traps.c:285)
	[  254.497984] ? exc_invalid_op (arch/x86/kernel/traps.c:309)
	[  254.498394] ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621)
	[  254.498874] ? do_exit (kernel/exit.c:881)
	[  254.499355] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
	[  254.499892] ? __rcu_read_unlock (kernel/rcu/tree_plugin.h:445)
	[  254.500311] ? atomic_notifier_call_chain (./include/linux/rcupdate.h:337 ./include/linux/rcupdate.h:849 kernel/notifier.c:222)
	[  254.500864] ? __pfx_do_exit (kernel/exit.c:877)
	[  254.501302] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
	[  254.501825] ? native_machine_shutdown (arch/x86/kernel/reboot.c:765)
	[  254.502380] ? atomic_notifier_call_chain (./include/linux/rcupdate.h:337 ./include/linux/rcupdate.h:849 kernel/notifier.c:222)
	[  254.502984] __x64_sys_reboot (kernel/reboot.c:?)
	[  254.503407] ? __pfx___x64_sys_reboot (kernel/reboot.c:722)
	[  254.503952] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
	[  254.504527] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
	[  254.505067] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
	[  254.505599] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
	[  254.506159] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
	[  254.506712] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
	[  254.507263] ? do_syscall_64 (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./include/linux/entry-common.h:198 arch/x86/entry/common.c:79)
	[  254.507865] do_syscall_64 (arch/x86/entry/common.c:83)
	[  254.508317] ? exc_page_fault (arch/x86/mm/fault.c:1542)
	[  254.508781] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
	[  254.509315] RIP: 0033:0x7f658b904a27
	[ 254.509879] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 c1 43 0f 00 f7 d8 64 89 02 b8
	All code
	========
	0:   64 89 01                mov    %eax,%fs:(%rcx)
	3:   48 83 c8 ff             or     $0xffffffffffffffff,%rax
	7:   c3                      ret
	8:   66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
	f:   00 00 00
	12:   90                      nop
	13:   f3 0f 1e fa             endbr64
	17:   89 fa                   mov    %edi,%edx
	19:   be 69 19 12 28          mov    $0x28121969,%esi
	1e:   bf ad de e1 fe          mov    $0xfee1dead,%edi
	23:   b8 a9 00 00 00          mov    $0xa9,%eax
	28:   0f 05                   syscall
	2a:*  48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax         <-- trapping instruction
	30:   77 01                   ja     0x33
	32:   c3                      ret
	33:   48 8b 15 c1 43 0f 00    mov    0xf43c1(%rip),%rdx        # 0xf43fb
	3a:   f7 d8                   neg    %eax
	3c:   64 89 02                mov    %eax,%fs:(%rdx)
	3f:   b8                      .byte 0xb8

	Code starting with the faulting instruction
	===========================================
	0:   48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
	6:   77 01                   ja     0x9
	8:   c3                      ret
	9:   48 8b 15 c1 43 0f 00    mov    0xf43c1(%rip),%rdx        # 0xf43d1
	10:   f7 d8                   neg    %eax
	12:   64 89 02                mov    %eax,%fs:(%rdx)
	15:   b8                      .byte 0xb8
	[  254.511720] RSP: 002b:00007ffec6b608c8 EFLAGS: 00000217 ORIG_RAX: 00000000000000a9
	[  254.512535] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f658b904a27
	[  254.513332] RDX: 000000004321fedc RSI: 0000000028121969 RDI: 00000000fee1dead
	[  254.514243] RBP: 000000000000000a R08: 000055b7f5e58690 R09: 0000000000000000
	[  254.515058] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000000011
	[  254.515833] R13: 00007ffec6b60c88 R14: 000055b7f5e3828c R15: 000000000000000a
	[  254.516702]  </TASK>
	[  254.517021] irq event stamp: 1438412
	[  254.517510] hardirqs last enabled at (1438411): _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:87 ./arch/x86/include/asm/irqflags.h:147 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
	[  254.518554] hardirqs last disabled at (1438412): native_machine_shutdown (arch/x86/kernel/reboot.c:?)
	[  254.519466] softirqs last enabled at (1438324): handle_softirqs (./arch/x86/include/asm/preempt.h:26 kernel/softirq.c:408 kernel/softirq.c:589)
	[  254.520398] softirqs last disabled at (1438303): __irq_exit_rcu (./arch/x86/include/asm/jump_label.h:36 kernel/softirq.c:664)


> Note first thing synchronize_group_exit() is going to do is cycle
> through an irq-protected lock, so by the time it unlocks irqs are
> enabled again.

When pid=1 is being killed, then synchronize_group_exit() will be called
with irq enabled (as shown by the warning above), and
synchronize_group_exit()->spin_unlock_irq() will restore the interrupt
(once it got disabled in spin_lock_irq() pair).

On the other side, if irqs are disabled when synchronize_group_exit() is
called, then synchronize_group_exit->spin_unlock_irq() will not enable
the interrupts, right?

Am I following your comment properly?

> Preferably whatever the code path which ends up here would sort it out.
> 
> If that's not feasible, I think this warrants a comment above the warn.

Sorry, I am not sure what you meant by this. Would you mind rephrasing?

Thanks for the review,
--breno
Re: [PATCH] exit: skip IRQ disabled warning during power off
Posted by Mateusz Guzik 10 months, 1 week ago
On Fri, Apr 4, 2025 at 2:52 PM Breno Leitao <leitao@debian.org> wrote:
>
> Hello Mateusz,
>
>
> On Fri, Apr 04, 2025 at 07:40:45AM +0200, Mateusz Guzik wrote:
> > On Thu, Apr 3, 2025 at 8:01 PM Breno Leitao <leitao@debian.org> wrote:
> > >
> > > When the system is shutting down due to pid 1 exiting, which is common
> > > on virtual machines, a warning message is printed.
> > >
> > >         WARNING: CPU: 0 PID: 1 at kernel/exit.c:897 do_exit+0x7e3/0xab0
> > >
> > > This occurs because do_exit() is called after kernel_power_off(), which
> > > disables interrupts. native_machine_shutdown() expliclty disable
> > > interrupt to avoid receiving the timer interrupt, forcing scheduler load
> > > balance during the power off phase.
> > >
> > > This is the simplified code path:
> > >
> > >         kernel_power_off()
> > >           - native_machine_shutdown()
> > >                 - local_irq_disable()
> > >         do_exit()
> > >
> > > Modify the warning condition in do_exit() to only trigger the warning if
> > > the system is not powering off, since it is expected to have the irq
> > > disabled in that case.
> > >
> > > Signed-off-by: Breno Leitao <leitao@debian.org>
> > > ---
> > >  kernel/exit.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/exit.c b/kernel/exit.c
> > > index 3485e5fc499e4..97ec4f8bfd98f 100644
> > > --- a/kernel/exit.c
> > > +++ b/kernel/exit.c
> > > @@ -878,7 +878,7 @@ void __noreturn do_exit(long code)
> > >         struct task_struct *tsk = current;
> > >         int group_dead;
> > >
> > > -       WARN_ON(irqs_disabled());
> > > +       WARN_ON(irqs_disabled() && system_state != SYSTEM_POWER_OFF);
> > >
> > >         synchronize_group_exit(tsk, code);
> > >
> > >
> >
> > Can you share the backtrace?
>
> Sure. Here is the decoded stack from the the latest net-next
> 0907e7fb35756  ("Add linux-next specific files for 20250117")
>
>         [  254.466712] ACPI: PM: Preparing to enter system sleep state S5
>         [  254.474273] reboot: Power down
>         [  254.479332] ------------[ cut here ]------------
>         [  254.479934] WARNING: CPU: 0 PID: 1 at kernel/exit.c:881 do_exit (kernel/exit.c:881)
>         [  254.480597] Modules linked in: evdev(E) serio_raw(E) button(E) virtio_mmio(E) 9pnet_virtio(E) 9p(E) 9pnet(E) netfs(E)
>         [  254.483163] Tainted: [E]=UNSIGNED_MODULE, [N]=TEST
>         [  254.483736] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>         [  254.484912] RIP: 0010:do_exit (kernel/exit.c:881)
>         [ 254.485348] Code: 00 00 45 31 f6 f7 c3 00 02 00 00 41 0f 94 c6 48 c7 c7 48 8b 5d 87 44 89 f6 31 d2 31 c9 e8 c8 c5 41 00 f7 c3 00 02 00 00 75 02 <0f> 0b 48 c7 c7 78 8b 5d 87 44 89 f6 31 d2 31 c9 e8 ab c5 41 00 48
>         All code
>         ========
>         0:   00 00                   add    %al,(%rax)
>         2:   45 31 f6                xor    %r14d,%r14d
>         5:   f7 c3 00 02 00 00       test   $0x200,%ebx
>         b:   41 0f 94 c6             sete   %r14b
>         f:   48 c7 c7 48 8b 5d 87    mov    $0xffffffff875d8b48,%rdi
>         16:   44 89 f6                mov    %r14d,%esi
>         19:   31 d2                   xor    %edx,%edx
>         1b:   31 c9                   xor    %ecx,%ecx
>         1d:   e8 c8 c5 41 00          call   0x41c5ea
>         22:   f7 c3 00 02 00 00       test   $0x200,%ebx
>         28:   75 02                   jne    0x2c
>         2a:*  0f 0b                   ud2             <-- trapping instruction
>         2c:   48 c7 c7 78 8b 5d 87    mov    $0xffffffff875d8b78,%rdi
>         33:   44 89 f6                mov    %r14d,%esi
>         36:   31 d2                   xor    %edx,%edx
>         38:   31 c9                   xor    %ecx,%ecx
>         3a:   e8 ab c5 41 00          call   0x41c5ea
>         3f:   48                      rex.W
>
>         Code starting with the faulting instruction
>         ===========================================
>         0:   0f 0b                   ud2
>         2:   48 c7 c7 78 8b 5d 87    mov    $0xffffffff875d8b78,%rdi
>         9:   44 89 f6                mov    %r14d,%esi
>         c:   31 d2                   xor    %edx,%edx
>         e:   31 c9                   xor    %ecx,%ecx
>         10:   e8 ab c5 41 00          call   0x41c5c0
>         15:   48                      rex.W
>         [  254.487377] RSP: 0018:ffa000000001fb80 EFLAGS: 00010046
>         [  254.487947] RAX: 49739800a6a6bf00 RBX: 0000000000000016 RCX: 0000000000000000
>         [  254.488735] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff875d8b48
>         [  254.489545] RBP: ffa000000001fd10 R08: dffffc0000000000 R09: 1ffffffff13312b6
>         [  254.490406] R10: dffffc0000000000 R11: fffffbfff13312b7 R12: 000000004321fedc
>         [  254.491330] R13: dffffc0000000000 R14: 0000000000000001 R15: dffffc0000000000
>         [  254.492200] FS:  00007f658bad6780(0000) GS:ff110004c6000000(0000) knlGS:0000000000000000
>         [  254.493005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>         [  254.493667] CR2: 00007f7769811000 CR3: 000000010eebe004 CR4: 0000000000771ef0
>         [  254.494555] PKRU: 55555554
>         [  254.495006] Call Trace:
>         [  254.495305]  <TASK>
>         [  254.495609] ? __warn (kernel/panic.c:242 kernel/panic.c:748)
>         [  254.496046] ? do_exit (kernel/exit.c:881)
>         [  254.496675] ? do_exit (kernel/exit.c:881)
>         [  254.497124] ? report_bug (lib/bug.c:? lib/bug.c:219)
>         [  254.497564] ? handle_bug (arch/x86/kernel/traps.c:285)
>         [  254.497984] ? exc_invalid_op (arch/x86/kernel/traps.c:309)
>         [  254.498394] ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621)
>         [  254.498874] ? do_exit (kernel/exit.c:881)
>         [  254.499355] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
>         [  254.499892] ? __rcu_read_unlock (kernel/rcu/tree_plugin.h:445)
>         [  254.500311] ? atomic_notifier_call_chain (./include/linux/rcupdate.h:337 ./include/linux/rcupdate.h:849 kernel/notifier.c:222)
>         [  254.500864] ? __pfx_do_exit (kernel/exit.c:877)
>         [  254.501302] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
>         [  254.501825] ? native_machine_shutdown (arch/x86/kernel/reboot.c:765)
>         [  254.502380] ? atomic_notifier_call_chain (./include/linux/rcupdate.h:337 ./include/linux/rcupdate.h:849 kernel/notifier.c:222)
>         [  254.502984] __x64_sys_reboot (kernel/reboot.c:?)
>         [  254.503407] ? __pfx___x64_sys_reboot (kernel/reboot.c:722)
>         [  254.503952] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
>         [  254.504527] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
>         [  254.505067] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
>         [  254.505599] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
>         [  254.506159] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
>         [  254.506712] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
>         [  254.507263] ? do_syscall_64 (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./include/linux/entry-common.h:198 arch/x86/entry/common.c:79)
>         [  254.507865] do_syscall_64 (arch/x86/entry/common.c:83)
>         [  254.508317] ? exc_page_fault (arch/x86/mm/fault.c:1542)
>         [  254.508781] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
>         [  254.509315] RIP: 0033:0x7f658b904a27
>         [ 254.509879] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 c1 43 0f 00 f7 d8 64 89 02 b8
>         All code
>         ========
>         0:   64 89 01                mov    %eax,%fs:(%rcx)
>         3:   48 83 c8 ff             or     $0xffffffffffffffff,%rax
>         7:   c3                      ret
>         8:   66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
>         f:   00 00 00
>         12:   90                      nop
>         13:   f3 0f 1e fa             endbr64
>         17:   89 fa                   mov    %edi,%edx
>         19:   be 69 19 12 28          mov    $0x28121969,%esi
>         1e:   bf ad de e1 fe          mov    $0xfee1dead,%edi
>         23:   b8 a9 00 00 00          mov    $0xa9,%eax
>         28:   0f 05                   syscall
>         2a:*  48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax         <-- trapping instruction
>         30:   77 01                   ja     0x33
>         32:   c3                      ret
>         33:   48 8b 15 c1 43 0f 00    mov    0xf43c1(%rip),%rdx        # 0xf43fb
>         3a:   f7 d8                   neg    %eax
>         3c:   64 89 02                mov    %eax,%fs:(%rdx)
>         3f:   b8                      .byte 0xb8
>
>         Code starting with the faulting instruction
>         ===========================================
>         0:   48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
>         6:   77 01                   ja     0x9
>         8:   c3                      ret
>         9:   48 8b 15 c1 43 0f 00    mov    0xf43c1(%rip),%rdx        # 0xf43d1
>         10:   f7 d8                   neg    %eax
>         12:   64 89 02                mov    %eax,%fs:(%rdx)
>         15:   b8                      .byte 0xb8
>         [  254.511720] RSP: 002b:00007ffec6b608c8 EFLAGS: 00000217 ORIG_RAX: 00000000000000a9
>         [  254.512535] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f658b904a27
>         [  254.513332] RDX: 000000004321fedc RSI: 0000000028121969 RDI: 00000000fee1dead
>         [  254.514243] RBP: 000000000000000a R08: 000055b7f5e58690 R09: 0000000000000000
>         [  254.515058] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000000011
>         [  254.515833] R13: 00007ffec6b60c88 R14: 000055b7f5e3828c R15: 000000000000000a
>         [  254.516702]  </TASK>
>         [  254.517021] irq event stamp: 1438412
>         [  254.517510] hardirqs last enabled at (1438411): _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:87 ./arch/x86/include/asm/irqflags.h:147 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
>         [  254.518554] hardirqs last disabled at (1438412): native_machine_shutdown (arch/x86/kernel/reboot.c:?)
>         [  254.519466] softirqs last enabled at (1438324): handle_softirqs (./arch/x86/include/asm/preempt.h:26 kernel/softirq.c:408 kernel/softirq.c:589)
>         [  254.520398] softirqs last disabled at (1438303): __irq_exit_rcu (./arch/x86/include/asm/jump_label.h:36 kernel/softirq.c:664)
>
>
> > Note first thing synchronize_group_exit() is going to do is cycle
> > through an irq-protected lock, so by the time it unlocks irqs are
> > enabled again.
>
> When pid=1 is being killed, then synchronize_group_exit() will be called
> with irq enabled (as shown by the warning above), and
> synchronize_group_exit()->spin_unlock_irq() will restore the interrupt
> (once it got disabled in spin_lock_irq() pair).
>
> On the other side, if irqs are disabled when synchronize_group_exit() is
> called, then synchronize_group_exit->spin_unlock_irq() will not enable
> the interrupts, right?

It will enable interrupts. spin_lock_irq() blindly disables them, but
in your case they are already off and this bit is a nop.
spin_unlock_irq() blindly enables them and this is where they are back
regardless of the initial state.

Thus even with the buggy caller interrupts are enabled after
synchronize_group_exit() returns.

I think this is the only reason there are no further warnings about
the state when proceeding with exit.

Preferably this would be fixed so that nobody calls here with irqs off.

If this is not feasible, a comment explaining why irqs can be off
should be placed above the WARN_ON you are modifying
-- 
Mateusz Guzik <mjguzik gmail.com>
Re: [PATCH] exit: skip IRQ disabled warning during power off
Posted by Oleg Nesterov 10 months, 1 week ago
On 04/04, Mateusz Guzik wrote:
>
> Preferably this would be fixed so that nobody calls here with irqs off.

This is not my area, I can't help, but do you understand why sys_reboot()
needs to call do_exit() ?

Say, why the stupid patch below is wrong?

Oleg.

diff --git a/kernel/reboot.c b/kernel/reboot.c
index ec087827c85c..0d0c8aa3ffcd 100644
--- a/kernel/reboot.c
+++ b/kernel/reboot.c
@@ -777,12 +777,11 @@ SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
 
 	case LINUX_REBOOT_CMD_HALT:
 		kernel_halt();
-		do_exit(0);
+		for (;;) cpu_relax();
 
 	case LINUX_REBOOT_CMD_POWER_OFF:
 		kernel_power_off();
-		do_exit(0);
-		break;
+		for (;;) cpu_relax();
 
 	case LINUX_REBOOT_CMD_RESTART2:
 		ret = strncpy_from_user(&buffer[0], arg, sizeof(buffer) - 1);
Re: [PATCH] exit: skip IRQ disabled warning during power off
Posted by Breno Leitao 10 months, 1 week ago
On Fri, Apr 04, 2025 at 04:20:53PM +0200, Mateusz Guzik wrote:
> On Fri, Apr 4, 2025 at 2:52 PM Breno Leitao <leitao@debian.org> wrote:
> >
> > Hello Mateusz,
> >
> >
> > On Fri, Apr 04, 2025 at 07:40:45AM +0200, Mateusz Guzik wrote:
> > > On Thu, Apr 3, 2025 at 8:01 PM Breno Leitao <leitao@debian.org> wrote:
> > > >
> > > > When the system is shutting down due to pid 1 exiting, which is common
> > > > on virtual machines, a warning message is printed.
> > > >
> > > >         WARNING: CPU: 0 PID: 1 at kernel/exit.c:897 do_exit+0x7e3/0xab0
> > > >
> > > > This occurs because do_exit() is called after kernel_power_off(), which
> > > > disables interrupts. native_machine_shutdown() expliclty disable
> > > > interrupt to avoid receiving the timer interrupt, forcing scheduler load
> > > > balance during the power off phase.
> > > >
> > > > This is the simplified code path:
> > > >
> > > >         kernel_power_off()
> > > >           - native_machine_shutdown()
> > > >                 - local_irq_disable()
> > > >         do_exit()
> > > >
> > > > Modify the warning condition in do_exit() to only trigger the warning if
> > > > the system is not powering off, since it is expected to have the irq
> > > > disabled in that case.
> > > >
> > > > Signed-off-by: Breno Leitao <leitao@debian.org>
> > > > ---
> > > >  kernel/exit.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/kernel/exit.c b/kernel/exit.c
> > > > index 3485e5fc499e4..97ec4f8bfd98f 100644
> > > > --- a/kernel/exit.c
> > > > +++ b/kernel/exit.c
> > > > @@ -878,7 +878,7 @@ void __noreturn do_exit(long code)
> > > >         struct task_struct *tsk = current;
> > > >         int group_dead;
> > > >
> > > > -       WARN_ON(irqs_disabled());
> > > > +       WARN_ON(irqs_disabled() && system_state != SYSTEM_POWER_OFF);
> > > >
> > > >         synchronize_group_exit(tsk, code);
> > > >
> > > >
> > >
> > > Can you share the backtrace?
> >
> > Sure. Here is the decoded stack from the the latest net-next
> > 0907e7fb35756  ("Add linux-next specific files for 20250117")
> >
> >         [  254.466712] ACPI: PM: Preparing to enter system sleep state S5
> >         [  254.474273] reboot: Power down
> >         [  254.479332] ------------[ cut here ]------------
> >         [  254.479934] WARNING: CPU: 0 PID: 1 at kernel/exit.c:881 do_exit (kernel/exit.c:881)
> >         [  254.480597] Modules linked in: evdev(E) serio_raw(E) button(E) virtio_mmio(E) 9pnet_virtio(E) 9p(E) 9pnet(E) netfs(E)
> >         [  254.483163] Tainted: [E]=UNSIGNED_MODULE, [N]=TEST
> >         [  254.483736] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> >         [  254.484912] RIP: 0010:do_exit (kernel/exit.c:881)
> >         [ 254.485348] Code: 00 00 45 31 f6 f7 c3 00 02 00 00 41 0f 94 c6 48 c7 c7 48 8b 5d 87 44 89 f6 31 d2 31 c9 e8 c8 c5 41 00 f7 c3 00 02 00 00 75 02 <0f> 0b 48 c7 c7 78 8b 5d 87 44 89 f6 31 d2 31 c9 e8 ab c5 41 00 48
> >         All code
> >         ========
> >         0:   00 00                   add    %al,(%rax)
> >         2:   45 31 f6                xor    %r14d,%r14d
> >         5:   f7 c3 00 02 00 00       test   $0x200,%ebx
> >         b:   41 0f 94 c6             sete   %r14b
> >         f:   48 c7 c7 48 8b 5d 87    mov    $0xffffffff875d8b48,%rdi
> >         16:   44 89 f6                mov    %r14d,%esi
> >         19:   31 d2                   xor    %edx,%edx
> >         1b:   31 c9                   xor    %ecx,%ecx
> >         1d:   e8 c8 c5 41 00          call   0x41c5ea
> >         22:   f7 c3 00 02 00 00       test   $0x200,%ebx
> >         28:   75 02                   jne    0x2c
> >         2a:*  0f 0b                   ud2             <-- trapping instruction
> >         2c:   48 c7 c7 78 8b 5d 87    mov    $0xffffffff875d8b78,%rdi
> >         33:   44 89 f6                mov    %r14d,%esi
> >         36:   31 d2                   xor    %edx,%edx
> >         38:   31 c9                   xor    %ecx,%ecx
> >         3a:   e8 ab c5 41 00          call   0x41c5ea
> >         3f:   48                      rex.W
> >
> >         Code starting with the faulting instruction
> >         ===========================================
> >         0:   0f 0b                   ud2
> >         2:   48 c7 c7 78 8b 5d 87    mov    $0xffffffff875d8b78,%rdi
> >         9:   44 89 f6                mov    %r14d,%esi
> >         c:   31 d2                   xor    %edx,%edx
> >         e:   31 c9                   xor    %ecx,%ecx
> >         10:   e8 ab c5 41 00          call   0x41c5c0
> >         15:   48                      rex.W
> >         [  254.487377] RSP: 0018:ffa000000001fb80 EFLAGS: 00010046
> >         [  254.487947] RAX: 49739800a6a6bf00 RBX: 0000000000000016 RCX: 0000000000000000
> >         [  254.488735] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff875d8b48
> >         [  254.489545] RBP: ffa000000001fd10 R08: dffffc0000000000 R09: 1ffffffff13312b6
> >         [  254.490406] R10: dffffc0000000000 R11: fffffbfff13312b7 R12: 000000004321fedc
> >         [  254.491330] R13: dffffc0000000000 R14: 0000000000000001 R15: dffffc0000000000
> >         [  254.492200] FS:  00007f658bad6780(0000) GS:ff110004c6000000(0000) knlGS:0000000000000000
> >         [  254.493005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >         [  254.493667] CR2: 00007f7769811000 CR3: 000000010eebe004 CR4: 0000000000771ef0
> >         [  254.494555] PKRU: 55555554
> >         [  254.495006] Call Trace:
> >         [  254.495305]  <TASK>
> >         [  254.495609] ? __warn (kernel/panic.c:242 kernel/panic.c:748)
> >         [  254.496046] ? do_exit (kernel/exit.c:881)
> >         [  254.496675] ? do_exit (kernel/exit.c:881)
> >         [  254.497124] ? report_bug (lib/bug.c:? lib/bug.c:219)
> >         [  254.497564] ? handle_bug (arch/x86/kernel/traps.c:285)
> >         [  254.497984] ? exc_invalid_op (arch/x86/kernel/traps.c:309)
> >         [  254.498394] ? asm_exc_invalid_op (./arch/x86/include/asm/idtentry.h:621)
> >         [  254.498874] ? do_exit (kernel/exit.c:881)
> >         [  254.499355] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
> >         [  254.499892] ? __rcu_read_unlock (kernel/rcu/tree_plugin.h:445)
> >         [  254.500311] ? atomic_notifier_call_chain (./include/linux/rcupdate.h:337 ./include/linux/rcupdate.h:849 kernel/notifier.c:222)
> >         [  254.500864] ? __pfx_do_exit (kernel/exit.c:877)
> >         [  254.501302] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
> >         [  254.501825] ? native_machine_shutdown (arch/x86/kernel/reboot.c:765)
> >         [  254.502380] ? atomic_notifier_call_chain (./include/linux/rcupdate.h:337 ./include/linux/rcupdate.h:849 kernel/notifier.c:222)
> >         [  254.502984] __x64_sys_reboot (kernel/reboot.c:?)
> >         [  254.503407] ? __pfx___x64_sys_reboot (kernel/reboot.c:722)
> >         [  254.503952] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
> >         [  254.504527] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
> >         [  254.505067] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
> >         [  254.505599] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
> >         [  254.506159] ? __pfx_ftrace_likely_update (kernel/trace/trace_branch.c:203)
> >         [  254.506712] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:182)
> >         [  254.507263] ? do_syscall_64 (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:97 ./include/linux/entry-common.h:198 arch/x86/entry/common.c:79)
> >         [  254.507865] do_syscall_64 (arch/x86/entry/common.c:83)
> >         [  254.508317] ? exc_page_fault (arch/x86/mm/fault.c:1542)
> >         [  254.508781] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
> >         [  254.509315] RIP: 0033:0x7f658b904a27
> >         [ 254.509879] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 c1 43 0f 00 f7 d8 64 89 02 b8
> >         All code
> >         ========
> >         0:   64 89 01                mov    %eax,%fs:(%rcx)
> >         3:   48 83 c8 ff             or     $0xffffffffffffffff,%rax
> >         7:   c3                      ret
> >         8:   66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
> >         f:   00 00 00
> >         12:   90                      nop
> >         13:   f3 0f 1e fa             endbr64
> >         17:   89 fa                   mov    %edi,%edx
> >         19:   be 69 19 12 28          mov    $0x28121969,%esi
> >         1e:   bf ad de e1 fe          mov    $0xfee1dead,%edi
> >         23:   b8 a9 00 00 00          mov    $0xa9,%eax
> >         28:   0f 05                   syscall
> >         2a:*  48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax         <-- trapping instruction
> >         30:   77 01                   ja     0x33
> >         32:   c3                      ret
> >         33:   48 8b 15 c1 43 0f 00    mov    0xf43c1(%rip),%rdx        # 0xf43fb
> >         3a:   f7 d8                   neg    %eax
> >         3c:   64 89 02                mov    %eax,%fs:(%rdx)
> >         3f:   b8                      .byte 0xb8
> >
> >         Code starting with the faulting instruction
> >         ===========================================
> >         0:   48 3d 00 f0 ff ff       cmp    $0xfffffffffffff000,%rax
> >         6:   77 01                   ja     0x9
> >         8:   c3                      ret
> >         9:   48 8b 15 c1 43 0f 00    mov    0xf43c1(%rip),%rdx        # 0xf43d1
> >         10:   f7 d8                   neg    %eax
> >         12:   64 89 02                mov    %eax,%fs:(%rdx)
> >         15:   b8                      .byte 0xb8
> >         [  254.511720] RSP: 002b:00007ffec6b608c8 EFLAGS: 00000217 ORIG_RAX: 00000000000000a9
> >         [  254.512535] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f658b904a27
> >         [  254.513332] RDX: 000000004321fedc RSI: 0000000028121969 RDI: 00000000fee1dead
> >         [  254.514243] RBP: 000000000000000a R08: 000055b7f5e58690 R09: 0000000000000000
> >         [  254.515058] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000000011
> >         [  254.515833] R13: 00007ffec6b60c88 R14: 000055b7f5e3828c R15: 000000000000000a
> >         [  254.516702]  </TASK>
> >         [  254.517021] irq event stamp: 1438412
> >         [  254.517510] hardirqs last enabled at (1438411): _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:26 ./arch/x86/include/asm/irqflags.h:87 ./arch/x86/include/asm/irqflags.h:147 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
> >         [  254.518554] hardirqs last disabled at (1438412): native_machine_shutdown (arch/x86/kernel/reboot.c:?)
> >         [  254.519466] softirqs last enabled at (1438324): handle_softirqs (./arch/x86/include/asm/preempt.h:26 kernel/softirq.c:408 kernel/softirq.c:589)
> >         [  254.520398] softirqs last disabled at (1438303): __irq_exit_rcu (./arch/x86/include/asm/jump_label.h:36 kernel/softirq.c:664)
> >
> >
> > > Note first thing synchronize_group_exit() is going to do is cycle
> > > through an irq-protected lock, so by the time it unlocks irqs are
> > > enabled again.
> >
> > When pid=1 is being killed, then synchronize_group_exit() will be called
> > with irq enabled (as shown by the warning above), and
> > synchronize_group_exit()->spin_unlock_irq() will restore the interrupt
> > (once it got disabled in spin_lock_irq() pair).
> >
> > On the other side, if irqs are disabled when synchronize_group_exit() is
> > called, then synchronize_group_exit->spin_unlock_irq() will not enable
> > the interrupts, right?
> 
> It will enable interrupts. spin_lock_irq() blindly disables them, but
> in your case they are already off and this bit is a nop.
> spin_unlock_irq() blindly enables them and this is where they are back
> regardless of the initial state.
> 
> Thus even with the buggy caller interrupts are enabled after
> synchronize_group_exit() returns.

In fact this WARN_ON(irqs_disabled()); doesn't mean much, since it
doesn't matter what is the interrupt state just before
synchronize_group_exit, becaue it will disabled it, and enable it
anyway.

That means that WARN_ON() is useless(!?)

> I think this is the only reason there are no further warnings about
> the state when proceeding with exit.
> 
> Preferably this would be fixed so that nobody calls here with irqs off.

Does it really matter what is the interrupt state when calling do_exit()
since it will be enabled anyway?

Thanks!
--breno
Re: [PATCH] exit: skip IRQ disabled warning during power off
Posted by Oleg Nesterov 10 months, 1 week ago
On 04/04, Breno Leitao wrote:
>
> 	[  254.474273] reboot: Power down
> 	[  254.479332] ------------[ cut here ]------------
> 	[  254.479934] WARNING: CPU: 0 PID: 1 at kernel/exit.c:881 do_exit (kernel/exit.c:881)
                                       ^^^^^^
confused... so it seems that the init task does
sys_reboot(LINUX_REBOOT_CMD_POWER_OFF) and exits?

But if the init task exits for any reason it should trigger

	if (unlikely(is_global_init(tsk)))
		panic("Attempted to kill init! exitcode=0x%08x\n"

below?

OK, it can be multi-threaded, but still this look like we have more
problems than just WARN_ON(irqs_disabled()).

> When pid=1 is being killed, then synchronize_group_exit() will be called
> with irq enabled (as shown by the warning above),

Hmm. synchronize_group_exit() is called right after WARN_ON(irqs_disabled())
you observe, WARN_ON() won't enable irqs?

Oleg.
Re: [PATCH] exit: skip IRQ disabled warning during power off
Posted by Breno Leitao 10 months, 1 week ago
On Fri, Apr 04, 2025 at 04:16:24PM +0200, Oleg Nesterov wrote:
> On 04/04, Breno Leitao wrote:
> >
> > 	[  254.474273] reboot: Power down
> > 	[  254.479332] ------------[ cut here ]------------
> > 	[  254.479934] WARNING: CPU: 0 PID: 1 at kernel/exit.c:881 do_exit (kernel/exit.c:881)
>                                        ^^^^^^
> confused... so it seems that the init task does
> sys_reboot(LINUX_REBOOT_CMD_POWER_OFF) and exits?
> 
> But if the init task exits for any reason it should trigger
> 
> 	if (unlikely(is_global_init(tsk)))
> 		panic("Attempted to kill init! exitcode=0x%08x\n"

That is showing up later (sorry for not decoding it, but I don't think
it is super important)


	[   64.447039] ACPI: PM: Preparing to enter system sleep state S5
	[   64.454964] reboot: Power down
	[   64.460014] ------------[ cut here ]------------
	[   64.460597] WARNING: CPU: 0 PID: 1 at kernel/exit.c:881 do_exit+0xd0/0x2470
	[   64.461600] Modules linked in: evdev(E) serio_raw(E) button(E) virtio_mmio(E) 9pnet_virtio(E) 9p(E) 9pnet(E) netfs(E)
	[   64.462929] CPU: 0 UID: 0 PID: 1 Comm: virtme-ng-init Tainted: G            E    N 6.13.0-rc7-next-20250117-upstream #45
	[   64.464075] Tainted: [E]=UNSIGNED_MODULE, [N]=TEST
	[   64.464683] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
	[   64.465960] RIP: 0010:do_exit+0xd0/0x2470
	[   64.466427] Code: 00 00 45 31 f6 f7 c3 00 02 00 00 41 0f 94 c6 48 c7 c7 48 8b 5d 87 44 89 f6 31 d2 31 c9 e8 c8 c5 41 00 f7 c3 00 02 00 00 75 02 <0f> 0b 48 c7 c7 78 8b 5d 87 44 89 f6 31 d2 31 c9 e8 ab c5 41 00 48
	[   64.468353] RSP: 0018:ffa000000001fb80 EFLAGS: 00010046
	[   64.468949] RAX: b246b26687739d00 RBX: 0000000000000016 RCX: 0000000000000000
	[   64.469832] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff875d8b48
	[   64.470648] RBP: ffa000000001fd10 R08: dffffc0000000000 R09: 1ffffffff13312b6
	[   64.471451] R10: dffffc0000000000 R11: fffffbfff13312b7 R12: 000000004321fedc
	[   64.472320] R13: dffffc0000000000 R14: 0000000000000001 R15: dffffc0000000000
	[   64.473200] FS:  00007ff47a8a0780(0000) GS:ff110004c6000000(0000) knlGS:0000000000000000
	[   64.474141] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	[   64.474841] CR2: 00007f4eb4ba1648 CR3: 000000010dcde004 CR4: 0000000000771ef0
	[   64.475667] PKRU: 55555554
	[   64.476058] Call Trace:
	[   64.476360]  <TASK>
	[   64.476907]  ? __warn+0x130/0x500
	[   64.477358]  ? do_exit+0xd0/0x2470
	[   64.477794]  ? do_exit+0xd0/0x2470
	[   64.478210]  ? report_bug+0x1bd/0x400
	[   64.478675]  ? handle_bug+0x63/0x90
	[   64.479082]  ? exc_invalid_op+0x1a/0x50
	[   64.479498]  ? asm_exc_invalid_op+0x1a/0x20
	[   64.479976]  ? do_exit+0xd0/0x2470
	[   64.480457]  ? srso_alias_return_thunk+0x5/0xfbef5
	[   64.481071]  ? __rcu_read_unlock+0xf3/0x630
	[   64.481513]  ? atomic_notifier_call_chain+0x26/0x110
	[   64.482075]  ? __pfx_do_exit+0x10/0x10
	[   64.482594]  ? __pfx_ftrace_likely_update+0x10/0x10
	[   64.483493]  ? native_machine_shutdown+0x12c/0x1f0
	[   64.484248]  ? atomic_notifier_call_chain+0x26/0x110
	[   64.484854]  __x64_sys_reboot+0x339/0x340
	[   64.485275]  ? __pfx___x64_sys_reboot+0x10/0x10
	[   64.485986]  ? __pfx_ftrace_likely_update+0x10/0x10
	[   64.486564]  ? srso_alias_return_thunk+0x5/0xfbef5
	[   64.487118]  ? __pfx_ftrace_likely_update+0x10/0x10
	[   64.487668]  ? srso_alias_return_thunk+0x5/0xfbef5
	[   64.488211]  ? __pfx_ftrace_likely_update+0x10/0x10
	[   64.488772]  ? srso_alias_return_thunk+0x5/0xfbef5
	[   64.489317]  ? do_syscall_64+0x77/0x260
	[   64.489902]  do_syscall_64+0xd6/0x260
	[   64.490349]  ? exc_page_fault+0xc8/0x130
	[   64.490811]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
	[   64.491332] RIP: 0033:0x7ff47a704a27
	[   64.491789] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 c1 43 0f 00 f7 d8 64 89 02 b8
	[   64.493809] RSP: 002b:00007ffed123b8e8 EFLAGS: 00000217 ORIG_RAX: 00000000000000a9
	[   64.494601] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007ff47a704a27
	[   64.495454] RDX: 000000004321fedc RSI: 0000000028121969 RDI: 00000000fee1dead
	[   64.496254] RBP: 000000000000000a R08: 000056492265c690 R09: 0000000000000000
	[   64.497111] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000000011
	[   64.498048] R13: 00007ffed123bca8 R14: 000056492263c28c R15: 000000000000000a
	[   64.498940]  </TASK>
	[   64.499238] irq event stamp: 1436466
	[   64.499677] hardirqs last  enabled at (1436465): [<ffffffff847d9edb>] _raw_spin_unlock_irqrestore+0x8b/0x150
	[   64.500762] hardirqs last disabled at (1436466): [<ffffffff81369164>] native_machine_shutdown+0xc4/0x1f0
	[   64.501815] softirqs last  enabled at (1436194): [<ffffffff8142617f>] handle_softirqs+0xa0f/0xb80
	[   64.502716] softirqs last disabled at (1436175): [<ffffffff81426532>] __irq_exit_rcu+0xc2/0x200
	[   64.503728] ---[ end trace 0000000000000000 ]---
	[   64.504458] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000
	[   64.505337] CPU: 0 UID: 0 PID: 1 Comm: virtme-ng-init Tainted: G        W   E    N 6.13.0-rc7-next-20250117-upstream #45
	[   64.506249] Tainted: [W]=WARN, [E]=UNSIGNED_MODULE, [N]=TEST
	[   64.506874] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
	[   64.507881] Call Trace:
	[   64.508136]  <TASK>
	[   64.508452]  dump_stack_lvl+0x8e/0x1b0
	[   64.508975]  ? __pfx_dump_stack_lvl+0x10/0x10
	[   64.509448]  ? __pfx__printk+0x10/0x10
	[   64.509891]  ? srso_alias_return_thunk+0x5/0xfbef5
	[   64.510363]  panic+0x256/0x550
	[   64.510745]  ? srso_alias_return_thunk+0x5/0xfbef5
	[   64.511209]  ? __pfx_panic+0x10/0x10
	[   64.511665]  ? __pfx_ftrace_likely_update+0x10/0x10
	[   64.512269]  ? _raw_spin_unlock_irq+0x66/0x80
	[   64.512757]  ? srso_alias_return_thunk+0x5/0xfbef5
	[   64.513200]  ? seccomp_filter_release+0xf9/0x130
	[   64.513696]  do_exit+0x623/0x2470
	[   64.514073]  ? srso_alias_return_thunk+0x5/0xfbef5
	[   64.514523]  ? __rcu_read_unlock+0xf3/0x630
	[   64.514937]  ? __pfx_do_exit+0x10/0x10
	[   64.515342]  ? __pfx_ftrace_likely_update+0x10/0x10
	[   64.515949]  ? native_machine_shutdown+0x12c/0x1f0
	[   64.516437]  ? atomic_notifier_call_chain+0x26/0x110
	[   64.516941]  __x64_sys_reboot+0x339/0x340
	[   64.517302]  ? __pfx___x64_sys_reboot+0x10/0x10
	[   64.517793]  ? __pfx_ftrace_likely_update+0x10/0x10
	[   64.518288]  ? srso_alias_return_thunk+0x5/0xfbef5
	[   64.518775]  ? __pfx_ftrace_likely_update+0x10/0x10
	[   64.519363]  ? srso_alias_return_thunk+0x5/0xfbef5
	[   64.519889]  ? __pfx_ftrace_likely_update+0x10/0x10
	[   64.520350]  ? srso_alias_return_thunk+0x5/0xfbef5
	[   64.520828]  ? do_syscall_64+0x77/0x260
	[   64.521192]  do_syscall_64+0xd6/0x260
	[   64.521537]  ? exc_page_fault+0xc8/0x130
	[   64.521922]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
	[   64.522510] RIP: 0033:0x7ff47a704a27
	[   64.522918] Code: 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 c1 43 0f 00 f7 d8 64 89 02 b8
	[   64.524412] RSP: 002b:00007ffed123b8e8 EFLAGS: 00000217 ORIG_RAX: 00000000000000a9
	[   64.525153] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007ff47a704a27
	[   64.525966] RDX: 000000004321fedc RSI: 0000000028121969 RDI: 00000000fee1dead
	[   64.526651] RBP: 000000000000000a R08: 000056492265c690 R09: 0000000000000000
	[   64.527392] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000000011
	[   64.528059] R13: 00007ffed123bca8 R14: 000056492263c28c R15: 000000000000000a
	[   64.528847]  </TASK>
	[   64.529287] Kernel Offset: disabled
	[   64.530125] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000000 ]---

> 
> below?
> 
> OK, it can be multi-threaded, but still this look like we have more
> problems than just WARN_ON(irqs_disabled()).
> 
> > When pid=1 is being killed, then synchronize_group_exit() will be called
> > with irq enabled (as shown by the warning above),
> 
> Hmm. synchronize_group_exit() is called right after WARN_ON(irqs_disabled())
> you observe, WARN_ON() won't enable irqs?

That is Mateusz raised as well. We are discussing it on the other
thread. TL;DR: Is that warning wrongly placed, since IRQs will be
enabled after synchronize_group_exit()?!

Let's move this discussion to that thread, please.
Re: [PATCH] exit: skip IRQ disabled warning during power off
Posted by Oleg Nesterov 10 months, 1 week ago
On 04/04, Breno Leitao wrote:
>
> On Fri, Apr 04, 2025 at 04:16:24PM +0200, Oleg Nesterov wrote:
> > On 04/04, Breno Leitao wrote:
> > >
> > > 	[  254.474273] reboot: Power down
> > > 	[  254.479332] ------------[ cut here ]------------
> > > 	[  254.479934] WARNING: CPU: 0 PID: 1 at kernel/exit.c:881 do_exit (kernel/exit.c:881)
> >                                        ^^^^^^
> > confused... so it seems that the init task does
> > sys_reboot(LINUX_REBOOT_CMD_POWER_OFF) and exits?
> >
> > But if the init task exits for any reason it should trigger
> >
> > 	if (unlikely(is_global_init(tsk)))
> > 		panic("Attempted to kill init! exitcode=0x%08x\n"
>
> That is showing up later

OK, and this proves that we have more problems than just this WARN_ON().

Again, this is not my area, most probably I am wrong, but it seems to me
that do_exit() simply should not be called.

> We are discussing it on the other thread.
...
> Let's move this discussion to that thread, please.

Which thread? I wasn't cc'ed...

Oleg.
Re: [PATCH] exit: skip IRQ disabled warning during power off
Posted by Andrew Morton 9 months ago
On Fri, 4 Apr 2025 17:31:04 +0200 Oleg Nesterov <oleg@redhat.com> wrote:

> > > But if the init task exits for any reason it should trigger
> > >
> > > 	if (unlikely(is_global_init(tsk)))
> > > 		panic("Attempted to kill init! exitcode=0x%08x\n"
> >
> > That is showing up later
> 
> OK, and this proves that we have more problems than just this WARN_ON().
> 
> Again, this is not my area, most probably I am wrong, but it seems to me
> that do_exit() simply should not be called.
> 
> > We are discussing it on the other thread.
> ...
> > Let's move this discussion to that thread, please.
> 
> Which thread? I wasn't cc'ed...

Neither was I.

I'll drop this patch from mm.git.  Someone please cc me on any
new developments.