[RFC PATCH] sched/core: Enhanced debug logs in do_task_dead()

Zhongqiu Han posted 1 patch 1 year ago
kernel/sched/core.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
[RFC PATCH] sched/core: Enhanced debug logs in do_task_dead()
Posted by Zhongqiu Han 1 year ago
If BUG() is a NOP, dump the problematic stack for debugging purposes.

Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com>
---
If BUG() is a NOP, it should make sense for debugging purposes. However,
just arising the patch with RFC, because at least for now, I haven't found
a definition of BUG() as NOP in various architectures. Thanks~

 kernel/sched/core.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f59f2c0f6e32..fc36a9c5c136 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6774,7 +6774,11 @@ void __noreturn do_task_dead(void)
 	__schedule(SM_NONE);
 	BUG();
 
-	/* Avoid "noreturn function does return" - but don't continue if BUG() is a NOP: */
+	/*
+	 * Don't continue if BUG() is a NOP to avoid "noreturn function
+	 * does return" and dump stack for this case.
+	 */
+	dump_stack();
 	for (;;)
 		cpu_relax();
 }
-- 
2.25.1
Re: [RFC PATCH] sched/core: Enhanced debug logs in do_task_dead()
Posted by Peter Zijlstra 1 year ago
On Tue, Dec 10, 2024 at 09:45:13PM +0800, Zhongqiu Han wrote:
> If BUG() is a NOP, dump the problematic stack for debugging purposes.
> 
> Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com>
> ---
> If BUG() is a NOP, it should make sense for debugging purposes. However,
> just arising the patch with RFC, because at least for now, I haven't found
> a definition of BUG() as NOP in various architectures. Thanks~

Yeah, this don't make sense. If you want a stack-trace you shouldn't
have killed BUG.

And yeah, having done a quick peek, I don't see how you can kill BUG
these days other than explicitly modyfing your source, in which case you
get to keep the pieces.
Re: [RFC PATCH] sched/core: Enhanced debug logs in do_task_dead()
Posted by Zhongqiu Han 1 year ago
On 12/10/2024 10:18 PM, Peter Zijlstra wrote:
> On Tue, Dec 10, 2024 at 09:45:13PM +0800, Zhongqiu Han wrote:
>> If BUG() is a NOP, dump the problematic stack for debugging purposes.
>>
>> Signed-off-by: Zhongqiu Han <quic_zhonhan@quicinc.com>
>> ---
>> If BUG() is a NOP, it should make sense for debugging purposes. However,
>> just arising the patch with RFC, because at least for now, I haven't found
>> a definition of BUG() as NOP in various architectures. Thanks~
> 
> Yeah, this don't make sense. If you want a stack-trace you shouldn't
> have killed BUG.
> 
> And yeah, having done a quick peek, I don't see how you can kill BUG
> these days other than explicitly modyfing your source, in which case you
> get to keep the pieces.

Hi Peter,
Thanks a lot for your review and reply!

My initial thought was that if BUG() is a NOP, we should dump the stack
for debugging purposes before executing cpu_relax(), because according
to the current do_task_dead() code comments, BUG() might be a NOP in
some cases.
void __noreturn do_task_dead(void)
{
         /* Causes final put_task_struct in finish_task_switch(): */
         set_special_state(TASK_DEAD);

         /* Tell freezer to ignore us: */
         current->flags |= PF_NOFREEZE;

         __schedule(SM_NONE);
         BUG();

         /* Avoid "noreturn function does return" - but don't continue if
BUG() is a NOP: */-------------------> This comments
         for (;;)
                 cpu_relax();
}

For example, in the code _before_ 7/4/2014 commit a4b5d580e078 ('bug:
Make BUG() always stop the machine')., BUG() is indeed a NOP if (
!CONFIG_BUG && !HAVE_ARCH_BUG)

  #else /* !CONFIG_BUG */
  #ifndef HAVE_ARCH_BUG
-#define BUG() do {} while (0)
+#define BUG() do {} while (1)
  #endif



However, since 10/4/2024 commit 5284984a4fba ('bug: Fix no-return
statement warning with !CONFIG_BUG'), with !CONFIG_BUG, the default
BUG() is implemented to call unreachable() to ensure it does not return,
and if all architecture-specific implementations of BUG() also do not
return, I now think my thought should be meaningless.

commit 5284984a4fba ('bug: Fix no-return-statement warning with 
!CONFIG_BUG')
  #else /* !CONFIG_BUG */
  #ifndef HAVE_ARCH_BUG
-#define BUG() do {} while (1)
+#define BUG() do {             \
+       do {} while (1);        \
+       unreachable();          \
+} while (0)
  #endif


By the way, do you think it is reasonable to remove the cpu_relax code
by evaluating whether BUG() does not return in all architectures and
configurations? Additionally, _if_ some architecture-specific
implementations of BUG() do not have the capability to output the thread
stack, should we consider using the panic() function as a replacement?

Thanks for your time.



-- 
Thx and BRs,
Zhongqiu Han