hung_task: Support to panic when the maximum number of hung task warnings is reached

[PATCH][RFC] hung_task: Support to panic when the maximum number of hung task warnings is reached

Posted by lirongqing 1 week, 1 day ago

From: Li RongQing <lirongqing@baidu.com>

Currently the hung task detector can either panic immediately or continue
operation when hung tasks are detected. However, there are scenarios
where we want a more balanced approach:

- We don't want the system to panic immediately when a few hung tasks
  are detected, as the system may be able to recover
- And we also don't want the system to stall indefinitely with multiple
  hung tasks

This commit introduces a new mode (value 2) for the hung task panic behavior.
When set to 2, the system will panic only after the maximum number of hung
task warnings (hung_task_warnings) has been reached.

This provides a middle ground between immediate panic and potentially
infinite stall, allowing for automated vmcore generation after a reasonable
number of hung task incidents.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 15 ++++++++-------
 Documentation/admin-guide/sysctl/kernel.rst     |  1 +
 kernel/hung_task.c                              |  5 +++--
 lib/Kconfig.debug                               |  4 ++--
 4 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 5a7a83c..f2a9876 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1993,13 +1993,14 @@
 
 	hung_task_panic=
 			[KNL] Should the hung task detector generate panics.
-			Format: 0 | 1
-
-			A value of 1 instructs the kernel to panic when a
-			hung task is detected. The default value is controlled
-			by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time
-			option. The value selected by this boot parameter can
-			be changed later by the kernel.hung_task_panic sysctl.
+			Format: 0 | 1 | 2
+
+			A value of 1 instructs the kernel to panic when a hung task is detected.
+			A value of 2 instructs the kernel to panic when hung_task_warnings is
+			decreased to 0.  The default value is controlled by the
+			CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time option. The value selected
+			by this boot parameter can be changed later by the kernel.hung_task_panic
+			sysctl.
 
 	hvc_iucv=	[S390]	Number of z/VM IUCV hypervisor console (HVC)
 				terminal devices. Valid values: 0..8
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 8b49eab..6f77241 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -403,6 +403,7 @@ This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
 = =================================================
 0 Continue operation. This is the default behavior.
 1 Panic immediately.
+2 Panic when hung_task_warnings is decreased to 0.
 = =================================================
 
 
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 8708a12..b052ec7 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -219,7 +219,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 
 	trace_sched_process_hang(t);
 
-	if (sysctl_hung_task_panic) {
+	if ((sysctl_hung_task_panic == 1) ||
+		(!sysctl_hung_task_warnings && sysctl_hung_task_panic == 2)) {
 		console_verbose();
 		hung_task_show_lock = true;
 		hung_task_call_panic = true;
@@ -385,7 +386,7 @@ static const struct ctl_table hung_task_sysctls[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
-		.extra2		= SYSCTL_ONE,
+		.extra2		= SYSCTL_TWO,
 	},
 	{
 		.procname	= "hung_task_check_count",
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index dc0e0c6..e7cf166 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1264,10 +1264,10 @@ config DEFAULT_HUNG_TASK_TIMEOUT
 	  Keeping the default should be fine in most cases.
 
 config BOOTPARAM_HUNG_TASK_PANIC
-	bool "Panic (Reboot) On Hung Tasks"
+	int "Panic (Reboot) On Hung Tasks"
 	depends on DETECT_HUNG_TASK
 	help
-	  Say Y here to enable the kernel to panic on "hung tasks",
+	  Say 1|2 here to enable the kernel to panic on "hung tasks",
 	  which are bugs that cause the kernel to leave a task stuck
 	  in uninterruptible "D" state.
 
-- 
2.9.4
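
The condition this patch adds to check_hung_task() can be modelled outside the kernel. The sketch below is a standalone illustration, not kernel code: the enum names and both helper functions are hypothetical. Only the boolean expression comes from the diff. The second helper assumes that each reported hung task decrements a positive warning budget by one, and that the decrement happens after the panic check. That ordering is not visible in the hunk above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three values of kernel.hung_task_panic. */
enum hung_task_panic_mode {
	HUNG_TASK_PANIC_OFF = 0,            /* continue operation */
	HUNG_TASK_PANIC_IMMEDIATE = 1,      /* panic on the first hung task */
	HUNG_TASK_PANIC_AFTER_WARNINGS = 2, /* panic once the budget is spent */
};

/* Mirrors the condition the patch adds to check_hung_task(). */
static bool hung_task_should_panic(int panic_mode, int warnings_left)
{
	/* The sysctl table in the patch clamps the value to 0..2. */
	assert(panic_mode >= HUNG_TASK_PANIC_OFF &&
	       panic_mode <= HUNG_TASK_PANIC_AFTER_WARNINGS);

	return panic_mode == HUNG_TASK_PANIC_IMMEDIATE ||
	       (warnings_left == 0 &&
		panic_mode == HUNG_TASK_PANIC_AFTER_WARNINGS);
}

/*
 * Assumption (not shown in the diff): a positive warning budget is
 * decremented by one per reported hung task, after the panic check.
 * Returns the 1-based detection that would panic, or 0 if none of the
 * first max_detections detections would.
 */
static int detections_until_panic(int panic_mode, int warnings_left,
				  int max_detections)
{
	for (int n = 1; n <= max_detections; n++) {
		if (hung_task_should_panic(panic_mode, warnings_left))
			return n;
		if (warnings_left > 0)
			warnings_left--;
	}
	return 0;
}
```

Under these assumptions, mode 2 with a budget of 10 warnings would panic on the 11th detection, that is, on the first hung task found after the budget has already reached 0. A budget of -1 never reaches 0, so mode 2 would never panic in that configuration.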

Re: [PATCH][RFC] hung_task: Support to panic when the maximum number of hung task warnings is reached

Posted by Randy Dunlap 1 week, 1 day ago


On 9/22/25 8:37 PM, lirongqing wrote:
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index dc0e0c6..e7cf166 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1264,10 +1264,10 @@ config DEFAULT_HUNG_TASK_TIMEOUT
>  	  Keeping the default should be fine in most cases.
>  
>  config BOOTPARAM_HUNG_TASK_PANIC
> -	bool "Panic (Reboot) On Hung Tasks"
> +	int "Panic (Reboot) On Hung Tasks"
>  	depends on DETECT_HUNG_TASK
>  	help
> -	  Say Y here to enable the kernel to panic on "hung tasks",
> +	  Say 1|2 here to enable the kernel to panic on "hung tasks",

Please make that   "1|2"    more user-friendly, e.g., "1 or 2".

>  	  which are bugs that cause the kernel to leave a task stuck
>  	  in uninterruptible "D" state.

-- 
~Randy

Re: [PATCH][RFC] hung_task: Support to panic when the maximum number of hung task warnings is reached

Posted by Andrew Morton 1 week, 1 day ago

On Tue, 23 Sep 2025 11:37:40 +0800 lirongqing <lirongqing@baidu.com> wrote:

> Currently the hung task detector can either panic immediately or continue
> operation when hung tasks are detected. However, there are scenarios
> where we want a more balanced approach:
> 
> - We don't want the system to panic immediately when a few hung tasks
>   are detected, as the system may be able to recover
> - And we also don't want the system to stall indefinitely with multiple
>   hung tasks
> 
> This commit introduces a new mode (value 2) for the hung task panic behavior.
> When set to 2, the system will panic only after the maximum number of hung
> task warnings (hung_task_warnings) has been reached.
> 
> This provides a middle ground between immediate panic and potentially
> infinite stall, allowing for automated vmcore generation after a reasonable

I assume the same argument applies to the NMI watchdog, to the
softlockup detector and to the RCU stall detector?

A general framework to handle all of these might be better.  But why do
it in kernel at all?  What about a userspace detector which parses
kernel logs (or new procfs counters) and makes such decisions?
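
As a rough illustration of the userspace alternative raised above, the sketch below counts hung task reports in a captured chunk of kernel log text and applies a threshold. It is a minimal sketch under stated assumptions: both function names and the threshold policy are hypothetical, and the marker string assumes the report format "INFO: task NAME:PID blocked for more than N seconds." How the log text is obtained and how a crash dump is then triggered are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumed substring of the hung task report line in the kernel log. */
#define HUNG_TASK_MARKER "blocked for more than"

/* Count occurrences of the hung task marker in a NUL-terminated log. */
static size_t count_hung_task_reports(const char *log)
{
	const size_t marker_len = strlen(HUNG_TASK_MARKER);
	size_t count = 0;

	assert(log != NULL);

	for (const char *p = strstr(log, HUNG_TASK_MARKER); p != NULL;
	     p = strstr(p + marker_len, HUNG_TASK_MARKER))
		count++;

	return count;
}

/*
 * Hypothetical policy: escalate once the number of reports reaches the
 * threshold. A threshold of 0 disables escalation.
 */
static bool userspace_should_escalate(size_t reports, size_t threshold)
{
	return threshold > 0 && reports >= threshold;
}
```

A daemon built on this idea would keep the policy out of the kernel, at the cost of depending on userspace still being able to run while tasks are hung.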

RE: Re: [PATCH][RFC] hung_task: Support to panic when the maximum number of hung task warnings is reached