Introduce support for writing to /proc/sys/kernel/hung_task_detect_count.
Writing any value to this file atomically resets the counter of detected
hung tasks to zero. This grants system administrators the ability to clear
the cumulative diagnostic history after resolving an incident, simplifying
monitoring without requiring a system restart.
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
Documentation/admin-guide/sysctl/kernel.rst | 2 +-
kernel/hung_task.c | 29 +++++++++++++++++++--
2 files changed, 28 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 239da22c4e28..43c17b919969 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -418,7 +418,7 @@ hung_task_detect_count
======================
Indicates the total number of tasks that have been detected as hung since
-the system boot.
+the system boot. The counter can be reset to zero when written to.
This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 5902573200c0..01ce46a107b0 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -375,6 +375,31 @@ static long hung_timeout_jiffies(unsigned long last_checked,
}
#ifdef CONFIG_SYSCTL
+
+/**
+ * proc_dohung_task_detect_count - proc handler for hung_task_detect_count
+ * @table: Pointer to the struct ctl_table definition for this proc entry
+ * @write: Flag indicating the operation
+ * @buffer: User space buffer for data transfer
+ * @lenp: Pointer to the length of the data being transferred
+ * @ppos: Pointer to the current file offset
+ *
+ * This handler is used for reading the current hung task detection count
+ * and for resetting it to zero when a write operation is performed.
+ * Returns 0 on success or a negative error code on failure.
+ */
+static int proc_dohung_task_detect_count(const struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ if (!write)
+ return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
+
+ WRITE_ONCE(sysctl_hung_task_detect_count, 0);
+ *ppos += *lenp;
+
+ return 0;
+}
+
/*
* Process updating of timeout sysctl
*/
@@ -457,8 +482,8 @@ static const struct ctl_table hung_task_sysctls[] = {
.procname = "hung_task_detect_count",
.data = &sysctl_hung_task_detect_count,
.maxlen = sizeof(unsigned long),
- .mode = 0444,
- .proc_handler = proc_doulongvec_minmax,
+ .mode = 0644,
+ .proc_handler = proc_dohung_task_detect_count,
},
{
.procname = "hung_task_sys_info",
--
2.51.0
Adding Joel into Cc. He is improving the sysctl API...
On Mon 2025-12-15 22:00:36, Aaron Tomlin wrote:
> Introduce support for writing to /proc/sys/kernel/hung_task_detect_count.
>
> Writing any value to this file atomically resets the counter of detected
> hung tasks to zero. This grants system administrators the ability to clear
> the cumulative diagnostic history after resolving an incident, simplifying
> monitoring without requiring a system restart.
>
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -418,7 +418,7 @@ hung_task_detect_count
> ======================
>
> Indicates the total number of tasks that have been detected as hung since
> -the system boot.
> +the system boot. The counter can be reset to zero when written to.
>
> This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
>
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 5902573200c0..01ce46a107b0 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -375,6 +375,31 @@ static long hung_timeout_jiffies(unsigned long last_checked,
> }
>
> #ifdef CONFIG_SYSCTL
> +
> +/**
> + * proc_dohung_task_detect_count - proc handler for hung_task_detect_count
> + * @table: Pointer to the struct ctl_table definition for this proc entry
> + * @write: Flag indicating the operation
> + * @buffer: User space buffer for data transfer
> + * @lenp: Pointer to the length of the data being transferred
> + * @ppos: Pointer to the current file offset
> + *
> + * This handler is used for reading the current hung task detection count
> + * and for resetting it to zero when a write operation is performed.
> + * Returns 0 on success or a negative error code on failure.
> + */
> +static int proc_dohung_task_detect_count(const struct ctl_table *table, int write,
> + void *buffer, size_t *lenp, loff_t *ppos)
> +{
> + if (!write)
> + return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
There have been some changes in the sysctl API recently, see
https://lore.kernel.org/lkml/20251016-jag-sysctl_conv-v2-0-a2f16529acc4@kernel.org/
They are backward compatible, so the above code works. But it would be
nice to make it up-to-date, namely:
+ Replace "write" with "dir"
+ Use SYSCTL_USER_TO_KERN(dir) instead of (!write)
> + WRITE_ONCE(sysctl_hung_task_detect_count, 0);
I might be too conservative. But it looks weird to allow clearing the
value by any write. It would be better to return -EINVAL for non-zero
values. This would require using a copy of struct ctl_table and reading
the value into a temporary variable.
> + *ppos += *lenp;
> +
> + return 0;
> +}
I have played with the code. The diff on top of this patch would
look like:
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 01ce46a107b0..ebb3dfd0b148 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -379,7 +379,7 @@ static long hung_timeout_jiffies(unsigned long last_checked,
/**
* proc_dohung_task_detect_count - proc handler for hung_task_detect_count
* @table: Pointer to the struct ctl_table definition for this proc entry
- * @write: Flag indicating the operation
+ * @dir: Flag indicating the operation
* @buffer: User space buffer for data transfer
* @lenp: Pointer to the length of the data being transferred
* @ppos: Pointer to the current file offset
@@ -388,16 +388,29 @@ static long hung_timeout_jiffies(unsigned long last_checked,
* and for resetting it to zero when a write operation is performed.
* Returns 0 on success or a negative error code on failure.
*/
-static int proc_dohung_task_detect_count(const struct ctl_table *table, int write,
+static int proc_dohung_task_detect_count(const struct ctl_table *table, int dir,
void *buffer, size_t *lenp, loff_t *ppos)
{
- if (!write)
- return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
+ unsigned long detect_count = READ_ONCE(sysctl_hung_task_detect_count);
+ struct ctl_table t;
+ int err;
- WRITE_ONCE(sysctl_hung_task_detect_count, 0);
- *ppos += *lenp;
+ t = *table;
+ t.data = &detect_count;
- return 0;
+ err = proc_doulongvec_minmax(&t, dir, buffer, lenp, ppos);
+ if (err < 0)
+ return err;
+
+ if (SYSCTL_USER_TO_KERN(dir)) {
+ /* The only valid value for clearing is zero. */
+ if (detect_count)
+ return -EINVAL;
+
+ WRITE_ONCE(sysctl_hung_task_detect_count, 0);
+ }
+
+ return err;
}
/*
Best Regards,
Petr
On 2025/12/17 20:48, Petr Mladek wrote:
> Adding Joel into Cc. He is improving the sysctl API...
>
> On Mon 2025-12-15 22:00:36, Aaron Tomlin wrote:
>> Introduce support for writing to /proc/sys/kernel/hung_task_detect_count.
>>
>> Writing any value to this file atomically resets the counter of detected
>> hung tasks to zero. This grants system administrators the ability to clear
>> the cumulative diagnostic history after resolving an incident, simplifying
>> monitoring without requiring a system restart.
>>
>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>> @@ -418,7 +418,7 @@ hung_task_detect_count
>> ======================
>>
>> Indicates the total number of tasks that have been detected as hung since
>> -the system boot.
>> +the system boot. The counter can be reset to zero when written to.
>>
>> This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
>>
>> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
>> index 5902573200c0..01ce46a107b0 100644
>> --- a/kernel/hung_task.c
>> +++ b/kernel/hung_task.c
>> @@ -375,6 +375,31 @@ static long hung_timeout_jiffies(unsigned long last_checked,
>> }
>>
>> #ifdef CONFIG_SYSCTL
>> +
>> +/**
>> + * proc_dohung_task_detect_count - proc handler for hung_task_detect_count
>> + * @table: Pointer to the struct ctl_table definition for this proc entry
>> + * @write: Flag indicating the operation
>> + * @buffer: User space buffer for data transfer
>> + * @lenp: Pointer to the length of the data being transferred
>> + * @ppos: Pointer to the current file offset
>> + *
>> + * This handler is used for reading the current hung task detection count
>> + * and for resetting it to zero when a write operation is performed.
>> + * Returns 0 on success or a negative error code on failure.
>> + */
>> +static int proc_dohung_task_detect_count(const struct ctl_table *table, int write,
>> + void *buffer, size_t *lenp, loff_t *ppos)
>> +{
>> + if (!write)
>> + return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
>
> There have been some changes in the sysctl API recently, see
> https://lore.kernel.org/lkml/20251016-jag-sysctl_conv-v2-0-a2f16529acc4@kernel.org/
>
> They are backward compatible, so the above code works. But it would be
> nice to make it up-to-date, namely:
>
> + Replace "write" with "dir"
> + Use SYSCTL_USER_TO_KERN(dir) instead of (!write)
>
>
>> + WRITE_ONCE(sysctl_hung_task_detect_count, 0);
>
> I might be too conservative. But it looks weird to allow clearing the
> value by any write. It would be better to return -EINVAL for non-zero
> values. This would require using a copy of struct ctl_table and read
> the value into a temporary variable.
That's okay, I think. See vmstat_refresh() for a similar pattern - it
accepts any write value to trigger the refresh operation without
validating the specific value ;)
On Wed, Dec 17, 2025 at 09:21:01PM +0800, Lance Yang wrote:
>
>
> On 2025/12/17 20:48, Petr Mladek wrote:
> > Adding Joel into Cc. He is improving the sysctl API...
> >
> > On Mon 2025-12-15 22:00:36, Aaron Tomlin wrote:
> > > Introduce support for writing to /proc/sys/kernel/hung_task_detect_count.
> > >
> > > Writing any value to this file atomically resets the counter of detected
> > > hung tasks to zero. This grants system administrators the ability to clear
> > > the cumulative diagnostic history after resolving an incident, simplifying
> > > monitoring without requiring a system restart.
I see that this might be refactored due to a dependency that was not
considered. I'll write my comments, but will also drop it from my radar
for now.
> > >
> > > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > > @@ -418,7 +418,7 @@ hung_task_detect_count
> > > ======================
> > > Indicates the total number of tasks that have been detected as hung since
> > > -the system boot.
> > > +the system boot. The counter can be reset to zero when written to.
Would it make more sense to write it like this:
Indicates the total number of tasks that have been detected as hung
since the system boot or since the counter was reset. Counter is
zeroed when written to.
> > > This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
> > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > > index 5902573200c0..01ce46a107b0 100644
> > > --- a/kernel/hung_task.c
> > > +++ b/kernel/hung_task.c
> > > @@ -375,6 +375,31 @@ static long hung_timeout_jiffies(unsigned long last_checked,
> > > }
> > > #ifdef CONFIG_SYSCTL
> > > +
> > > +/**
> > > + * proc_dohung_task_detect_count - proc handler for hung_task_detect_count
> > > + * @table: Pointer to the struct ctl_table definition for this proc entry
> > > + * @write: Flag indicating the operation
> > > + * @buffer: User space buffer for data transfer
> > > + * @lenp: Pointer to the length of the data being transferred
> > > + * @ppos: Pointer to the current file offset
> > > + *
> > > + * This handler is used for reading the current hung task detection count
> > > + * and for resetting it to zero when a write operation is performed.
> > > + * Returns 0 on success or a negative error code on failure.
> > > + */
> > > +static int proc_dohung_task_detect_count(const struct ctl_table *table, int write,
> > > + void *buffer, size_t *lenp, loff_t *ppos)
> > > +{
> > > + if (!write)
> > > + return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
> >
> > There have been some changes in the sysctl API recently, see
> > https://lore.kernel.org/lkml/20251016-jag-sysctl_conv-v2-0-a2f16529acc4@kernel.org/
> >
> > They are backward compatible, so the above code works. But it would be
> > nice to make it up-to-date, namely:
> >
> > + Replace "write" with "dir"
> > + Use SYSCTL_USER_TO_KERN(dir) instead of (!write)
> >
> >
> > > + WRITE_ONCE(sysctl_hung_task_detect_count, 0);
> >
> > I might be too conservative. But it looks weird to allow clearing the
> > value by any write. It would be better to return -EINVAL for non-zero
> > values. This would require using a copy of struct ctl_table and read
> > the value into a temporary variable.
>
> That's okay, I think. See vmstat_refresh() for a similar pattern - it
You are correct, but wouldn't it make more sense to zero it out only when
the value written by the user is zero, and return -EINVAL for any
other value?
best
--
Joel Granados
On Thu, Dec 18, 2025 at 10:08:07AM +0100, Joel Granados wrote:
> Would it make more sense to write it like this:
>
>     Indicates the total number of tasks that have been detected as hung
>     since the system boot or since the counter was reset. Counter is
>     zeroed when written to.
Hi Joel,
Thank you for the feedback.
> > That's okay, I think. See vmstat_refresh() for a similar pattern - it
> You are correct but wouldn't it make more sense to zero it out only when
> the write value passed by the user is zero. And return -EINVAL for any
> other value?
I entirely agree with your assessment: only a value of zero should be
considered valid for resetting sysctl_hung_task_detect_count. Any attempt
by a user to write a non-zero value should be rejected with a return of
-EINVAL. This approach provides a much more intuitive interface.
Kind regards,
--
Aaron Tomlin
On 2025/12/18 17:08, Joel Granados wrote:
> On Wed, Dec 17, 2025 at 09:21:01PM +0800, Lance Yang wrote:
>>
>>
>> On 2025/12/17 20:48, Petr Mladek wrote:
>>> Adding Joel into Cc. He is improving the sysctl API...
>>>
>>> On Mon 2025-12-15 22:00:36, Aaron Tomlin wrote:
>>>> Introduce support for writing to /proc/sys/kernel/hung_task_detect_count.
>>>>
>>>> Writing any value to this file atomically resets the counter of detected
>>>> hung tasks to zero. This grants system administrators the ability to clear
>>>> the cumulative diagnostic history after resolving an incident, simplifying
>>>> monitoring without requiring a system restart.
> I see that this might be refactored due to dependency that was not
> considered. I'll write my comments, but will also drop it from my radar
> for now.
>
>>>>
>>>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>>>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>>>> @@ -418,7 +418,7 @@ hung_task_detect_count
>>>> ======================
>>>> Indicates the total number of tasks that have been detected as hung since
>>>> -the system boot.
>>>> +the system boot. The counter can be reset to zero when written to.
> Would it make more sense to write it like this:
>
> Indicates the total number of tasks that have been detected as hung
> since the system boot or since the counter was reset. Counter is
> zeroed when written to.
>
>>>> This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
>>>> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
>>>> index 5902573200c0..01ce46a107b0 100644
>>>> --- a/kernel/hung_task.c
>>>> +++ b/kernel/hung_task.c
>>>> @@ -375,6 +375,31 @@ static long hung_timeout_jiffies(unsigned long last_checked,
>>>> }
>>>> #ifdef CONFIG_SYSCTL
>>>> +
>>>> +/**
>>>> + * proc_dohung_task_detect_count - proc handler for hung_task_detect_count
>>>> + * @table: Pointer to the struct ctl_table definition for this proc entry
>>>> + * @write: Flag indicating the operation
>>>> + * @buffer: User space buffer for data transfer
>>>> + * @lenp: Pointer to the length of the data being transferred
>>>> + * @ppos: Pointer to the current file offset
>>>> + *
>>>> + * This handler is used for reading the current hung task detection count
>>>> + * and for resetting it to zero when a write operation is performed.
>>>> + * Returns 0 on success or a negative error code on failure.
>>>> + */
>>>> +static int proc_dohung_task_detect_count(const struct ctl_table *table, int write,
>>>> + void *buffer, size_t *lenp, loff_t *ppos)
>>>> +{
>>>> + if (!write)
>>>> + return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
>>>
>>> There have been some changes in the sysctl API recently, see
>>> https://lore.kernel.org/lkml/20251016-jag-sysctl_conv-v2-0-a2f16529acc4@kernel.org/
>>>
>>> They are backward compatible, so the above code works. But it would be
>>> nice to make it up-to-date, namely:
>>>
>>> + Replace "write" with "dir"
>>> + Use SYSCTL_USER_TO_KERN(dir) instead of (!write)
>>>
>>>
>>>> + WRITE_ONCE(sysctl_hung_task_detect_count, 0);
>>>
>>> I might be too conservative. But it looks weird to allow clearing the
>>> value by any write. It would be better to return -EINVAL for non-zero
>>> values. This would require using a copy of struct ctl_table and read
>>> the value into a temporary variable.
>>
>> That's okay, I think. See vmstat_refresh() for a similar pattern - it
> You are correct, but wouldn't it make more sense to zero it out only when
> the value written by the user is zero, and return -EINVAL for any
> other value?
Fair point, let's return -EINVAL when user writes a non-zero value :)
On 2025/12/16 11:00, Aaron Tomlin wrote:
> Introduce support for writing to /proc/sys/kernel/hung_task_detect_count.
>
> Writing any value to this file atomically resets the counter of detected
> hung tasks to zero. This grants system administrators the ability to clear
> the cumulative diagnostic history after resolving an incident, simplifying
> monitoring without requiring a system restart.
Thanks!
>
> Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
> ---
> Documentation/admin-guide/sysctl/kernel.rst | 2 +-
> kernel/hung_task.c | 29 +++++++++++++++++++--
> 2 files changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
> index 239da22c4e28..43c17b919969 100644
> --- a/Documentation/admin-guide/sysctl/kernel.rst
> +++ b/Documentation/admin-guide/sysctl/kernel.rst
> @@ -418,7 +418,7 @@ hung_task_detect_count
> ======================
>
> Indicates the total number of tasks that have been detected as hung since
> -the system boot.
> +the system boot. The counter can be reset to zero when written to.
>
> This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
>
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 5902573200c0..01ce46a107b0 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -375,6 +375,31 @@ static long hung_timeout_jiffies(unsigned long last_checked,
> }
>
> #ifdef CONFIG_SYSCTL
> +
> +/**
> + * proc_dohung_task_detect_count - proc handler for hung_task_detect_count
> + * @table: Pointer to the struct ctl_table definition for this proc entry
> + * @write: Flag indicating the operation
> + * @buffer: User space buffer for data transfer
> + * @lenp: Pointer to the length of the data being transferred
> + * @ppos: Pointer to the current file offset
> + *
> + * This handler is used for reading the current hung task detection count
> + * and for resetting it to zero when a write operation is performed.
> + * Returns 0 on success or a negative error code on failure.
> + */
> +static int proc_dohung_task_detect_count(const struct ctl_table *table, int write,
> + void *buffer, size_t *lenp, loff_t *ppos)
> +{
> + if (!write)
> + return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
> +
> + WRITE_ONCE(sysctl_hung_task_detect_count, 0);
The reset uses WRITE_ONCE() but the increment in check_hung_task()
is plain ++, it's just a stat counter so it's fine though ;)
Tested-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Cheers,
Lance
> + *ppos += *lenp;
> +
> + return 0;
> +}
> +
> /*
> * Process updating of timeout sysctl
> */
> @@ -457,8 +482,8 @@ static const struct ctl_table hung_task_sysctls[] = {
> .procname = "hung_task_detect_count",
> .data = &sysctl_hung_task_detect_count,
> .maxlen = sizeof(unsigned long),
> - .mode = 0444,
> - .proc_handler = proc_doulongvec_minmax,
> + .mode = 0644,
> + .proc_handler = proc_dohung_task_detect_count,
> },
> {
> .procname = "hung_task_sys_info",
On Wed 2025-12-17 15:31:25, Lance Yang wrote:
> On 2025/12/16 11:00, Aaron Tomlin wrote:
> > Introduce support for writing to /proc/sys/kernel/hung_task_detect_count.
> >
> > Writing any value to this file atomically resets the counter of detected
> > hung tasks to zero. This grants system administrators the ability to clear
> > the cumulative diagnostic history after resolving an incident, simplifying
> > monitoring without requiring a system restart.
>
> > --- a/Documentation/admin-guide/sysctl/kernel.rst
> > +++ b/Documentation/admin-guide/sysctl/kernel.rst
> > @@ -418,7 +418,7 @@ hung_task_detect_count
> > ======================
> > Indicates the total number of tasks that have been detected as hung since
> > -the system boot.
> > +the system boot. The counter can be reset to zero when written to.
> > This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
> > diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> > index 5902573200c0..01ce46a107b0 100644
> > --- a/kernel/hung_task.c
> > +++ b/kernel/hung_task.c
> > @@ -375,6 +375,31 @@ static long hung_timeout_jiffies(unsigned long last_checked,
> > }
> > #ifdef CONFIG_SYSCTL
> > +
> > +/**
> > + * proc_dohung_task_detect_count - proc handler for hung_task_detect_count
> > + * @table: Pointer to the struct ctl_table definition for this proc entry
> > + * @write: Flag indicating the operation
> > + * @buffer: User space buffer for data transfer
> > + * @lenp: Pointer to the length of the data being transferred
> > + * @ppos: Pointer to the current file offset
> > + *
> > + * This handler is used for reading the current hung task detection count
> > + * and for resetting it to zero when a write operation is performed.
> > + * Returns 0 on success or a negative error code on failure.
> > + */
> > +static int proc_dohung_task_detect_count(const struct ctl_table *table, int write,
> > + void *buffer, size_t *lenp, loff_t *ppos)
> > +{
> > + if (!write)
> > + return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
> > +
> > + WRITE_ONCE(sysctl_hung_task_detect_count, 0);
>
> The reset uses WRITE_ONCE() but the increment in check_hung_task()
> is plain ++, it's just a stat counter so it's fine though ;)
I was about to wave this away as well. But I am afraid that
it is even more complicated.
The counter is used to decide how many hung tasks were found
by a single check, see:
static void check_hung_uninterruptible_tasks(unsigned long timeout)
{
[...]
unsigned long prev_detect_count = sysctl_hung_task_detect_count;
[...]
for_each_process_thread(g, t) {
[...]
check_hung_task(t, timeout, prev_detect_count);
}
[...]
if (!(sysctl_hung_task_detect_count - prev_detect_count))
return;
if (need_warning || hung_task_call_panic) {
si_mask |= SYS_INFO_LOCKS;
if (sysctl_hung_task_all_cpu_backtrace)
si_mask |= SYS_INFO_ALL_BT;
}
sys_info(si_mask);
if (hung_task_call_panic)
panic("hung_task: blocked tasks");
}
Any race might cause false positives and even panic() !!!
And the race window is rather big (checking all processes).
This race can't be prevented by an atomic read/write.
IMHO, the only solution would be to add some locking,
e.g. mutex and take it around the entire
check_hung_uninterruptible_tasks().
On one hand, it might work. The code is called from
a task context...
But adding a lock into a lockup-detector code triggers
some warning bells in my head.
Honestly, I am not sure if it is worth it. I understand
the motivation for the reset. But IMHO, even more important
is to make sure that the watchdog works as expected.
Best Regards,
Petr
On Wed, Dec 17, 2025 at 02:09:14PM +0100, Petr Mladek wrote:
> The counter is used to decide how many hung tasks were found
> by a single check, see:
>
> static void check_hung_uninterruptible_tasks(unsigned long timeout)
> {
> [...]
> unsigned long prev_detect_count = sysctl_hung_task_detect_count;
> [...]
> for_each_process_thread(g, t) {
> [...]
> check_hung_task(t, timeout, prev_detect_count);
> }
> [...]
> if (!(sysctl_hung_task_detect_count - prev_detect_count))
> return;
>
> if (need_warning || hung_task_call_panic) {
> si_mask |= SYS_INFO_LOCKS;
>
> if (sysctl_hung_task_all_cpu_backtrace)
> si_mask |= SYS_INFO_ALL_BT;
> }
>
> sys_info(si_mask);
>
> if (hung_task_call_panic)
> panic("hung_task: blocked tasks");
> }
>
> Any race might cause false positives and even panic() !!!
> And the race window is rather big (checking all processes).
>
> This race can't be prevented by an atomic read/write.
> IMHO, the only solution would be to add some locking,
> e.g. mutex and take it around the entire
> check_hung_uninterruptible_tasks().
>
> On one hand, it might work. The code is called from
> a task context...
>
> But adding a lock into a lockup-detector code triggers
> some warning bells in my head.
>
> Honestly, I am not sure if it is worth it. I understand
> the motivation for the reset. But IMHO, even more important
> is to make sure that the watchdog works as expected.
Hi Petr,
Thank you for the feedback regarding the potential for false positives and
erroneous panics. You are absolutely correct that the race window is
substantial, given that it spans the entire duration of a full
process-thread iteration.
I believe we can resolve this race condition and prevent the integer
underflow without the overhead or risk of a mutex by leveraging hardware
atomicity and a slight adjustment to the delta logic.
Specifically, by converting sysctl_hung_task_detect_count to an
atomic_long_t, we ensure that the increment and reset operations are
indivisible at the CPU level. To handle the "reset mid-scan" issue you
highlighted, we can modify the delta calculation to be "reset-aware":
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index 01ce46a107b0..74725dc1efd0 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -17,6 +17,7 @@
#include <linux/export.h>
#include <linux/panic_notifier.h>
#include <linux/sysctl.h>
+#include <linux/atomic.h>
#include <linux/suspend.h>
#include <linux/utsname.h>
#include <linux/sched/signal.h>
@@ -36,7 +37,7 @@ static int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
/*
* Total number of tasks detected as hung since boot:
*/
-static unsigned long __read_mostly sysctl_hung_task_detect_count;
+static atomic_long_t sysctl_hung_task_detect_count = ATOMIC_LONG_INIT(0);
/*
* Limit number of tasks checked in a batch.
@@ -253,6 +254,7 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout,
unsigned long prev_detect_count)
{
unsigned long total_hung_task;
+ unsigned long current_detect;
if (!task_is_hung(t, timeout))
return;
@@ -261,9 +263,14 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout,
* This counter tracks the total number of tasks detected as hung
* since boot.
*/
- sysctl_hung_task_detect_count++;
+ atomic_long_inc(&sysctl_hung_task_detect_count);
+
+ current_detect = atomic_long_read(&sysctl_hung_task_detect_count);
+ if (current_detect >= prev_detect_count)
+ total_hung_task = current_detect - prev_detect_count;
+ else
+ total_hung_task = current_detect;
- total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
trace_sched_process_hang(t);
if (sysctl_hung_task_panic && total_hung_task >= sysctl_hung_task_panic) {
@@ -322,7 +329,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
int max_count = sysctl_hung_task_check_count;
unsigned long last_break = jiffies;
struct task_struct *g, *t;
- unsigned long prev_detect_count = sysctl_hung_task_detect_count;
+ unsigned long prev_detect_count = atomic_long_read(&sysctl_hung_task_detect_count);
int need_warning = sysctl_hung_task_warnings;
unsigned long si_mask = hung_task_si_mask;
@@ -350,7 +357,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
unlock:
rcu_read_unlock();
- if (!(sysctl_hung_task_detect_count - prev_detect_count))
+ if (!(atomic_long_read(&sysctl_hung_task_detect_count) - prev_detect_count))
return;
if (need_warning || hung_task_call_panic) {
@@ -391,10 +398,15 @@ static long hung_timeout_jiffies(unsigned long last_checked,
static int proc_dohung_task_detect_count(const struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
{
- if (!write)
- return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
+ if (!write) {
+ unsigned long count = atomic_long_read(&sysctl_hung_task_detect_count);
+ struct ctl_table proxy_table = *table;
+
+ proxy_table.data = &count;
+ return proc_doulongvec_minmax(&proxy_table, write, buffer, lenp, ppos);
+ }
- WRITE_ONCE(sysctl_hung_task_detect_count, 0);
+ atomic_long_set(&sysctl_hung_task_detect_count, 0);
*ppos += *lenp;
return 0;
@@ -480,7 +492,6 @@ static const struct ctl_table hung_task_sysctls[] = {
},
{
.procname = "hung_task_detect_count",
- .data = &sysctl_hung_task_detect_count,
.maxlen = sizeof(unsigned long),
.mode = 0644,
.proc_handler = proc_dohung_task_detect_count,
Please note that this updated approach has not been tested yet, but I
believe it addresses the core concerns while maintaining the architectural
integrity of the detector.
Kind regards,
--
Aaron Tomlin
On Thu 2025-12-18 22:09:57, Aaron Tomlin wrote:
> On Wed, Dec 17, 2025 at 02:09:14PM +0100, Petr Mladek wrote:
> > The counter is used to decide how many hung tasks were found
> > by a single check, see:
> >
> > Any race might cause false positives and even panic() !!!
> > And the race window is rather big (checking all processes).
> >
> > This race can't be prevented by an atomic read/write.
> > IMHO, the only solution would be to add some locking,
> > e.g. mutex and take it around the entire
> > check_hung_uninterruptible_tasks().
[...]
> Thank you for the feedback regarding the potential for false positives and
> erroneous panics. You are absolutely correct that the race window is
> substantial, given that it spans the entire duration of a full
> process-thread iteration.
>
> I believe we can resolve this race condition and prevent the integer
> underflow without the overhead or risk of a mutex by leveraging hardware
> atomicity and a slight adjustment to the delta logic.
>
> Specifically, by converting sysctl_hung_task_detect_count to an
> atomic_long_t, we ensure that the increment and reset operations are
> indivisible at the CPU level. To handle the "reset mid-scan" issue you
> highlighted, we can modify the delta calculation to be "reset-aware":
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index 01ce46a107b0..74725dc1efd0 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -17,6 +17,7 @@
> #include <linux/export.h>
> #include <linux/panic_notifier.h>
> #include <linux/sysctl.h>
> +#include <linux/atomic.h>
> #include <linux/suspend.h>
> #include <linux/utsname.h>
> #include <linux/sched/signal.h>
> @@ -36,7 +37,7 @@ static int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
> /*
> * Total number of tasks detected as hung since boot:
> */
> -static unsigned long __read_mostly sysctl_hung_task_detect_count;
> +static atomic_long_t sysctl_hung_task_detect_count = ATOMIC_LONG_INIT(0);
>
> /*
> * Limit number of tasks checked in a batch.
> @@ -253,6 +254,7 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout,
> unsigned long prev_detect_count)
> {
> unsigned long total_hung_task;
> + unsigned long current_detect;
>
> if (!task_is_hung(t, timeout))
> return;
> @@ -261,9 +263,14 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout,
> * This counter tracks the total number of tasks detected as hung
> * since boot.
> */
> - sysctl_hung_task_detect_count++;
> + atomic_long_inc(&sysctl_hung_task_detect_count);
> +
> + current_detect = atomic_long_read(&sysctl_hung_task_detect_count);
The two atomic_long_inc() and atomic_long_read() can be replaced
by atomic_long_inc_return().
> + if (current_detect >= prev_detect_count)
> + total_hung_task = current_detect - prev_detect_count;
> + else
> + total_hung_task = current_detect;
Interesting trick. The result may not be precise when the counter
is reset in the middle of the check. But it would prevent
calling panic() because of a wrong computation.
It should be acceptable. The panic() is needed when the lockup looks
permanent. And if the lockup is permanent then the right
total_hung_task number will be computed by the next check.
Please, add a comment into the code explaining why it is done this way
and why it is OK.
> - total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
> trace_sched_process_hang(t);
>
> if (sysctl_hung_task_panic && total_hung_task >= sysctl_hung_task_panic) {
> @@ -322,7 +329,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
> int max_count = sysctl_hung_task_check_count;
> unsigned long last_break = jiffies;
> struct task_struct *g, *t;
> - unsigned long prev_detect_count = sysctl_hung_task_detect_count;
> + unsigned long prev_detect_count = atomic_long_read(&sysctl_hung_task_detect_count);
I think that it is not super important in this case. But this is a kind of
locking, so I would use some proper annotation/barriers. IMHO, we
should use here:
atomic_long_read_acquire()
And the counter part would be
atomic_long_inc_return_release()
> int need_warning = sysctl_hung_task_warnings;
> unsigned long si_mask = hung_task_si_mask;
>
> @@ -350,7 +357,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
> unlock:
> rcu_read_unlock();
>
> - if (!(sysctl_hung_task_detect_count - prev_detect_count))
> + if (!(atomic_long_read(&sysctl_hung_task_detect_count) - prev_detect_count))
I guess that we should do the same check here. Otherwise, we might
print the sys_info() just because the counter has been reset
in the meantime.
And from the locking/barriers POV, we should use atomic_long_read_release() here.
> return;
>
> if (need_warning || hung_task_call_panic) {
> @@ -391,10 +398,15 @@ static long hung_timeout_jiffies(unsigned long last_checked,
> static int proc_dohung_task_detect_count(const struct ctl_table *table, int write,
> void *buffer, size_t *lenp, loff_t *ppos)
> {
> - if (!write)
> - return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
> + if (!write) {
> + unsigned long count = atomic_long_read(&sysctl_hung_task_detect_count);
> + struct ctl_table proxy_table = *table;
> +
> + proxy_table.data = &count;
> + return proc_doulongvec_minmax(&proxy_table, write, buffer, lenp, ppos);
If we are going to use the proxy table then we are already half way
to add the check of written value. Please, return -EINVAL when user
writes a non-zero value as I suggested at
https://lore.kernel.org/r/aUKmh0Gs8YH_iLbC@pathway.suse.cz
> + }
>
> - WRITE_ONCE(sysctl_hung_task_detect_count, 0);
> + atomic_long_set(&sysctl_hung_task_detect_count, 0);
> *ppos += *lenp;
>
> return 0;
>
> Please note that this updated approach has not been tested yet, but I
> believe it addresses the core concerns while maintaining the architectural
> integrity of the detector.
This approach looks acceptable to me.
I thought also about using sequence counters, see
Documentation/locking/seqlock.rst. It would allow to restart the scan
when the counter has been reset in the meantime. But I think that it
is not worth it. The counter is important only to decide whether
to call panic() or not. And it makes sense to call panic() only
when the stall is persistent. So, it is perfectly fine to wait
for the next scan.
Best Regards,
Petr
On 2025/12/19 20:03, Petr Mladek wrote:
> On Thu 2025-12-18 22:09:57, Aaron Tomlin wrote:
>> On Wed, Dec 17, 2025 at 02:09:14PM +0100, Petr Mladek wrote:
>>> The counter is used to decide how many hung tasks were found
>>> by a single check, see:
>>>
>>> Any race might cause false positives and even panic() !!!
>>> And the race window is rather big (checking all processes).
>>>
>>> This race can't be prevented by an atomic read/write.
>>> IMHO, the only solution would be to add some locking,
>>> e.g. mutex and take it around the entire
>>> check_hung_uninterruptible_tasks().
>
> [...]
>
>> Thank you for the feedback regarding the potential for false positives and
>> erroneous panics. You are absolutely correct that the race window is
>> substantial, given that it spans the entire duration of a full
>> process-thread iteration.
>>
>> I believe we can resolve this race condition and prevent the integer
>> underflow without the overhead or risk of a mutex by leveraging hardware
>> atomicity and a slight adjustment to the delta logic.
>>
>> Specifically, by converting sysctl_hung_task_detect_count to an
>> atomic_long_t, we ensure that the increment and reset operations are
>> indivisible at the CPU level. To handle the "reset mid-scan" issue you
>> highlighted, we can modify the delta calculation to be "reset-aware":
>
>> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
>> index 01ce46a107b0..74725dc1efd0 100644
>> --- a/kernel/hung_task.c
>> +++ b/kernel/hung_task.c
>> @@ -17,6 +17,7 @@
>> #include <linux/export.h>
>> #include <linux/panic_notifier.h>
>> #include <linux/sysctl.h>
>> +#include <linux/atomic.h>
>> #include <linux/suspend.h>
>> #include <linux/utsname.h>
>> #include <linux/sched/signal.h>
>> @@ -36,7 +37,7 @@ static int __read_mostly sysctl_hung_task_check_count = PID_MAX_LIMIT;
>> /*
>> * Total number of tasks detected as hung since boot:
>> */
>> -static unsigned long __read_mostly sysctl_hung_task_detect_count;
>> +static atomic_long_t atomic_long_t sysctl_hung_task_detect_count = ATOMIC_LONG_INIT(0);
Double atomic_long_t here.
>>
>> /*
>> * Limit number of tasks checked in a batch.
>> @@ -253,6 +254,7 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout,
>> unsigned long prev_detect_count)
>> {
>> unsigned long total_hung_task;
>> + unsigned long current_detect;
>>
>> if (!task_is_hung(t, timeout))
>> return;
>> @@ -261,9 +263,14 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout,
>> * This counter tracks the total number of tasks detected as hung
>> * since boot.
>> */
>> - sysctl_hung_task_detect_count++;
>> + atomic_long_inc(&sysctl_hung_task_detect_count);
>> +
>> + current_detect = atomic_long_read(&sysctl_hung_task_detect_count);
>
> The two atomic_long_inc() and atomic_long_read() can be replaced
> by atomic_long_inc_return().
+1, cleaner and potentially cheaper on some archs :)
>
>> + if (current_detect >= prev_detect_count)
>> + total_hung_task = current_detect - prev_detect_count;
>> + else
>> + total_hung_task = current_detect;
>
> Interesting trick. The result may not be precise when the counter
> is reset in the middle of the check. But it would prevent
> calling panic() because of a wrong computation.
Clever! If it's a real persistent hang, we'll catch it next
time anyway.
But yeah, add a comment explaining that we're ok with
undercounting one scan if counter gets reset mid-flight.
Future readers will thank you ;)
>
> It should be acceptable. The panic() is needed when the lockup looks
> permanent. And if the lockup is permanent then the right
> total_hung_task number will be computed by the next check.
>
> Please, add a comment into the code explaining why it is done this way
> and why it is OK.
>
>> - total_hung_task = sysctl_hung_task_detect_count - prev_detect_count;
>> trace_sched_process_hang(t);
>>
>> if (sysctl_hung_task_panic && total_hung_task >= sysctl_hung_task_panic) {
>> @@ -322,7 +329,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>> int max_count = sysctl_hung_task_check_count;
>> unsigned long last_break = jiffies;
>> struct task_struct *g, *t;
>> - unsigned long prev_detect_count = sysctl_hung_task_detect_count;
>> + unsigned long prev_detect_count = atomic_long_read(&sysctl_hung_task_detect_count);
>
> I think that it is not super important in this case. But this is a kind of
> locking, so I would use some proper annotation/barriers. IMHO, we
> should use here:
Right, memory ordering matters!
>
> atomic_long_read_acquire()
>
> And the counter part would be
>
> atomic_long_inc_return_release()
>
>> int need_warning = sysctl_hung_task_warnings;
>> unsigned long si_mask = hung_task_si_mask;
>>
>> @@ -350,7 +357,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
>> unlock:
>> rcu_read_unlock();
>>
>> - if (!(sysctl_hung_task_detect_count - prev_detect_count))
>> + if (!(atomic_long_read(&sysctl_hung_task_detect_count) - prev_detect_count))
>
> I guess that we should do the same check here. Otherwise, we might
> print the sys_info() just because the counter has been reset
> in the meantime.
Good catch. Same underflow check should apply here.
>
> And from the locking/barriers POV, we should use atomic_long_read_release() here.
>
>> return;
>>
>> if (need_warning || hung_task_call_panic) {
>> @@ -391,10 +398,15 @@ static long hung_timeout_jiffies(unsigned long last_checked,
>> static int proc_dohung_task_detect_count(const struct ctl_table *table, int write,
>> void *buffer, size_t *lenp, loff_t *ppos)
>> {
>> - if (!write)
>> - return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
>> + if (!write) {
>> + unsigned long count = atomic_long_read(&sysctl_hung_task_detect_count);
>> + struct ctl_table proxy_table = *table;
>> +
>> + proxy_table.data = &count;
>> + return proc_doulongvec_minmax(&proxy_table, write, buffer, lenp, ppos);
>
> If we are going to use the proxy table then we are already half way
> to add the check of written value. Please, return -EINVAL when user
> writes a non-zero value as I suggested at
> https://lore.kernel.org/r/aUKmh0Gs8YH_iLbC@pathway.suse.cz
Explicit is better. Echo 0 to reset, anything else is an error :)
>
> + }
>>
>> - WRITE_ONCE(sysctl_hung_task_detect_count, 0);
>> + atomic_long_set(&sysctl_hung_task_detect_count, 0);
>> *ppos += *lenp;
>>
>> return 0;
>>
>> Please note that this updated approach has not been tested yet, but I
>> believe it addresses the core concerns while maintaining the architectural
>> integrity of the detector.
>
> This approach looks acceptable to me.
Yeah, I think this is the way to go. Much better than adding a
lock to hung task detector ...
>
> I thought also about using sequence counters, see
> Documentation/locking/seqlock.rst. It would allow to restart the scan
> when the counter has been reset in the meantime. But I think that it
> is not worth it. The counter is important only to decide whether
> to call panic() or not. And it makes sense to call panic() only
> when the stall is persistent. So, it is perfectly fine to wait
> for the next scan.
>
> Best Regards,
> Petr
On Fri, Dec 19, 2025 at 10:15:14PM +0800, Lance Yang wrote:
> Clever! If it's a real persistent hang, we'll catch it next
> time anyway.
>
> But yeah, add a comment explaining that we're ok with
> undercounting one scan if counter gets reset mid-flight.
>
> Future readers will thank you ;)

Hi Petr, Lance,

Thank you for the feedback.

> > It should be acceptable. The panic() is needed when the lockup looks
> > permanent. And if the lockup is permanent then the right
> > total_hung_task number will be computed by the next check.
> >
> > Please, add a comment into the code explaining why it is done this way
> > and why it is OK.

Acknowledged.

> Right, memory ordering matters!

Agreed.

> Explicit is better. Echo 0 to reset, anything else is an error :)

Acknowledged.

> Yeah, I think this is the way to go. Much better than adding a
> lock to hung task detector ...

Agreed. I will prepare a new series.

Kind regards,

--
Aaron Tomlin
On 2025/12/17 21:09, Petr Mladek wrote:
> On Wed 2025-12-17 15:31:25, Lance Yang wrote:
>> On 2025/12/16 11:00, Aaron Tomlin wrote:
>>> Introduce support for writing to /proc/sys/kernel/hung_task_detect_count.
>>>
>>> Writing any value to this file atomically resets the counter of detected
>>> hung tasks to zero. This grants system administrators the ability to clear
>>> the cumulative diagnostic history after resolving an incident, simplifying
>>> monitoring without requiring a system restart.
>>
>>> --- a/Documentation/admin-guide/sysctl/kernel.rst
>>> +++ b/Documentation/admin-guide/sysctl/kernel.rst
>>> @@ -418,7 +418,7 @@ hung_task_detect_count
>>> ======================
>>> Indicates the total number of tasks that have been detected as hung since
>>> -the system boot.
>>> +the system boot. The counter can be reset to zero when written to.
>>> This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
>>> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
>>> index 5902573200c0..01ce46a107b0 100644
>>> --- a/kernel/hung_task.c
>>> +++ b/kernel/hung_task.c
>>> @@ -375,6 +375,31 @@ static long hung_timeout_jiffies(unsigned long last_checked,
>>> }
>>> #ifdef CONFIG_SYSCTL
>>> +
>>> +/**
>>> + * proc_dohung_task_detect_count - proc handler for hung_task_detect_count
>>> + * @table: Pointer to the struct ctl_table definition for this proc entry
>>> + * @write: Flag indicating the operation
>>> + * @buffer: User space buffer for data transfer
>>> + * @lenp: Pointer to the length of the data being transferred
>>> + * @ppos: Pointer to the current file offset
>>> + *
>>> + * This handler is used for reading the current hung task detection count
>>> + * and for resetting it to zero when a write operation is performed.
>>> + * Returns 0 on success or a negative error code on failure.
>>> + */
>>> +static int proc_dohung_task_detect_count(const struct ctl_table *table, int write,
>>> + void *buffer, size_t *lenp, loff_t *ppos)
>>> +{
>>> + if (!write)
>>> + return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
>>> +
>>> + WRITE_ONCE(sysctl_hung_task_detect_count, 0);
>>
>> The reset uses WRITE_ONCE() but the increment in check_hung_task()
>> is plain ++; it's just a stat counter so fine though ;)
>
> I was about to wave this away as well. But I am afraid that
> it is even more complicated.
Good catch!
>
> The counter is used to decide how many hung tasks were found
> by a single check, see:
>
> static void check_hung_uninterruptible_tasks(unsigned long timeout)
> {
> [...]
> unsigned long prev_detect_count = sysctl_hung_task_detect_count;
> [...]
> for_each_process_thread(g, t) {
> [...]
> check_hung_task(t, timeout, prev_detect_count);
> }
> [...]
> if (!(sysctl_hung_task_detect_count - prev_detect_count))
> return;
>
> if (need_warning || hung_task_call_panic) {
> si_mask |= SYS_INFO_LOCKS;
>
> if (sysctl_hung_task_all_cpu_backtrace)
> si_mask |= SYS_INFO_ALL_BT;
> }
>
> sys_info(si_mask);
>
> if (hung_task_call_panic)
> panic("hung_task: blocked tasks");
> }
>
> Any race might cause false positives and even panic() !!!
> And the race window is rather big (checking all processes).
Right, I completely missed the dependency on prev_detect_count.
Messing with that counter while the loop is running is really
dangerous ...
>
> This race can't be prevented by an atomic read/write.
> IMHO, the only solution would be to add some locking,
> e.g. mutex and take it around the entire
> check_hung_uninterruptible_tasks().
>
> On one hand, it might work. The code is called from
> a task context...
>
> But adding a lock into a lockup-detector code triggers
> some warning bells in my head.
>
> Honestly, I am not sure if it is worth it. I understand
> the motivation for the reset. But IMHO, even more important
> is to make sure that the watchdog works as expected.
Given that adding a lock to the detector path is undesirable,
the risk clearly outweighs the benefit :)
Thanks for saving me from this pitfall, Petr!