At present, the hung task detector behaves in an unoptimised manner: it
wakes up periodically (every check_interval_secs, defaulting to 120
seconds) and performs an O(N) scan of the entire process list,
regardless of the system's actual state. On idle embedded devices,
virtual machines, or large servers with no activity, this behaviour
unnecessarily consumes CPU cycles and memory bandwidth, hindering
power-saving states.
To rectify this, the patch introduces an adaptive "green" polling
mechanism: the detector now checks whether the system is effectively
idle before committing to a full process scan.
To implement this, we utilise the standard get_avenrun() API to verify
the global system load. Tasks in the TASK_UNINTERRUPTIBLE (D) state
explicitly contribute to the system load average; consequently, if the
1-minute load average is zero, we can confidently infer that no tasks
are currently hung, allowing us to bypass the expensive process scan.
Crucially, we invoke get_avenrun(load, 0, 0) with both the offset and
shift parameters set to zero. This configuration is deliberate and
necessary for safety:
1. Zero Offset: Prevents the application of any artificial
rounding bias usually intended for human-readable display.
2. Zero Shift: Retrieves the raw fixed-point value (where 1.0
load = 2048) rather than shifting it down to an integer.
This ensures maximum sensitivity: even a microscopic fractional load
(e.g., a single task entering D state momentarily) will register as a
non-zero raw value. This guarantees that we never encounter a false
negative where a valid hung task is ignored due to integer truncation or
rounding errors.
This heuristic significantly minimises the detector's footprint on
healthy systems whilst maintaining robust reliability for genuine hangs.
Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
---
kernel/hung_task.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index d2254c91450b..7b9f5c1bd35e 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -17,6 +17,7 @@
#include <linux/export.h>
#include <linux/panic_notifier.h>
#include <linux/sysctl.h>
+#include <linux/sched/loadavg.h>
#include <linux/suspend.h>
#include <linux/utsname.h>
#include <linux/sched/signal.h>
@@ -503,6 +504,7 @@ static int watchdog(void *dummy)
for ( ; ; ) {
unsigned long timeout = sysctl_hung_task_timeout_secs;
unsigned long interval = sysctl_hung_task_check_interval_secs;
+ unsigned long load[3];
long t;

if (interval == 0)
@@ -511,8 +513,12 @@ static int watchdog(void *dummy)
t = hung_timeout_jiffies(hung_last_checked, interval);
if (t <= 0) {
if (!atomic_xchg(&reset_hung_task, 0) &&
- !hung_detector_suspended)
- check_hung_uninterruptible_tasks(timeout);
+ !hung_detector_suspended) {
+ /* Check 1-min load to detect idle system */
+ get_avenrun(load, 0, 0);
+ if (load[0] > 0)
+ check_hung_uninterruptible_tasks(timeout);
+ }
hung_last_checked = jiffies;
continue;
}
--
2.51.0
Hi Aaron,
Keep one patch or series under review at a time, especially in the
same subsystem ...
Maintainers/Reviewers have limited bandwidth and can focus better
on one thing at a time.
Please, be patient! Just wait for it to be merged or rejected before
sending the next.
On 2026/1/26 11:45, Aaron Tomlin wrote:
> At present, the hung task detector behaves in an unoptimised manner: it
> wakes up periodically (every check_interval_secs, defaulting to 120
> seconds) and performs an O(N) scan of the entire process list,
> regardless of the system's actual state. On idle embedded devices,
> virtual machines, or large servers with no activity, this behaviour
> unnecessarily consumes CPU cycles and memory bandwidth, hindering
> power-saving states.
>
> To rectify this, this patch introduces an adaptive "green" polling
> mechanism. The detector will now verify whether the system is
> effectively idle before committing to a full process scan.
>
> To implement this, we utilise the standard get_avenrun() API to verify
> the global system load. Tasks in the TASK_UNINTERRUPTIBLE (D) state
> explicitly contribute to the system load average; consequently, if the
> 1-minute load average is zero, we can confidently infer that no tasks
> are currently hung, allowing us to bypass the expensive process scan.
>
> Crucially, we invoke get_avenrun(load, 0, 0) with both the offset and
> shift parameters set to zero. This configuration is deliberate and
> necessary for safety:
>
> 1. Zero Offset: Prevents the application of any artificial
> rounding bias usually intended for human-readable display.
>
> 2. Zero Shift: Retrieves the raw fixed-point value (where 1.0
> load = 2048) rather than shifting it down to an integer.
>
> This ensures maximum sensitivity: even a microscopic fractional load
> (e.g., a single task entering D state momentarily) will register as a
> non-zero raw value. This guarantees that we never encounter a false
> negative where a valid hung task is ignored due to integer truncation or
> rounding errors.
>
> This heuristic significantly minimises the detector's footprint on
> healthy systems whilst maintaining robust reliability for genuine hangs.
>
> Signed-off-by: Aaron Tomlin <atomlin@atomlin.com>
> ---
> kernel/hung_task.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index d2254c91450b..7b9f5c1bd35e 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -17,6 +17,7 @@
> #include <linux/export.h>
> #include <linux/panic_notifier.h>
> #include <linux/sysctl.h>
> +#include <linux/sched/loadavg.h>
> #include <linux/suspend.h>
> #include <linux/utsname.h>
> #include <linux/sched/signal.h>
> @@ -503,6 +504,7 @@ static int watchdog(void *dummy)
> for ( ; ; ) {
> unsigned long timeout = sysctl_hung_task_timeout_secs;
> unsigned long interval = sysctl_hung_task_check_interval_secs;
> + unsigned long load[3];
> long t;
>
> if (interval == 0)
> @@ -511,8 +513,12 @@ static int watchdog(void *dummy)
> t = hung_timeout_jiffies(hung_last_checked, interval);
> if (t <= 0) {
> if (!atomic_xchg(&reset_hung_task, 0) &&
> - !hung_detector_suspended)
> - check_hung_uninterruptible_tasks(timeout);
> + !hung_detector_suspended) {
> + /* Check 1-min load to detect idle system */
> + get_avenrun(load, 0, 0);
> + if (load[0] > 0)
> + check_hung_uninterruptible_tasks(timeout);
The optimization is not worth the trouble.
I don't think the assumption that "load[0] == 0 means no hung tasks" is
100% correct.
So that would miss actual hung tasks - a false negative, which is worse
than the "wasted scan" you're trying to avoid.
Also, I don't *really* care about optimizing something that runs once
every 120 seconds :)
Nacked-by: Lance Yang <lance.yang@linux.dev>
On Mon, Jan 26, 2026 at 01:23:01PM +0800, Lance Yang wrote:
> Hi Aaron,
Hi Lance,
> Keep one patch or series under review at a time, especially in the
> same subsystem ...
Understood. That's fair.
> > @@ -503,6 +504,7 @@ static int watchdog(void *dummy)
> > for ( ; ; ) {
> > unsigned long timeout = sysctl_hung_task_timeout_secs;
> > unsigned long interval = sysctl_hung_task_check_interval_secs;
> > + unsigned long load[3];
> > long t;
> > if (interval == 0)
> > @@ -511,8 +513,12 @@ static int watchdog(void *dummy)
> > t = hung_timeout_jiffies(hung_last_checked, interval);
> > if (t <= 0) {
> > if (!atomic_xchg(&reset_hung_task, 0) &&
> > - !hung_detector_suspended)
> > - check_hung_uninterruptible_tasks(timeout);
> > + !hung_detector_suspended) {
> > + /* Check 1-min load to detect idle system */
> > + get_avenrun(load, 0, 0);
> > + if (load[0] > 0)
> > + check_hung_uninterruptible_tasks(timeout);
>
> The optimization is not worth the trouble.
>
> I don't think the assumption that "load[0] == 0 means no hung tasks" is
> 100% correct.
>
> So that would miss actual hung tasks - a false negative, which is worse
> than the "wasted scan" you're trying to avoid.
>
> Also, I don't *really* care about optimizing something that runs once
> every 120 seconds :)
>
> Nacked-by: Lance Yang <lance.yang@linux.dev>
Yes, please ignore. This is indeed wrong.
Regarding the value of the optimisation, while a 120-second interval
implies a low frequency, the cost of the scan is O(N). On large servers
with high thread counts (even if idle), iterating the entire task list
dirties cache lines and consumes memory bandwidth unnecessarily.
Nevertheless, we currently do not have a way to economically compute the
total number of tasks in TASK_UNINTERRUPTIBLE state.
Kind regards,
--
Aaron Tomlin
On Mon 2026-01-26 15:14:27, Aaron Tomlin wrote:
> On Mon, Jan 26, 2026 at 01:23:01PM +0800, Lance Yang wrote:
> > Hi Aaron,
>
> Hi Lance,
>
> > Keep one patch or series under review at a time, especially in the
> > same subsystem ...
+1 :-)
> Understood. That's fair.
>
> > > @@ -503,6 +504,7 @@ static int watchdog(void *dummy)
> > > for ( ; ; ) {
> > > unsigned long timeout = sysctl_hung_task_timeout_secs;
> > > unsigned long interval = sysctl_hung_task_check_interval_secs;
> > > + unsigned long load[3];
> > > long t;
> > > if (interval == 0)
> > > @@ -511,8 +513,12 @@ static int watchdog(void *dummy)
> > > t = hung_timeout_jiffies(hung_last_checked, interval);
> > > if (t <= 0) {
> > > if (!atomic_xchg(&reset_hung_task, 0) &&
> > > - !hung_detector_suspended)
> > > - check_hung_uninterruptible_tasks(timeout);
> > > + !hung_detector_suspended) {
> > > + /* Check 1-min load to detect idle system */
> > > + get_avenrun(load, 0, 0);
> > > + if (load[0] > 0)
> > > + check_hung_uninterruptible_tasks(timeout);
> >
> > The optimization is not worth the trouble.
> >
> > I don't think the assumption that "load[0] == 0 means no hung tasks" is
> > 100% correct.
> >
> > So that would miss actual hung tasks - a false negative, which is worse
> > than the "wasted scan" you're trying to avoid.
> >
> > Also, I don't *really* care about optimizing something that runs once
> > every 120 seconds :)
> >
> > Nacked-by: Lance Yang <lance.yang@linux.dev>
>
> Yes, please ignore. This is indeed wrong.
>
> Regarding the value of the optimisation, while a 120-second interval
> implies a low frequency, the cost of the scan is O(N). On large servers
> with high thread counts (even if idle), iterating the entire task list
> dirties cache lines and consumes memory bandwidth unnecessarily.
>
> Nevertheless, we currently do not have a way to economically compute the
> total number of tasks in TASK_UNINTERRUPTIBLE state.
It makes some sense. And the check of the average load is trivial
so it might be acceptable.
But I somehow doubt that it works. Have you ever seen a system with
(avenrun[0] == 0)? IMHO, it might be pretty hard to achieve it.
Or maybe I am too pessimistic. Or are there embedded systems which can
only be woken by some interrupt from a sensor? Do embedded systems
run the hung task detector?
In other words: is this patch solving a theoretical scenario?
Did you test it in practice, please?
Best Regards,
Petr