[PATCH v2 4/5] workqueue: Show all busy workers in stall diagnostics

Posted by Breno Leitao 1 month ago
show_cpu_pool_hog() only prints workers whose task is currently running
on the CPU (task_is_running()).  This misses workers that are busy
processing a work item but are sleeping or blocked — for example, a
worker that clears PF_WQ_WORKER and enters wait_event_idle().  Such a
worker still occupies a pool slot and prevents progress, yet produces
an empty backtrace section in the watchdog output.

This is happening on real arm64 systems, where
toggle_allocation_gate() IPIs every single CPU in the machine (which
lacks NMI), causing workqueue stalls that show empty backtraces because
toggle_allocation_gate() is sleeping in wait_event_idle().

Remove the task_is_running() filter so every in-flight worker in the
pool's busy_hash is dumped.  The busy_hash is protected by pool->lock,
which is already held.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/workqueue.c | 28 +++++++++++++---------------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 56d8af13843f8..09b9ad78d566c 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -7583,9 +7583,9 @@ MODULE_PARM_DESC(panic_on_stall_time, "Panic if stall exceeds this many seconds
 
 /*
  * Show workers that might prevent the processing of pending work items.
- * The only candidates are CPU-bound workers in the running state.
- * Pending work items should be handled by another idle worker
- * in all other situations.
+ * A busy worker that is not running on the CPU (e.g. sleeping in
+ * wait_event_idle() with PF_WQ_WORKER cleared) can stall the pool just as
+ * effectively as a CPU-bound one, so dump every in-flight worker.
  */
 static void show_cpu_pool_hog(struct worker_pool *pool)
 {
@@ -7596,19 +7596,17 @@ static void show_cpu_pool_hog(struct worker_pool *pool)
 	raw_spin_lock_irqsave(&pool->lock, irq_flags);
 
 	hash_for_each(pool->busy_hash, bkt, worker, hentry) {
-		if (task_is_running(worker->task)) {
-			/*
-			 * Defer printing to avoid deadlocks in console
-			 * drivers that queue work while holding locks
-			 * also taken in their write paths.
-			 */
-			printk_deferred_enter();
+		/*
+		 * Defer printing to avoid deadlocks in console
+		 * drivers that queue work while holding locks
+		 * also taken in their write paths.
+		 */
+		printk_deferred_enter();
 
-			pr_info("pool %d:\n", pool->id);
-			sched_show_task(worker->task);
+		pr_info("pool %d:\n", pool->id);
+		sched_show_task(worker->task);
 
-			printk_deferred_exit();
-		}
+		printk_deferred_exit();
 	}
 
 	raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
@@ -7619,7 +7617,7 @@ static void show_cpu_pools_hogs(void)
 	struct worker_pool *pool;
 	int pi;
 
-	pr_info("Showing backtraces of running workers in stalled CPU-bound worker pools:\n");
+	pr_info("Showing backtraces of busy workers in stalled CPU-bound worker pools:\n");
 
 	rcu_read_lock();
 

-- 
2.47.3

Re: [PATCH v2 4/5] workqueue: Show all busy workers in stall diagnostics
Posted by Petr Mladek 4 weeks ago
On Thu 2026-03-05 08:15:40, Breno Leitao wrote:
> show_cpu_pool_hog() only prints workers whose task is currently running
> on the CPU (task_is_running()).  This misses workers that are busy
> processing a work item but are sleeping or blocked — for example, a
> worker that clears PF_WQ_WORKER and enters wait_event_idle().

IMHO, it is misleading. AFAIK, workers clear PF_WQ_WORKER flag only
when they are going to die. They never do so when going to sleep.

> Such a
> worker still occupies a pool slot and prevents progress, yet produces
> an empty backtrace section in the watchdog output.
> 
> This is happening on real arm64 systems, where
> toggle_allocation_gate() IPIs every single CPU in the machine (which
> lacks NMI), causing workqueue stalls that show empty backtraces because
> toggle_allocation_gate() is sleeping in wait_event_idle().

The wait_event_idle() called in toggle_allocation_gate() should not
cause a stall. The scheduler should call wq_worker_sleeping(tsk)
and wake up another idle worker. It should guarantee the progress.

> Remove the task_is_running() filter so every in-flight worker in the
> pool's busy_hash is dumped.  The busy_hash is protected by pool->lock,
> which is already held.

As I explained in reply to the cover letter, sleeping workers should
not block forward progress. It seems that in this case, the system was
not able to wake up the other idle worker or it was the last idle
worker and was not able to fork a new one.

IMHO, we should warn about this when there is no running worker.
It might be more useful than printing backtraces of the sleeping
workers because they likely did not cause the problem.

I believe that the problem, in this particular situation, is that
the system can't schedule or fork new processes. It might help
to warn about it and maybe show backtrace of the currently
running process on the stalled CPU.

Anyway, I think we could do better here. And blindly printing backtraces
from all workers would do more harm than good in most situations.

Best Regards,
Petr
Re: [PATCH v2 4/5] workqueue: Show all busy workers in stall diagnostics
Posted by Breno Leitao 3 weeks, 6 days ago
On Thu, Mar 12, 2026 at 06:03:03PM +0100, Petr Mladek wrote:
> On Thu 2026-03-05 08:15:40, Breno Leitao wrote:
> > show_cpu_pool_hog() only prints workers whose task is currently running
> > on the CPU (task_is_running()).  This misses workers that are busy
> > processing a work item but are sleeping or blocked — for example, a
> > worker that clears PF_WQ_WORKER and enters wait_event_idle().
> 
> IMHO, it is misleading. AFAIK, workers clear PF_WQ_WORKER flag only
> when they are going to die. They never do so when going to sleep.
> 
> > Such a
> > worker still occupies a pool slot and prevents progress, yet produces
> > an empty backtrace section in the watchdog output.
> > 
> > This is happening on real arm64 systems, where
> > toggle_allocation_gate() IPIs every single CPU in the machine (which
> > lacks NMI), causing workqueue stalls that show empty backtraces because
> > toggle_allocation_gate() is sleeping in wait_event_idle().
> 
> The wait_event_idle() called in toggle_allocation_gate() should not
> cause a stall. The scheduler should call wq_worker_sleeping(tsk)
> and wake up another idle worker. It should guarantee the progress.
> 
> > Remove the task_is_running() filter so every in-flight worker in the
> > pool's busy_hash is dumped.  The busy_hash is protected by pool->lock,
> > which is already held.
> 
> As I explained in reply to the cover letter, sleeping workers should
> not block forward progress. It seems that in this case, the system was
> not able to wake up the other idle worker or it was the last idle
> worker and was not able to fork a new one.
> 
> IMHO, we should warn about this when there is no running worker.
> It might be more useful than printing backtraces of the sleeping
> workers because they likely did not cause the problem.
> 
> I believe that the problem, in this particular situation, is that
> the system can't schedule or fork new processes. It might help
> to warn about it and maybe show backtrace of the currently
> running process on the stalled CPU.

Do you mean checking if pool->busy_hash is empty, and then warning?

Commit fc36ad49ce7160907bcbe4f05c226595611ac293
Author: Breno Leitao <leitao@debian.org>
Date:   Fri Mar 13 05:35:02 2026 -0700

    workqueue: warn when stalled pool has no running workers

    When the workqueue watchdog detects a pool stall and the pool's
    busy_hash is empty (no workers executing any work item), print a
    diagnostic warning with the pool state and trigger a backtrace of
    the currently running task on the stalled CPU.

    Signed-off-by: Breno Leitao <leitao@debian.org>
    Suggested-by: Petr Mladek <pmladek@suse.com>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 6ee52ba9b14f7..d538067754123 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -7655,6 +7655,17 @@ static void show_cpu_pool_busy_workers(struct worker_pool *pool)

        raw_spin_lock_irqsave(&pool->lock, irq_flags);

+       if (hash_empty(pool->busy_hash)) {
+               raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
+
+               pr_info("pool %d: no running workers, cpu=%d is %s (nr_workers=%d nr_idle=%d)\n",
+                       pool->id, pool->cpu,
+                       idle_cpu(pool->cpu) ? "idle" : "busy",
+                       pool->nr_workers, pool->nr_idle);
+               trigger_single_cpu_backtrace(pool->cpu);
+               return;
+       }
+
        hash_for_each(pool->busy_hash, bkt, worker, hentry) {
                if (task_is_running(worker->task)) {
                        /*
Re: [PATCH v2 4/5] workqueue: Show all busy workers in stall diagnostics
Posted by Petr Mladek 3 weeks, 6 days ago
On Fri 2026-03-13 05:57:59, Breno Leitao wrote:
> On Thu, Mar 12, 2026 at 06:03:03PM +0100, Petr Mladek wrote:
> > On Thu 2026-03-05 08:15:40, Breno Leitao wrote:
> > > show_cpu_pool_hog() only prints workers whose task is currently running
> > > on the CPU (task_is_running()).  This misses workers that are busy
> > > processing a work item but are sleeping or blocked — for example, a
> > > worker that clears PF_WQ_WORKER and enters wait_event_idle().
> > 
> > IMHO, it is misleading. AFAIK, workers clear PF_WQ_WORKER flag only
> > when they are going to die. They never do so when going to sleep.
> > 
> > > Such a
> > > worker still occupies a pool slot and prevents progress, yet produces
> > > an empty backtrace section in the watchdog output.
> > > 
> > > This is happening on real arm64 systems, where
> > > toggle_allocation_gate() IPIs every single CPU in the machine (which
> > > lacks NMI), causing workqueue stalls that show empty backtraces because
> > > toggle_allocation_gate() is sleeping in wait_event_idle().
> > 
> > The wait_event_idle() called in toggle_allocation_gate() should not
> > cause a stall. The scheduler should call wq_worker_sleeping(tsk)
> > and wake up another idle worker. It should guarantee the progress.
> > 
> > > Remove the task_is_running() filter so every in-flight worker in the
> > > pool's busy_hash is dumped.  The busy_hash is protected by pool->lock,
> > > which is already held.
> > 
> > As I explained in reply to the cover letter, sleeping workers should
> > not block forward progress. It seems that in this case, the system was
> > not able to wake up the other idle worker or it was the last idle
> > worker and was not able to fork a new one.
> > 
> > IMHO, we should warn about this when there is no running worker.
> > It might be more useful than printing backtraces of the sleeping
> > workers because they likely did not cause the problem.
> > 
> > I believe that the problem, in this particular situation, is that
> > the system can't schedule or fork new processes. It might help
> > to warn about it and maybe show backtrace of the currently
> > running process on the stalled CPU.
> 
> Do you mean checking if pool->busy_hash is empty, and then warning?
> 
> Commit fc36ad49ce7160907bcbe4f05c226595611ac293
> Author: Breno Leitao <leitao@debian.org>
> Date:   Fri Mar 13 05:35:02 2026 -0700
> 
>     workqueue: warn when stalled pool has no running workers
> 
>     When the workqueue watchdog detects a pool stall and the pool's
>     busy_hash is empty (no workers executing any work item), print a
>     diagnostic warning with the pool state and trigger a backtrace of
>     the currently running task on the stalled CPU.
> 
>     Signed-off-by: Breno Leitao <leitao@debian.org>
>     Suggested-by: Petr Mladek <pmladek@suse.com>
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 6ee52ba9b14f7..d538067754123 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -7655,6 +7655,17 @@ static void show_cpu_pool_busy_workers(struct worker_pool *pool)
> 
>         raw_spin_lock_irqsave(&pool->lock, irq_flags);
> 
> +       if (hash_empty(pool->busy_hash)) {

This would print it only when there is no in-flight work.

But I think that the problem is when there is no worker in
the running state. There should always be one to guarantee
forward progress.

I took inspiration from your patch. This is what comes to my mind
on top of the current master (printing only running workers):

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index aeaec79bc09c..a044c7e42139 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -7588,12 +7588,15 @@ static void show_cpu_pool_hog(struct worker_pool *pool)
 {
 	struct worker *worker;
 	unsigned long irq_flags;
+	bool found_running;
 	int bkt;
 
 	raw_spin_lock_irqsave(&pool->lock, irq_flags);
 
+	found_running = false;
 	hash_for_each(pool->busy_hash, bkt, worker, hentry) {
 		if (task_is_running(worker->task)) {
+			found_running = true;
 			/*
 			 * Defer printing to avoid deadlocks in console
 			 * drivers that queue work while holding locks
@@ -7609,6 +7612,19 @@ static void show_cpu_pool_hog(struct worker_pool *pool)
 	}
 
 	raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
+
+	if (!found_running) {
+		pr_info("pool %d: no worker in running state, cpu=%d is %s (nr_workers=%d nr_idle=%d)\n",
+			pool->id, pool->cpu,
+			idle_cpu(pool->cpu) ? "idle" : "busy",
+			pool->nr_workers, pool->nr_idle);
+		pr_info("The pool might have troubles to wake up another idle worker.\n");
+		if (pool->manager) {
+			pr_info("Backtrace of the pool manager:\n");
+			sched_show_task(pool->manager->task);
+		}
+		trigger_single_cpu_backtrace(pool->cpu);
+	}
 }
 
 static void show_cpu_pools_hogs(void)


Warning: The code is not safe. We would need to add some synchronization
	 of the pool->manager pointer.

	Even better might be to print the state and backtrace of the process
	which was woken by kick_pool() when the last running worker
	went to sleep.

Motivation: AFAIK, if there is pending work in a CPU-bound workqueue,
	then at least one worker in the related worker pool should be
	in "task_is_running()" state to guarantee forward progress.

	If we find a running worker then it is likely the culprit.
	It either runs for too long, or it is the last idle worker
	and fails to create a new one.

	If there is no worker in the running state then there is likely
	a problem in the core workqueue code, or some work item shot
	the workqueue in the foot. Anyway, we might need to print
	many more details to nail it down.

Best Regards,
Petr
Re: [PATCH v2 4/5] workqueue: Show all busy workers in stall diagnostics
Posted by Breno Leitao 3 weeks, 1 day ago
Hello Petr,

On Fri, Mar 13, 2026 at 05:27:40PM +0100, Petr Mladek wrote:
> I took inspiration from your patch. This is what comes to my mind
> on top of the current master (printing only running workers):
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index aeaec79bc09c..a044c7e42139 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -7588,12 +7588,15 @@ static void show_cpu_pool_hog(struct worker_pool *pool)
>  {
>  	struct worker *worker;
>  	unsigned long irq_flags;
> +	bool found_running;
>  	int bkt;
>
>  	raw_spin_lock_irqsave(&pool->lock, irq_flags);
>
> +	found_running = false;
>  	hash_for_each(pool->busy_hash, bkt, worker, hentry) {
>  		if (task_is_running(worker->task)) {
> +			found_running = true;
>  			/*
>  			 * Defer printing to avoid deadlocks in console
>  			 * drivers that queue work while holding locks
> @@ -7609,6 +7612,19 @@ static void show_cpu_pool_hog(struct worker_pool *pool)
>  	}
>
>  	raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
> +
> +	if (!found_running) {
> +		pr_info("pool %d: no worker in running state, cpu=%d is %s (nr_workers=%d nr_idle=%d)\n",
> +			pool->id, pool->cpu,
> +			idle_cpu(pool->cpu) ? "idle" : "busy",
> +			pool->nr_workers, pool->nr_idle);
> +		pr_info("The pool might have troubles to wake up another idle worker.\n");
> +		if (pool->manager) {
> +			pr_info("Backtrace of the pool manager:\n");
> +			sched_show_task(pool->manager->task);
> +		}
> +		trigger_single_cpu_backtrace(pool->cpu);
> +	}
>  }
>
>  static void show_cpu_pools_hogs(void)
>
>
> Warning: The code is not safe. We would need to add some synchronization
> 	 of the pool->manager pointer.
>
> 	Even better might be to print the state and backtrace of the process
> 	which was woken by kick_pool() when the last running worker
> 	went to sleep.

I agree. We should probably store the last woken worker in the worker_pool
structure and print it later.

I've spent some time verifying that the locking and lifecycle management are
correct. While I'm not completely certain, I believe it's getting closer. An
extra pair of eyes would be helpful.

This is the new version of this patch:

commit feccca7e696ead3272669ee4d4dc02b6946d0faf
Author: Breno Leitao <leitao@debian.org>
Date:   Mon Mar 16 09:47:09 2026 -0700

    workqueue: print diagnostic info when no worker is in running state
    
    show_cpu_pool_busy_workers() iterates over busy workers but gives no
    feedback when none are found in running state, which is a key indicator
    that a pool may be stuck — unable to wake an idle worker to process
    pending work.
    
    Add a diagnostic message when no running workers are found, reporting
    pool id, CPU, idle state, and worker counts.  Also trigger a single-CPU
    backtrace for the stalled CPU.
    
    To identify the task most likely responsible for the stall, add
    last_woken_worker (L: pool->lock) to worker_pool and record it in
    kick_pool() just before wake_up_process().  This captures the idle
    worker that was kicked to take over when the last running worker went to
    sleep; if the pool is now stuck with no running worker, that task is the
    prime suspect and its backtrace is dumped.
    
    Using struct worker * rather than struct task_struct * avoids any
    lifetime concern: workers are only destroyed via set_worker_dying()
    which requires pool->lock, and set_worker_dying() clears
    last_woken_worker when the dying worker matches.  show_cpu_pool_busy_workers()
    holds pool->lock while calling sched_show_task(), so last_woken_worker
    is either NULL or points to a live worker with a valid task.  More
    precisely, set_worker_dying() clears last_woken_worker before setting
    WORKER_DIE, so a non-NULL last_woken_worker means the kthread has not
    yet exited and worker->task is still alive.
    
    The pool info message is printed inside pool->lock using
    printk_deferred_enter/exit, the same pattern used by the existing
    busy-worker loop, to avoid deadlocks with console drivers that queue
    work while holding locks also taken in their write paths.
    trigger_single_cpu_backtrace() is called after releasing the lock.
    
    Suggested-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b77119d71641a..38aebf4514c03 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -217,6 +217,7 @@ struct worker_pool {
 						/* L: hash of busy workers */
 
 	struct worker		*manager;	/* L: purely informational */
+	struct worker		*last_woken_worker; /* L: last worker woken by kick_pool() */
 	struct list_head	workers;	/* A: attached workers */
 
 	struct ida		worker_ida;	/* worker IDs for task name */
@@ -1295,6 +1296,9 @@ static bool kick_pool(struct worker_pool *pool)
 		}
 	}
 #endif
+	/* Track the last idle worker woken, used for stall diagnostics. */
+	pool->last_woken_worker = worker;
+
 	wake_up_process(p);
 	return true;
 }
@@ -2902,6 +2906,13 @@ static void set_worker_dying(struct worker *worker, struct list_head *list)
 	pool->nr_workers--;
 	pool->nr_idle--;
 
+	/*
+	 * Clear last_woken_worker if it points to this worker, so that
+	 * show_cpu_pool_busy_workers() cannot dereference a freed worker.
+	 */
+	if (pool->last_woken_worker == worker)
+		pool->last_woken_worker = NULL;
+
 	worker->flags |= WORKER_DIE;
 
 	list_move(&worker->entry, list);
@@ -7582,20 +7593,58 @@ module_param_named(panic_on_stall_time, wq_panic_on_stall_time, uint, 0644);
 MODULE_PARM_DESC(panic_on_stall_time, "Panic if stall exceeds this many seconds (0=disabled)");
 
 /*
- * Show workers that might prevent the processing of pending work items.
- * A busy worker that is not running on the CPU (e.g. sleeping in
- * wait_event_idle() with PF_WQ_WORKER cleared) can stall the pool just as
- * effectively as a CPU-bound one, so dump every in-flight worker.
+ * Report that a pool has no worker in running state, which is a sign that the
+ * pool may be stuck. Print pool info. Must be called with pool->lock held and
+ * inside a printk_deferred_enter/exit region.
+ */
+static void show_pool_no_running_worker(struct worker_pool *pool)
+{
+	lockdep_assert_held(&pool->lock);
+
+	printk_deferred_enter();
+	pr_info("pool %d: no worker in running state, cpu=%d is %s (nr_workers=%d nr_idle=%d)\n",
+		pool->id, pool->cpu,
+		idle_cpu(pool->cpu) ? "idle" : "busy",
+		pool->nr_workers, pool->nr_idle);
+	pr_info("The pool might have trouble waking an idle worker.\n");
+	/*
+	 * last_woken_worker and its task are valid here: set_worker_dying()
+	 * clears it under pool->lock before setting WORKER_DIE, so if
+	 * last_woken_worker is non-NULL the kthread has not yet exited and
+	 * worker->task is still alive.
+	 */
+	if (pool->last_woken_worker) {
+		pr_info("Backtrace of last woken worker:\n");
+		sched_show_task(pool->last_woken_worker->task);
+	} else {
+		pr_info("Last woken worker empty\n");
+	}
+	printk_deferred_exit();
+}
+
+/*
+ * Show running workers that might prevent the processing of pending work items.
+ * If no running worker is found, the pool may be stuck waiting for an idle
+ * worker to be woken, so report the pool state and the last woken worker.
  */
 static void show_cpu_pool_busy_workers(struct worker_pool *pool)
 {
 	struct worker *worker;
 	unsigned long irq_flags;
-	int bkt;
+	bool found_running = false;
+	int cpu, bkt;
 
 	raw_spin_lock_irqsave(&pool->lock, irq_flags);
 
+	/* Snapshot cpu inside the lock to safely use it after unlock. */
+	cpu = pool->cpu;
+
 	hash_for_each(pool->busy_hash, bkt, worker, hentry) {
+		/* Skip workers that are not actively running on the CPU. */
+		if (!task_is_running(worker->task))
+			continue;
+
+		found_running = true;
 		/*
 		 * Defer printing to avoid deadlocks in console
 		 * drivers that queue work while holding locks
@@ -7609,7 +7658,23 @@ static void show_cpu_pool_busy_workers(struct worker_pool *pool)
 		printk_deferred_exit();
 	}
 
+	/*
+	 * If no running worker was found, the pool is likely stuck. Print pool
+	 * state and the backtrace of the last woken worker, which is the prime
+	 * suspect for the stall.
+	 */
+	if (!found_running)
+		show_pool_no_running_worker(pool);
+
 	raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
+
+	/*
+	 * Trigger a backtrace on the stalled CPU to capture what it is
+	 * currently executing. Called after releasing the lock to avoid
+	 * any potential issues with NMI delivery.
+	 */
+	if (!found_running)
+		trigger_single_cpu_backtrace(cpu);
 }
 
 static void show_cpu_pools_busy_workers(void)
Re: [PATCH v2 4/5] workqueue: Show all busy workers in stall diagnostics
Posted by Petr Mladek 3 weeks, 1 day ago
On Wed 2026-03-18 04:31:08, Breno Leitao wrote:
> On Fri, Mar 13, 2026 at 05:27:40PM +0100, Petr Mladek wrote:
> I agree. We should probably store the last woken worker in the worker_pool
> structure and print it later.
> 
> I've spent some time verifying that the locking and lifecycle management are
> correct. While I'm not completely certain, I believe it's getting closer. An
> extra pair of eyes would be helpful.
> 
> This is the new version of this patch:
> 
> commit feccca7e696ead3272669ee4d4dc02b6946d0faf
> Author: Breno Leitao <leitao@debian.org>
> Date:   Mon Mar 16 09:47:09 2026 -0700
> 
>     workqueue: print diagnostic info when no worker is in running state
>     
>     show_cpu_pool_busy_workers() iterates over busy workers but gives no
>     feedback when none are found in running state, which is a key indicator
>     that a pool may be stuck — unable to wake an idle worker to process
>     pending work.
>     
>     Add a diagnostic message when no running workers are found, reporting
>     pool id, CPU, idle state, and worker counts.  Also trigger a single-CPU
>     backtrace for the stalled CPU.
>     
>     To identify the task most likely responsible for the stall, add
>     last_woken_worker (L: pool->lock) to worker_pool and record it in
>     kick_pool() just before wake_up_process().  This captures the idle
>     worker that was kicked to take over when the last running worker went to
>     sleep; if the pool is now stuck with no running worker, that task is the
>     prime suspect and its backtrace is dumped.
>     
>     Using struct worker * rather than struct task_struct * avoids any
>     lifetime concern: workers are only destroyed via set_worker_dying()
>     which requires pool->lock, and set_worker_dying() clears
>     last_woken_worker when the dying worker matches.  show_cpu_pool_busy_workers()
>     holds pool->lock while calling sched_show_task(), so last_woken_worker
>     is either NULL or points to a live worker with a valid task.  More
>     precisely, set_worker_dying() clears last_woken_worker before setting
>     WORKER_DIE, so a non-NULL last_woken_worker means the kthread has not
>     yet exited and worker->task is still alive.
>     
>     The pool info message is printed inside pool->lock using
>     printk_deferred_enter/exit, the same pattern used by the existing
>     busy-worker loop, to avoid deadlocks with console drivers that queue
>     work while holding locks also taken in their write paths.
>     trigger_single_cpu_backtrace() is called after releasing the lock.
>     
>     Suggested-by: Petr Mladek <pmladek@suse.com>
>     Signed-off-by: Breno Leitao <leitao@debian.org>
> 
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index b77119d71641a..38aebf4514c03 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -7582,20 +7593,58 @@ module_param_named(panic_on_stall_time, wq_panic_on_stall_time, uint, 0644);
>  MODULE_PARM_DESC(panic_on_stall_time, "Panic if stall exceeds this many seconds (0=disabled)");
>  
>  /*
> - * Show workers that might prevent the processing of pending work items.
> - * A busy worker that is not running on the CPU (e.g. sleeping in
> - * wait_event_idle() with PF_WQ_WORKER cleared) can stall the pool just as
> - * effectively as a CPU-bound one, so dump every in-flight worker.
> + * Report that a pool has no worker in running state, which is a sign that the
> + * pool may be stuck. Print pool info. Must be called with pool->lock held and
> + * inside a printk_deferred_enter/exit region.
> + */
> +static void show_pool_no_running_worker(struct worker_pool *pool)
> +{
> +	lockdep_assert_held(&pool->lock);
> +
> +	printk_deferred_enter();
> +	pr_info("pool %d: no worker in running state, cpu=%d is %s (nr_workers=%d nr_idle=%d)\n",
> +		pool->id, pool->cpu,
> +		idle_cpu(pool->cpu) ? "idle" : "busy",
> +		pool->nr_workers, pool->nr_idle);
> +	pr_info("The pool might have trouble waking an idle worker.\n");
> +	/*
> +	 * last_woken_worker and its task are valid here: set_worker_dying()
> +	 * clears it under pool->lock before setting WORKER_DIE, so if
> +	 * last_woken_worker is non-NULL the kthread has not yet exited and
> +	 * worker->task is still alive.
> +	 */
> +	if (pool->last_woken_worker) {
> +		pr_info("Backtrace of last woken worker:\n");
> +		sched_show_task(pool->last_woken_worker->task);
> +	} else {
> +		pr_info("Last woken worker empty\n");

This is a bit ambiguous. It sounds like the worker is idle.
I would write something like:

		pr_info("There is no info about the last woken worker\n");
		pr_info("Missing info about the last woken worker.\n");

> +	}
> +	printk_deferred_exit();
> +}
> +

Otherwise, I like this patch.

I still wonder what might be the reason that there is no worker
in the running state. Let's see if this patch brings some useful info.

One more idea. It might be useful to store a timestamp when the last
worker was woken, and then print either the timestamp or the delta.
It would help to make sure that kick_pool() was really called
during the reported stall.

Best Regards,
Petr
Re: [PATCH v2 4/5] workqueue: Show all busy workers in stall diagnostics
Posted by Breno Leitao 2 weeks, 6 days ago
On Wed, Mar 18, 2026 at 04:11:54PM +0100, Petr Mladek wrote:
> On Wed 2026-03-18 04:31:08, Breno Leitao wrote:
> > On Fri, Mar 13, 2026 at 05:27:40PM +0100, Petr Mladek wrote:
> > I agree. We should probably store the last woken worker in the worker_pool
> > structure and print it later.
> > 
> > I've spent some time verifying that the locking and lifecycle management are
> > correct. While I'm not completely certain, I believe it's getting closer. An
> > extra pair of eyes would be helpful.
> > 
> > This is the new version of this patch:
> > 
> > commit feccca7e696ead3272669ee4d4dc02b6946d0faf
> > Author: Breno Leitao <leitao@debian.org>
> > Date:   Mon Mar 16 09:47:09 2026 -0700
> > 
> >     workqueue: print diagnostic info when no worker is in running state
> >     
> >     show_cpu_pool_busy_workers() iterates over busy workers but gives no
> >     feedback when none are found in running state, which is a key indicator
> >     that a pool may be stuck — unable to wake an idle worker to process
> >     pending work.
> >     
> >     Add a diagnostic message when no running workers are found, reporting
> >     pool id, CPU, idle state, and worker counts.  Also trigger a single-CPU
> >     backtrace for the stalled CPU.
> >     
> >     To identify the task most likely responsible for the stall, add
> >     last_woken_worker (L: pool->lock) to worker_pool and record it in
> >     kick_pool() just before wake_up_process().  This captures the idle
> >     worker that was kicked to take over when the last running worker went to
> >     sleep; if the pool is now stuck with no running worker, that task is the
> >     prime suspect and its backtrace is dumped.
> >     
> >     Using struct worker * rather than struct task_struct * avoids any
> >     lifetime concern: workers are only destroyed via set_worker_dying()
> >     which requires pool->lock, and set_worker_dying() clears
> >     last_woken_worker when the dying worker matches.  show_cpu_pool_busy_workers()
> >     holds pool->lock while calling sched_show_task(), so last_woken_worker
> >     is either NULL or points to a live worker with a valid task.  More
> >     precisely, set_worker_dying() clears last_woken_worker before setting
> >     WORKER_DIE, so a non-NULL last_woken_worker means the kthread has not
> >     yet exited and worker->task is still alive.
> >     
> >     The pool info message is printed inside pool->lock using
> >     printk_deferred_enter/exit, the same pattern used by the existing
> >     busy-worker loop, to avoid deadlocks with console drivers that queue
> >     work while holding locks also taken in their write paths.
> >     trigger_single_cpu_backtrace() is called after releasing the lock.
> >     
> >     Suggested-by: Petr Mladek <pmladek@suse.com>
> >     Signed-off-by: Breno Leitao <leitao@debian.org>
> > 
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index b77119d71641a..38aebf4514c03 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -7582,20 +7593,58 @@ module_param_named(panic_on_stall_time, wq_panic_on_stall_time, uint, 0644);
> >  MODULE_PARM_DESC(panic_on_stall_time, "Panic if stall exceeds this many seconds (0=disabled)");
> >  
> >  /*
> > - * Show workers that might prevent the processing of pending work items.
> > - * A busy worker that is not running on the CPU (e.g. sleeping in
> > - * wait_event_idle() with PF_WQ_WORKER cleared) can stall the pool just as
> > - * effectively as a CPU-bound one, so dump every in-flight worker.
> > + * Report that a pool has no worker in running state, which is a sign that the
> > + * pool may be stuck. Print pool info. Must be called with pool->lock held and
> > + * inside a printk_deferred_enter/exit region.
> > + */
> > +static void show_pool_no_running_worker(struct worker_pool *pool)
> > +{
> > +	lockdep_assert_held(&pool->lock);
> > +
> > +	printk_deferred_enter();
> > +	pr_info("pool %d: no worker in running state, cpu=%d is %s (nr_workers=%d nr_idle=%d)\n",
> > +		pool->id, pool->cpu,
> > +		idle_cpu(pool->cpu) ? "idle" : "busy",
> > +		pool->nr_workers, pool->nr_idle);
> > +	pr_info("The pool might have trouble waking an idle worker.\n");
> > +	/*
> > +	 * last_woken_worker and its task are valid here: set_worker_dying()
> > +	 * clears it under pool->lock before setting WORKER_DIE, so if
> > +	 * last_woken_worker is non-NULL the kthread has not yet exited and
> > +	 * worker->task is still alive.
> > +	 */
> > +	if (pool->last_woken_worker) {
> > +		pr_info("Backtrace of last woken worker:\n");
> > +		sched_show_task(pool->last_woken_worker->task);
> > +	} else {
> > +		pr_info("Last woken worker empty\n");
> 
> This is a bit ambiguous. It sounds like the worker is idle.
> I would write something like:
> 
> 		pr_info("There is no info about the last woken worker\n");
> 		pr_info("Missing info about the last woken worker.\n");
> 
> > +	}
> > +	printk_deferred_exit();
> > +}
> > +
> 
> Otherwise, I like this patch.
> 
> I still wonder what the reason might be that there is no worker
> in the running state. Let's see if this patch brings some useful info.
> 
> One more idea. It might be useful to store a timestamp when the last
> worker was woken. And then print either the timestamp or delta.
> It would help to make sure that kick_pool() was really called
> during the reported stall.

Ack, below is the patch I will deploy in production; let's see
how useful it is.

commit c78b175971888da3c2ae6d84971e9beb01269a92
Author: Breno Leitao <leitao@debian.org>
Date:   Mon Mar 16 09:47:09 2026 -0700

    workqueue: print diagnostic info when no worker is in running state
    
    show_cpu_pool_busy_workers() iterates over busy workers but gives no
    feedback when none are found in running state, which is a key indicator
    that a pool may be stuck — unable to wake an idle worker to process
    pending work.
    
    Add a diagnostic message when no running workers are found, reporting
    pool id, CPU, idle state, and worker counts.  Also trigger a single-CPU
    backtrace for the stalled CPU.
    
    To identify the task most likely responsible for the stall, add
    last_woken_worker and last_woken_tstamp (both L: pool->lock) to
    worker_pool and record them in kick_pool() just before
    wake_up_process().  This captures the idle worker that was kicked to
    take over when the last running worker went to sleep; if the pool is
    now stuck with no running worker, that task is the prime suspect and
    its backtrace is dumped along with how long ago it was woken.
    
    Using struct worker * rather than struct task_struct * avoids any
    lifetime concern: workers are only destroyed via set_worker_dying()
    which requires pool->lock, and set_worker_dying() clears
    last_woken_worker when the dying worker matches.  show_cpu_pool_busy_workers()
    holds pool->lock while calling sched_show_task(), so last_woken_worker
    is either NULL or points to a live worker with a valid task.  More
    precisely, set_worker_dying() clears last_woken_worker before setting
    WORKER_DIE, so a non-NULL last_woken_worker means the kthread has not
    yet exited and worker->task is still alive.
    
    The pool info message is printed inside pool->lock using
    printk_deferred_enter/exit, the same pattern used by the existing
    busy-worker loop, to avoid deadlocks with console drivers that queue
    work while holding locks also taken in their write paths.
    trigger_single_cpu_backtrace() is called after releasing the lock.
    
    Sample output from a stall triggered by the wq_stall test:
    
      pool 174: no worker in running state, cpu=43 is idle (nr_workers=2 nr_idle=1)
      The pool might have trouble waking an idle worker.
      Last worker woken 48977 ms ago:
      task:kworker/43:1    state:I stack:0     pid:631   tgid:631   ppid:2
      Call Trace:
        <stack trace>
    
    Suggested-by: Petr Mladek <pmladek@suse.com>
    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index b77119d71641a..f8b1741824117 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -217,6 +217,8 @@ struct worker_pool {
 						/* L: hash of busy workers */
 
 	struct worker		*manager;	/* L: purely informational */
+	struct worker		*last_woken_worker; /* L: last worker woken by kick_pool() */
+	unsigned long		last_woken_tstamp;  /* L: timestamp of last kick_pool() wake */
 	struct list_head	workers;	/* A: attached workers */
 
 	struct ida		worker_ida;	/* worker IDs for task name */
@@ -1295,6 +1297,10 @@ static bool kick_pool(struct worker_pool *pool)
 		}
 	}
 #endif
+	/* Track the last idle worker woken, used for stall diagnostics. */
+	pool->last_woken_worker = worker;
+	pool->last_woken_tstamp = jiffies;
+
 	wake_up_process(p);
 	return true;
 }
@@ -2902,6 +2908,13 @@ static void set_worker_dying(struct worker *worker, struct list_head *list)
 	pool->nr_workers--;
 	pool->nr_idle--;
 
+	/*
+	 * Clear last_woken_worker if it points to this worker, so that
+	 * show_cpu_pool_busy_workers() cannot dereference a freed worker.
+	 */
+	if (pool->last_woken_worker == worker)
+		pool->last_woken_worker = NULL;
+
 	worker->flags |= WORKER_DIE;
 
 	list_move(&worker->entry, list);
@@ -7582,20 +7595,59 @@ module_param_named(panic_on_stall_time, wq_panic_on_stall_time, uint, 0644);
 MODULE_PARM_DESC(panic_on_stall_time, "Panic if stall exceeds this many seconds (0=disabled)");
 
 /*
- * Show workers that might prevent the processing of pending work items.
- * A busy worker that is not running on the CPU (e.g. sleeping in
- * wait_event_idle() with PF_WQ_WORKER cleared) can stall the pool just as
- * effectively as a CPU-bound one, so dump every in-flight worker.
+ * Report that a pool has no worker in the running state, which is a sign
+ * that the pool may be stuck. Prints the pool info inside its own
+ * printk_deferred_enter/exit section. Must be called with pool->lock held.
+ */
+static void show_pool_no_running_worker(struct worker_pool *pool)
+{
+	lockdep_assert_held(&pool->lock);
+
+	printk_deferred_enter();
+	pr_info("pool %d: no worker in running state, cpu=%d is %s (nr_workers=%d nr_idle=%d)\n",
+		pool->id, pool->cpu,
+		idle_cpu(pool->cpu) ? "idle" : "busy",
+		pool->nr_workers, pool->nr_idle);
+	pr_info("The pool might have trouble waking an idle worker.\n");
+	/*
+	 * last_woken_worker and its task are valid here: set_worker_dying()
+	 * clears it under pool->lock before setting WORKER_DIE, so if
+	 * last_woken_worker is non-NULL the kthread has not yet exited and
+	 * worker->task is still alive.
+	 */
+	if (pool->last_woken_worker) {
+		pr_info("Last worker woken %lu ms ago:\n",
+			jiffies_to_msecs(jiffies - pool->last_woken_tstamp));
+		sched_show_task(pool->last_woken_worker->task);
+	} else {
+		pr_info("Missing info about the last woken worker.\n");
+	}
+	printk_deferred_exit();
+}
+
+/*
+ * Show running workers that might prevent the processing of pending work items.
+ * If no running worker is found, the pool may be stuck waiting for an idle
+ * worker to be woken, so report the pool state and the last woken worker.
  */
 static void show_cpu_pool_busy_workers(struct worker_pool *pool)
 {
 	struct worker *worker;
 	unsigned long irq_flags;
-	int bkt;
+	bool found_running = false;
+	int cpu, bkt;
 
 	raw_spin_lock_irqsave(&pool->lock, irq_flags);
 
+	/* Snapshot cpu inside the lock to safely use it after unlock. */
+	cpu = pool->cpu;
+
 	hash_for_each(pool->busy_hash, bkt, worker, hentry) {
+		/* Skip workers that are not actively running on the CPU. */
+		if (!task_is_running(worker->task))
+			continue;
+
+		found_running = true;
 		/*
 		 * Defer printing to avoid deadlocks in console
 		 * drivers that queue work while holding locks
@@ -7609,7 +7661,23 @@ static void show_cpu_pool_busy_workers(struct worker_pool *pool)
 		printk_deferred_exit();
 	}
 
+	/*
+	 * If no running worker was found, the pool is likely stuck. Print pool
+	 * state and the backtrace of the last woken worker, which is the prime
+	 * suspect for the stall.
+	 */
+	if (!found_running)
+		show_pool_no_running_worker(pool);
+
 	raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
+
+	/*
+	 * Trigger a backtrace on the stalled CPU to capture what it is
+	 * executing. Done after releasing pool->lock so the potentially
+	 * slow IPI/NMI backtrace does not run under a raw spinlock.
+	 */
+	if (!found_running)
+		trigger_single_cpu_backtrace(cpu);
 }
 
 static void show_cpu_pools_busy_workers(void)
Re: [PATCH v2 4/5] workqueue: Show all busy workers in stall diagnostics
Posted by Song Liu 1 month ago
On Thu, Mar 5, 2026 at 8:16 AM Breno Leitao <leitao@debian.org> wrote:
>
> show_cpu_pool_hog() only prints workers whose task is currently running
> on the CPU (task_is_running()).  This misses workers that are busy
> processing a work item but are sleeping or blocked — for example, a
> worker that clears PF_WQ_WORKER and enters wait_event_idle().  Such a
> worker still occupies a pool slot and prevents progress, yet produces
> an empty backtrace section in the watchdog output.
>
> This is happening on real arm64 systems, where
> toggle_allocation_gate() IPIs every single CPU in the machine (which
> lacks NMI), causing workqueue stalls that show empty backtraces because
> toggle_allocation_gate() is sleeping in wait_event_idle().
>
> Remove the task_is_running() filter so every in-flight worker in the
> pool's busy_hash is dumped.  The busy_hash is protected by pool->lock,
> which is already held.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>

Acked-by: Song Liu <song@kernel.org>