[PATCH 0/4] workqueue: Detect stalled in-flight workers

Posted by Breno Leitao 2 months ago
The workqueue watchdog detects pools that haven't made forward progress
by checking whether pending work items on the worklist have been waiting
too long. However, this approach has a blind spot: if a pool has only
one work item and that item has already been dequeued and is executing on
a worker, the worklist is empty and the watchdog skips the pool entirely.
This means a single hogged worker with no other pending work is invisible
to the stall detector.

I was able to come up with the following example that shows this blind
spot:

	static void stall_work_fn(struct work_struct *work)
	{
		for (;;) {
			mdelay(1000);
			cond_resched();
		}
	}

Additionally, when the watchdog does report stalled pools, the output
doesn't show how long each in-flight work item has been running, making
it harder to identify which specific worker is stuck.

This series addresses both issues:

Patch 1 fixes a minor semantic inconsistency where pool flags were
checked against a workqueue-level constant (WQ_BH instead of POOL_BH).
No behavioral change since both constants have the same value.

Patch 2 renames pool->watchdog_ts to pool->last_progress_ts to better
describe what the timestamp actually tracks.

Patch 3 adds a current_start timestamp to struct worker, recording when
a work item began executing. This is printed in show_pwq() as elapsed
wall-clock time (e.g., "in-flight: 165:stall_work_fn [wq_stall] for
100s"), giving immediate visibility into how long each worker has been
busy.

Patch 4 introduces pool_has_stalled_worker(), which scans all workers in
a pool's busy_hash for any whose current_start timestamp exceeds the
watchdog threshold. This is called unconditionally for every pool,
independent of worklist state, so a stuck worker is always detected. The
feature is gated behind a new CONFIG_WQ_WATCHDOG_WORKERS option
(disabled by default) under CONFIG_WQ_WATCHDOG.

An option is to drop CONFIG_WQ_WATCHDOG_WORKERS completely. I've
been running this change on some hosts under load (mainly stress-ng)
and I haven't seen any false positives.

With this series applied, we will be able to see a stall like the one
above:

	 BUG: workqueue lockup - worker365:stall_work_fn [wq_stall] stuck in pool cpus=9 node=0 flags=0x0 nice=0 for 2570s!
	 Showing busy workqueues and worker pools:
	  workqueue events: flags=0x100
	  pwq 38: cpus=9 node=0 flags=0x0 nice=0 active=2 refcnt=3
	  workqueue stall_wq: flags=0x0

---
Breno Leitao (4):
      workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
      workqueue: Rename pool->watchdog_ts to pool->last_progress_ts
      workqueue: Show in-flight work item duration in stall diagnostics
      workqueue: Detect stalled in-flight work items with empty worklist

 kernel/workqueue.c          | 71 ++++++++++++++++++++++++++++++++++++++-------
 kernel/workqueue_internal.h |  1 +
 lib/Kconfig.debug           | 12 ++++++++
 3 files changed, 74 insertions(+), 10 deletions(-)
---
base-commit: 9cb8b0f289560728dbb8b88158e7a957e2e90a14
change-id: 20260210-wqstall_start-at-e7319a005ab4

Best regards,
--  
Breno Leitao <leitao@debian.org>
Re: [PATCH 0/4] workqueue: Detect stalled in-flight workers
Posted by Tejun Heo 2 months ago
Hello,

On Wed, Feb 11, 2026 at 04:29:14AM -0800, Breno Leitao wrote:
> The workqueue watchdog detects pools that haven't made forward progress
> by checking whether pending work items on the worklist have been waiting
> too long. However, this approach has a blind spot: if a pool has only
> one work item and that item has already been dequeued and is executing on
> a worker, the worklist is empty and the watchdog skips the pool entirely.
> This means a single hogged worker with no other pending work is invisible
> to the stall detector.
> 
> I was able to come up with the following example that shows this blind
> spot:
> 
> 	static void stall_work_fn(struct work_struct *work)
> 	{
> 		for (;;) {
> 			mdelay(1000);
> 			cond_resched();
> 		}
> 	}

Workqueue doesn't require users to limit execution time. As long as there is
enough supply of concurrency to avoid stalling of pending work items, work
items can run as long as they want, including indefinitely. Workqueue stall
is there to indicate that there is insufficient supply of concurrency.

Thanks.

-- 
tejun
Re: [PATCH 0/4] workqueue: Detect stalled in-flight workers
Posted by Breno Leitao 1 month, 2 weeks ago
Hello Tejun,

On Wed, Feb 11, 2026 at 08:56:11AM -1000, Tejun Heo wrote:
> On Wed, Feb 11, 2026 at 04:29:14AM -0800, Breno Leitao wrote:
> > The workqueue watchdog detects pools that haven't made forward progress
> > by checking whether pending work items on the worklist have been waiting
> > too long. However, this approach has a blind spot: if a pool has only
> > one work item and that item has already been dequeued and is executing on
> > a worker, the worklist is empty and the watchdog skips the pool entirely.
> > This means a single hogged worker with no other pending work is invisible
> > to the stall detector.
> > 
> > I was able to come up with the following example that shows this blind
> > spot:
> > 
> > 	static void stall_work_fn(struct work_struct *work)
> > 	{
> > 		for (;;) {
> > 			mdelay(1000);
> > 			cond_resched();
> > 		}
> > 	}
> 
> Workqueue doesn't require users to limit execution time. As long as there is
> enough supply of concurrency to avoid stalling of pending work items, work
> items can run as long as they want, including indefinitely. Workqueue stall
> is there to indicate that there is insufficient supply of concurrency.

Thank you for the clarification. Let me share more context about the
actual problem I am observing so we can think through it together.

On some production hosts, I am seeing a workqueue stall where no
backtraces are printed:

	BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
	Showing busy workqueues and worker pools:
	workqueue events: flags=0x100
		pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
		in-flight: 178:stall_work1_fn [wq_stall]
		pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
	workqueue mm_percpu_wq: flags=0x108
		pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=1 refcnt=2
		pending: vmstat_update
		pool 18: cpus=4 node=0 flags=0x0 nice=0 hung=132s workers=2 idle: 45
	Showing backtraces of running workers in stalled CPU-bound worker pools:
		<nothing here>

We initially suspected a TOCTOU issue, and Omar put together a patch to
rule that out, but it did not identify anything.

After digging deeper, I believe I have found the root cause along with
a reproducer[1]:

  1) kfence executes toggle_allocation_gate() as a delayed workqueue
     item (kfence_timer) on the system WQ.

  2) toggle_allocation_gate() enables a static key, which IPIs every
     CPU to patch code:
          static_branch_enable(&kfence_allocation_key);

  3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a
     kfence allocation to occur:
          wait_event_idle(allocation_wait,
                  atomic_read(&kfence_allocation_gate) > 0 || ...);

     This can last indefinitely if no allocation goes through the
     kfence path. The worker remains in the pool's busy_hash
     (in-flight) but is no longer task_is_running().

  4) The workqueue watchdog detects the stall and calls
     show_cpu_pool_hog(), which only prints backtraces for workers
     that are actively running on CPU:

          static void show_cpu_pool_hog(struct worker_pool *pool) {
                  ...
                  if (task_is_running(worker->task))
                          sched_show_task(worker->task);
          }

  5) Nothing is printed because the offending worker is in TASK_IDLE
     state. The output shows "Showing backtraces of running workers in
     stalled CPU-bound worker pools:" followed by nothing, effectively
     hiding the actual culprit.

The fix I am considering is to remove the task_is_running() filter in
show_cpu_pool_hog() so that all in-flight workers in stalled pools have
their backtraces printed, regardless of whether they are running or
sleeping. This would make sleeping culprits like toggle_allocation_gate()
visible in the watchdog output.

When I test without the task_is_running() check, I see the culprit.

Fix I am testing:

	diff --git a/kernel/workqueue.c b/kernel/workqueue.c
	index aeaec79bc09c4..3f5ee08f99313 100644
	--- a/kernel/workqueue.c
	+++ b/kernel/workqueue.c
	@@ -7593,19 +7593,17 @@ static void show_cpu_pool_hog(struct worker_pool *pool)
		raw_spin_lock_irqsave(&pool->lock, irq_flags);

		hash_for_each(pool->busy_hash, bkt, worker, hentry) {
	-               if (task_is_running(worker->task)) {
	-                       /*
	-                        * Defer printing to avoid deadlocks in console
	-                        * drivers that queue work while holding locks
	-                        * also taken in their write paths.
	-                        */
	-                       printk_deferred_enter();
	+               /*
	+                * Defer printing to avoid deadlocks in console
	+                * drivers that queue work while holding locks
	+                * also taken in their write paths.
	+                */
	+               printk_deferred_enter();

	-                       pr_info("pool %d:\n", pool->id);
	-                       sched_show_task(worker->task);
	+               pr_info("pool %d:\n", pool->id);
	+               sched_show_task(worker->task);

	-                       printk_deferred_exit();
	-               }
	+               printk_deferred_exit();
		}

		raw_spin_unlock_irqrestore(&pool->lock, irq_flags);
	@@ -7616,7 +7614,7 @@ static void show_cpu_pools_hogs(void)
		struct worker_pool *pool;
		int pi;

	-       pr_info("Showing backtraces of running workers in stalled CPU-bound worker pools:\n");
	+       pr_info("Showing backtraces of in-flight workers in stalled CPU-bound worker pools:\n");

		rcu_read_lock();

Then I see:

	BUG: workqueue lockup - pool cpus=6 node=0 flags=0x0 nice=0 stuck for 34s!
	 Showing busy workqueues and worker pools:
	 workqueue events: flags=0x100
	   pwq 26: cpus=6 node=0 flags=0x0 nice=0 active=3 refcnt=4
	     in-flight: 161:stall_work1_fn [wq_stall]
	     pending: stall_work2_fn [wq_stall], psi_avgs_work
	 workqueue mm_percpu_wq: flags=0x108
	   pwq 26: cpus=6 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: vmstat_update
	 pool 26: cpus=6 node=0 flags=0x0 nice=0 hung=34s workers=3 idle: 210 57
	 Showing backtraces of in-flight workers in stalled CPU-bound worker pools:
	 pool 26:
	 task:kworker/6:1     state:I stack:0     pid:161   tgid:161   ppid:2      task_flags:0x4208040 flags:0x00080000
	 Call Trace:
	  <TASK>
	  __schedule+0x1521/0x5360
	  ? console_trylock+0x40/0x40
	  ? preempt_count_add+0x92/0x1a0
	  ? do_raw_spin_lock+0x12c/0x2f0
	  ? is_mmconf_reserved+0x390/0x390
	  ? schedule+0x91/0x350
	  ? schedule+0x91/0x350
	  schedule+0x165/0x350
	  stall_work1_fn+0x17f/0x250 [wq_stall]


Link: https://github.com/leitao/debug/blob/main/workqueue_stall/wq_stall.c [1]
Re: [PATCH 0/4] workqueue: Detect stalled in-flight workers
Posted by Tejun Heo 1 month, 2 weeks ago
Hello,

On Wed, Mar 04, 2026 at 07:40:49AM -0800, Breno Leitao wrote:
> The fix I am considering is to remove the task_is_running() filter in
> show_cpu_pool_hog() so that all in-flight workers in stalled pools have
> their backtraces printed, regardless of whether they are running or
> sleeping. This would make sleeping culprits like toggle_allocation_gate()
> visible in the watchdog output.

Yeah, that makes sense to me.

Thanks.

-- 
tejun