[v2] workqueue: Detect stalled in-flight workers

[PATCH v2 0/5] workqueue: Detect stalled in-flight workers

Posted by Breno Leitao 1 month ago

A blind spot exists in the workqueue stall detector (aka
show_cpu_pool_hog()). It only prints workers whose task_is_running() is
true, so a busy worker that is sleeping (e.g. wait_event_idle())
produces an empty backtrace section even though it is the cause of the
stall.

Additionally, when the watchdog does report stalled pools, the output
doesn't show how long each in-flight work item has been running, making
it harder to identify which specific worker is stuck.

Example output from the sample module:

    BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
    Showing busy workqueues and worker pools:
    workqueue events: flags=0x100
        pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
        in-flight: 178:stall_work1_fn [wq_stall]
        pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
	...
    Showing backtraces of running workers in stalled
    CPU-bound worker pools:
        <nothing here>

I see this happening on real machines, causing stalls that do not
have any backtrace. This is one of the code paths:

  1) kfence executes toggle_allocation_gate() as a delayed workqueue
     item (kfence_timer) on the system WQ.

  2) toggle_allocation_gate() enables a static key, which IPIs every
     CPU to patch code:
          static_branch_enable(&kfence_allocation_key);

  3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a
     kfence allocation to occur:
          wait_event_idle(allocation_wait,
                  atomic_read(&kfence_allocation_gate) > 0 || ...);

     This can last indefinitely if no allocation goes through the
     kfence path (or if IPIing all the CPUs takes longer, which is common
     on platforms that do not have NMI).

     The worker remains in the pool's busy_hash
     (in-flight) but is no longer task_is_running().

  4) The workqueue watchdog detects the stall and calls
     show_cpu_pool_hog(), which only prints backtraces for workers
     that are actively running on CPU:

          static void show_cpu_pool_hog(struct worker_pool *pool) {
                  ...
                  if (task_is_running(worker->task))
                          sched_show_task(worker->task);
          }

  5) Nothing is printed because the offending worker is in TASK_IDLE
     state. The output shows "Showing backtraces of running workers in
     stalled CPU-bound worker pools:" followed by nothing, effectively
     hiding the actual culprit.
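
To make the effect of the filter in steps 4 and 5 concrete, here is a
minimal userspace sketch. It is not kernel code: struct fake_worker, the
state enum and both helper names are hypothetical stand-ins, assumed only
for illustration. It counts how many in-flight workers would get a
backtrace with and without the "running only" filter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the kernel's types. */
enum fake_state { FAKE_RUNNING, FAKE_IDLE_SLEEP };

struct fake_worker {
	int pid;
	enum fake_state state;
	bool in_flight;		/* true if the worker would be in busy_hash */
};

/* Filtered behaviour: only in-flight workers that are runnable are dumped. */
static size_t dumped_running_only(const struct fake_worker *w, size_t n)
{
	size_t i, dumped = 0;

	assert(w != NULL);
	for (i = 0; i < n; i++)
		if (w[i].in_flight && w[i].state == FAKE_RUNNING)
			dumped++;
	return dumped;
}

/* Unfiltered behaviour: every in-flight worker is dumped. */
static size_t dumped_all_busy(const struct fake_worker *w, size_t n)
{
	size_t i, dumped = 0;

	assert(w != NULL);
	for (i = 0; i < n; i++)
		if (w[i].in_flight)
			dumped++;
	return dumped;
}
```

With a single in-flight worker sleeping in an idle wait, the filtered
variant reports nothing, while the unfiltered one reports that worker.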

Since I use this detector a lot, I am also proposing some additional
improvements here.

This series addresses these issues:

Patch 1 fixes a minor semantic inconsistency where pool flags were
checked against a workqueue-level constant (WQ_BH instead of POOL_BH).
No behavioral change since both constants have the same value.

Patch 2 renames pool->watchdog_ts to pool->last_progress_ts to better
describe what the timestamp actually tracks.

Patch 3 adds a current_start timestamp to struct worker, recording when
a work item began executing. This is printed in show_pwq() as elapsed
wall-clock time (e.g., "in-flight: 165:stall_work_fn [wq_stall] for
100s"), giving immediate visibility into how long each worker has been
busy.
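
As a rough userspace illustration of the "for 100s" suffix, the sketch
below derives the elapsed seconds from a start timestamp and formats a
line in the same shape. The millisecond timestamps and both helper names
are assumptions made for this example; the patch itself may use a
different clock source and a different format string.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: whole seconds between two millisecond timestamps. */
static uint64_t elapsed_secs(uint64_t start_ms, uint64_t now_ms)
{
	assert(now_ms >= start_ms);
	return (now_ms - start_ms) / 1000;
}

/*
 * Hypothetical formatter mimicking the "in-flight: PID:FUNC for Ns" shape.
 * Returns the value of snprintf(), i.e. the length that was (or would
 * have been) written, excluding the terminating NUL.
 */
static int format_in_flight(char *buf, size_t len, int pid, const char *fn,
			    uint64_t start_ms, uint64_t now_ms)
{
	return snprintf(buf, len, "in-flight: %d:%s for %llus", pid, fn,
			(unsigned long long)elapsed_secs(start_ms, now_ms));
}
```

The start timestamp would be recorded once, when the work item begins
executing, and only read when a report is printed.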

Patch 4 removes the task_is_running() filter from show_cpu_pool_hog()
so that every in-flight worker in the pool's busy_hash is dumped. This
catches workers that are busy but sleeping or blocked, which were
previously invisible in the watchdog output.

Patch 5 adds a sample module (samples/workqueue/stall_detector) that
exercises the stall detector.

With this series applied, the stall output shows the backtrace of every
busy worker and how long each work item has been in flight. Example:

	 BUG: workqueue lockup - pool cpus=14 node=0 flags=0x0 nice=0 stuck for 42!
	 Showing busy workqueues and worker pools:
	 workqueue events: flags=0x100
	   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: vmstat_shepherd
	   pwq 58: cpus=14 node=0 flags=0x0 nice=0 active=4 refcnt=5
	     in-flight: 184:stall_work1_fn [wq_stall] for 39s
 	 ...
	 Showing backtraces of busy workers in stalled CPU-bound worker pools:
	 pool 58:
	 task:kworker/14:1    state:I stack:0     pid:184 tgid:184   ppid:2      task_flags:0x4208040 flags:0x00080000
	 Call Trace:
	  <TASK>
	  __schedule+0x1521/0x5360
	  schedule+0x165/0x350
	  stall_work1_fn+0x17f/0x250 [wq_stall]
	  ...

---
Changes in v2:
- Drop the task_is_running() filter in show_cpu_pool_hog() instead of assuming a
  work item cannot stay running forever.
- Add sample code to exercise the stall detector
- Link to v1: https://patch.msgid.link/20260211-wqstall_start-at-v1-0-bd9499a18c19@debian.org

---
Breno Leitao (5):
      workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
      workqueue: Rename pool->watchdog_ts to pool->last_progress_ts
      workqueue: Show in-flight work item duration in stall diagnostics
      workqueue: Show all busy workers in stall diagnostics
      workqueue: Add stall detector sample module

 kernel/workqueue.c                          | 47 +++++++-------
 kernel/workqueue_internal.h                 |  1 +
 samples/workqueue/stall_detector/Makefile   |  1 +
 samples/workqueue/stall_detector/wq_stall.c | 98 +++++++++++++++++++++++++++++
 4 files changed, 124 insertions(+), 23 deletions(-)
---
base-commit: c107785c7e8dbabd1c18301a1c362544b5786282
change-id: 20260210-wqstall_start-at-e7319a005ab4

Best regards,
--  
Breno Leitao <leitao@debian.org>

Re: [PATCH v2 0/5] workqueue: Detect stalled in-flight workers

Posted by Petr Mladek 4 weeks ago

On Thu 2026-03-05 08:15:36, Breno Leitao wrote:
> There is a blind spot exists in the work queue stall detecetor (aka
> show_cpu_pool_hog()). It only prints workers whose task_is_running() is
> true, so a busy worker that is sleeping (e.g. wait_event_idle())
> produces an empty backtrace section even though it is the cause of the
> stall.
> 
> Additionally, when the watchdog does report stalled pools, the output
> doesn't show how long each in-flight work item has been running, making
> it harder to identify which specific worker is stuck.
> 
> Example of the sample code:
> 
>     BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
>     Showing busy workqueues and worker pools:
>     workqueue events: flags=0x100
>         pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
>         in-flight: 178:stall_work1_fn [wq_stall]
>         pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
> 	...
>     Showing backtraces of running workers in stalled
>     CPU-bound worker pools:
>         <nothing here>
> 
> I see it happening on real machines, causing some stalls that doesn't
> have any backtrace. This is one of the code path:
> 
>   1) kfence executes toggle_allocation_gate() as a delayed workqueue
>      item (kfence_timer) on the system WQ.
> 
>   2) toggle_allocation_gate() enables a static key, which IPIs every
>      CPU to patch code:
>           static_branch_enable(&kfence_allocation_key);
> 
>   3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a
>      kfence allocation to occur:
>           wait_event_idle(allocation_wait,
>                   atomic_read(&kfence_allocation_gate) > 0 || ...);
> 
>      This can last indefinitely if no allocation goes through the
>      kfence path (or IPIing all the CPUs take longer, which is common on
>      platforms that do not have NMI).
> 
>      The worker remains in the pool's busy_hash
>      (in-flight) but is no longer task_is_running().
>
>   4) The workqueue watchdog detects the stall and calls
>      show_cpu_pool_hog(), which only prints backtraces for workers
>      that are actively running on CPU:
> 
>           static void show_cpu_pool_hog(struct worker_pool *pool) {
>                   ...
>                   if (task_is_running(worker->task))
>                           sched_show_task(worker->task);
>           }
> 
>   5) Nothing is printed because the offending worker is in TASK_IDLE
>      state. The output shows "Showing backtraces of running workers in
>      stalled CPU-bound worker pools:" followed by nothing, effectively
>      hiding the actual culprit.

I am trying to better understand the situation. There was a reason
why only the worker in the running state was shown.

Normally, a sleeping worker should not cause a stall. The scheduler calls
wq_worker_sleeping() which should wake up another idle worker. There is
always at least one idle worker in the pool. It should start processing
the next pending work. Or it should fork another worker when it was
the last idle one.

I wonder what blocked the idle worker from waking or forking
a new worker. Was it caused by the IPIs?

Did printing the sleeping workers help to analyze the problem?

I wonder if we could do better in this case. For example, warn
that the scheduler failed to wake up another idle worker when
no worker is in the running state. And maybe, print backtrace
of the currently running process on the given CPU because it
likely blocks waking/scheduling the idle worker.

Otherwise, I like the other improvements.

Best Regards,
Petr

Re: [PATCH v2 0/5] workqueue: Detect stalled in-flight workers

Posted by Breno Leitao 3 weeks, 6 days ago

Hello Petr,

On Thu, Mar 12, 2026 at 05:38:26PM +0100, Petr Mladek wrote:
> On Thu 2026-03-05 08:15:36, Breno Leitao wrote:
> > There is a blind spot exists in the work queue stall detecetor (aka
> > show_cpu_pool_hog()). It only prints workers whose task_is_running() is
> > true, so a busy worker that is sleeping (e.g. wait_event_idle())
> > produces an empty backtrace section even though it is the cause of the
> > stall.
> > 
> > Additionally, when the watchdog does report stalled pools, the output
> > doesn't show how long each in-flight work item has been running, making
> > it harder to identify which specific worker is stuck.
> > 
> > Example of the sample code:
> > 
> >     BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
> >     Showing busy workqueues and worker pools:
> >     workqueue events: flags=0x100
> >         pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
> >         in-flight: 178:stall_work1_fn [wq_stall]
> >         pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
> > 	...
> >     Showing backtraces of running workers in stalled
> >     CPU-bound worker pools:
> >         <nothing here>
> > 
> > I see it happening on real machines, causing some stalls that doesn't
> > have any backtrace. This is one of the code path:
> > 
> >   1) kfence executes toggle_allocation_gate() as a delayed workqueue
> >      item (kfence_timer) on the system WQ.
> > 
> >   2) toggle_allocation_gate() enables a static key, which IPIs every
> >      CPU to patch code:
> >           static_branch_enable(&kfence_allocation_key);
> > 
> >   3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a
> >      kfence allocation to occur:
> >           wait_event_idle(allocation_wait,
> >                   atomic_read(&kfence_allocation_gate) > 0 || ...);
> > 
> >      This can last indefinitely if no allocation goes through the
> >      kfence path (or IPIing all the CPUs take longer, which is common on
> >      platforms that do not have NMI).
> > 
> >      The worker remains in the pool's busy_hash
> >      (in-flight) but is no longer task_is_running().
> >
> >   4) The workqueue watchdog detects the stall and calls
> >      show_cpu_pool_hog(), which only prints backtraces for workers
> >      that are actively running on CPU:
> > 
> >           static void show_cpu_pool_hog(struct worker_pool *pool) {
> >                   ...
> >                   if (task_is_running(worker->task))
> >                           sched_show_task(worker->task);
> >           }
> > 
> >   5) Nothing is printed because the offending worker is in TASK_IDLE
> >      state. The output shows "Showing backtraces of running workers in
> >      stalled CPU-bound worker pools:" followed by nothing, effectively
> >      hiding the actual culprit.
> 
> I am trying to better understand the situation. There was a reason
> why only the worker in the running state was shown.
> 
> Normally, a sleeping worker should not cause a stall. The scheduler calls
> wq_worker_sleeping() which should wake up another idle worker. There is
> always at least one idle worker in the poll. It should start processing
> the next pending work. Or it should fork another worker when it was
> the last idle one.

Right, but let's look at this case:

	 BUG: workqueue lockup - pool 55 cpu 13 curr 0 (swapper/13) stack ffff800085640000 cpus=13 node=0 flags=0x0 nice=-20 stuck for 679s!
	  work func=blk_mq_timeout_work data=0xffff0000ad7e3a05
	  Showing busy workqueues and worker pools:
	  workqueue events_unbound: flags=0x2
	    pwq 288: cpus=0-71 flags=0x4 nice=0 active=1 refcnt=2
	      in-flight: 4083734:btrfs_extent_map_shrinker_worker
	  workqueue mm_percpu_wq: flags=0x8
	    pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=1 refcnt=2
	      pending: vmstat_update
	  pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 1715676 4086805 3860852 3587585 4065550 4014041 3944711 3744484
	  Showing backtraces of running workers in stalled CPU-bound worker pools:
		# Nothing in here

It seems CPU 13 is idle (curr = 0) and blk_mq_timeout_work has been pending for
679s ?

> I wonder what blocked the idle worker from waking or forking
> a new worker. Was it caused by the IPIs?

Not sure. Keep in mind that these hosts (arm64) do not have NMI, so
IPIs are just regular interrupts that could take a long time to be handled.
toggle_allocation_gate() was a good example, given it was sending IPIs very
frequently, so I used it for the cover letter, but this problem
also shows up in different places (more examples later).

> Did printing the sleeping workers helped to analyze the problem?

That is my hope. I don't have a reproducer other than the one in this
patchset.

I am currently rolling this patchset to production, and I can report once
I get more information.

> I wonder if we could do better in this case. For example, warn
> that the scheduler failed to wake up another idle worker when
> no worker is in the running state. And maybe, print backtrace
> of the currently running process on the given CPU because it
> likely blocks waking/scheduling the idle worker.

I am happy to improve this, given this has been a hard issue. Let me give more
instances of the "empty" stalls I am seeing, all with empty backtraces:

# Instance 1
	 BUG: workqueue lockup - pool cpus=33 node=0 flags=0x0 nice=0 stuck for 33s!
	 Showing busy workqueues and worker pools:
	 workqueue events: flags=0x0
	   pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=3 refcnt=4
	     pending: 3*psi_avgs_work
	   pwq 218: cpus=54 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     in-flight: 842:key_garbage_collector
	 workqueue mm_percpu_wq: flags=0x8
	   pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: vmstat_update
	 pool 218: cpus=54 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 11200 524627
	 Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 2
	 BUG: workqueue lockup - pool cpus=53 node=0 flags=0x0 nice=0 stuck for 459s!
	 Showing busy workqueues and worker pools:
	 workqueue events: flags=0x0
	   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: psi_avgs_work
	   pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=4 refcnt=5
	     pending: 2*psi_avgs_work, drain_local_memcg_stock, iova_depot_work_func
	 workqueue events_freezable: flags=0x4
	   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: pci_pme_list_scan
	 workqueue slub_flushwq: flags=0x8
	   pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=3
	     pending: flush_cpu_slab BAR(7520)
	 workqueue mm_percpu_wq: flags=0x8
	   pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: vmstat_update
	 workqueue mlx5_cmd_0002:03:00.1: flags=0x6000a
	   pwq 576: cpus=0-143 flags=0x4 nice=0 active=1 refcnt=146
	     pending: cmd_work_handler
	 Showing backtraces of running workers in stalled CPU-bound worker pools:

# Instance 3
	 BUG: workqueue lockup - pool cpus=74 node=1 flags=0x0 nice=0 stuck for 31s!
	 Showing busy workqueues and worker pools:
	 workqueue mm_percpu_wq: flags=0x8
	   pwq 298: cpus=74 node=1 flags=0x0 nice=0 active=1 refcnt=2
	     pending: vmstat_update
	 Showing backtraces of running workers in stalled CPU-bound worker pools:	

# Instance 4
	 BUG: workqueue lockup - pool cpus=71 node=0 flags=0x0 nice=0 stuck for 32s!
	 Showing busy workqueues and worker pools:
	 workqueue events: flags=0x0
	   pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=2 refcnt=3
	     pending: psi_avgs_work, fuse_check_timeout
	 workqueue events_freezable: flags=0x4
	   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: pci_pme_list_scan
	 workqueue mm_percpu_wq: flags=0x8
	   pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=1 refcnt=2
	     pending: vmstat_update
	 Showing backtraces of running workers in stalled CPU-bound worker pools:

Thanks for your help,
--breno

Re: [PATCH v2 0/5] workqueue: Detect stalled in-flight workers

Posted by Petr Mladek 3 weeks, 6 days ago

On Fri 2026-03-13 05:24:54, Breno Leitao wrote:
> Hello Petr,
> 
> On Thu, Mar 12, 2026 at 05:38:26PM +0100, Petr Mladek wrote:
> > On Thu 2026-03-05 08:15:36, Breno Leitao wrote:
> > > There is a blind spot exists in the work queue stall detecetor (aka
> > > show_cpu_pool_hog()). It only prints workers whose task_is_running() is
> > > true, so a busy worker that is sleeping (e.g. wait_event_idle())
> > > produces an empty backtrace section even though it is the cause of the
> > > stall.
> > > 
> > > Additionally, when the watchdog does report stalled pools, the output
> > > doesn't show how long each in-flight work item has been running, making
> > > it harder to identify which specific worker is stuck.
> > > 
> > > Example of the sample code:
> > > 
> > >     BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 132s!
> > >     Showing busy workqueues and worker pools:
> > >     workqueue events: flags=0x100
> > >         pwq 18: cpus=4 node=0 flags=0x0 nice=0 active=4 refcnt=5
> > >         in-flight: 178:stall_work1_fn [wq_stall]
> > >         pending: stall_work2_fn [wq_stall], free_obj_work, psi_avgs_work
> > > 	...
> > >     Showing backtraces of running workers in stalled
> > >     CPU-bound worker pools:
> > >         <nothing here>
> > > 
> > > I see it happening on real machines, causing some stalls that doesn't
> > > have any backtrace. This is one of the code path:
> > > 
> > >   1) kfence executes toggle_allocation_gate() as a delayed workqueue
> > >      item (kfence_timer) on the system WQ.
> > > 
> > >   2) toggle_allocation_gate() enables a static key, which IPIs every
> > >      CPU to patch code:
> > >           static_branch_enable(&kfence_allocation_key);
> > > 
> > >   3) toggle_allocation_gate() then sleeps in TASK_IDLE waiting for a
> > >      kfence allocation to occur:
> > >           wait_event_idle(allocation_wait,
> > >                   atomic_read(&kfence_allocation_gate) > 0 || ...);
> > > 
> > >      This can last indefinitely if no allocation goes through the
> > >      kfence path (or IPIing all the CPUs take longer, which is common on
> > >      platforms that do not have NMI).
> > > 
> > >      The worker remains in the pool's busy_hash
> > >      (in-flight) but is no longer task_is_running().
> > >
> > >   4) The workqueue watchdog detects the stall and calls
> > >      show_cpu_pool_hog(), which only prints backtraces for workers
> > >      that are actively running on CPU:
> > > 
> > >           static void show_cpu_pool_hog(struct worker_pool *pool) {
> > >                   ...
> > >                   if (task_is_running(worker->task))
> > >                           sched_show_task(worker->task);
> > >           }
> > > 
> > >   5) Nothing is printed because the offending worker is in TASK_IDLE
> > >      state. The output shows "Showing backtraces of running workers in
> > >      stalled CPU-bound worker pools:" followed by nothing, effectively
> > >      hiding the actual culprit.
> > 
> > I am trying to better understand the situation. There was a reason
> > why only the worker in the running state was shown.
> > 
> > Normally, a sleeping worker should not cause a stall. The scheduler calls
> > wq_worker_sleeping() which should wake up another idle worker. There is
> > always at least one idle worker in the poll. It should start processing
> > the next pending work. Or it should fork another worker when it was
> > the last idle one.
> 
> Right, but let's look at this case:
> 
> 	 BUG: workqueue lockup - pool 55 cpu 13 curr 0 (swapper/13) stack ffff800085640000 cpus=13 node=0 flags=0x0 nice=-20 stuck for 679s!
> 	  work func=blk_mq_timeout_work data=0xffff0000ad7e3a05
> 	  Showing busy workqueues and worker pools:
> 	  workqueue events_unbound: flags=0x2
> 	    pwq 288: cpus=0-71 flags=0x4 nice=0 active=1 refcnt=2
> 	      in-flight: 4083734:btrfs_extent_map_shrinker_worker
> 	  workqueue mm_percpu_wq: flags=0x8
> 	    pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=1 refcnt=2
> 	      pending: vmstat_update
> 	  pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 1715676 4086805 3860852 3587585 4065550 4014041 3944711 3744484
> 	  Showing backtraces of running workers in stalled CPU-bound worker pools:
> 		# Nothing in here
> 
> It seems CPU 13 is idle (curr = 0) and blk_mq_timeout_work has been pending for
> 679s ?

It looks like progress is not blocked by an overloaded CPU.

One interesting thing is that there is no "pwq XXX: cpus=13" in the list
of busy workqueues and worker pools. IMHO, the watchdog should report
a stall only when there is pending work, so this report does not make
much sense to me.

BTW: I looked at pr_cont_pool_info() in mainline and it does
not print the name of the current process and its stack address.
I guess that it is printed by another debugging patch?


> 	  pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 17

> > I wonder what blocked the idle worker from waking or forking
> > a new worker. Was it caused by the IPIs?
> 
> Not sure, keep in mind that these hosts (arm64) do not have NMI, so,
> IPIs are just regular interrupts that could take a long time to be handled. The
> toggle_allocation_gate() was good example, given it was sending IPIs very
> frequently and I took it as an example for the cover letter, but, this problem
> also show up with diferent places. (more examples later)
> 
> > Did printing the sleeping workers helped to analyze the problem?
> 
> That is my hope. I don't have a reproducer other than the one in this
> patchset.

Good to know. Note that the reproducer is not "realistic".
PF_WQ_WORKER is an internal flag and must not be manipulated
by the queued work callbacks. It is like shooting yourself in the foot.

> I am currently rolling this patchset to production, and I can report once
> I get more information.

That would be great. I am really curious what the root problem is here.


> > I wonder if we could do better in this case. For example, warn
> > that the scheduler failed to wake up another idle worker when
> > no worker is in the running state. And maybe, print backtrace
> > of the currently running process on the given CPU because it
> > likely blocks waking/scheduling the idle worker.
> 
> I am happy to improve this, given this has been a hard issue. let me give more
> instances of the "empty" stalls I am seeing. All with empty backtraces:
> 
> # Instance 1
> 	 BUG: workqueue lockup - pool cpus=33 node=0 flags=0x0 nice=0 stuck for 33s!
> 	 Showing busy workqueues and worker pools:
> 	 workqueue events: flags=0x0
> 	   pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=3 refcnt=4
> 	     pending: 3*psi_avgs_work
> 	   pwq 218: cpus=54 node=0 flags=0x0 nice=0 active=1 refcnt=2
> 	     in-flight: 842:key_garbage_collector
> 	 workqueue mm_percpu_wq: flags=0x8
> 	   pwq 134: cpus=33 node=0 flags=0x0 nice=0 active=1 refcnt=2
> 	     pending: vmstat_update
> 	 pool 218: cpus=54 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 11200 524627
> 	 Showing backtraces of running workers in stalled CPU-bound worker pools:
> 
> # Instance 2
> 	 BUG: workqueue lockup - pool cpus=53 node=0 flags=0x0 nice=0 stuck for 459s!
> 	 Showing busy workqueues and worker pools:
> 	 workqueue events: flags=0x0
> 	   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
> 	     pending: psi_avgs_work
> 	   pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=4 refcnt=5
> 	     pending: 2*psi_avgs_work, drain_local_memcg_stock, iova_depot_work_func
> 	 workqueue events_freezable: flags=0x4
> 	   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
> 	     pending: pci_pme_list_scan
> 	 workqueue slub_flushwq: flags=0x8
> 	   pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=3
> 	     pending: flush_cpu_slab BAR(7520)
> 	 workqueue mm_percpu_wq: flags=0x8
> 	   pwq 214: cpus=53 node=0 flags=0x0 nice=0 active=1 refcnt=2
> 	     pending: vmstat_update
> 	 workqueue mlx5_cmd_0002:03:00.1: flags=0x6000a
> 	   pwq 576: cpus=0-143 flags=0x4 nice=0 active=1 refcnt=146
> 	     pending: cmd_work_handler
> 	 Showing backtraces of running workers in stalled CPU-bound worker pools:
> 
> # Instance 3
> 	 BUG: workqueue lockup - pool cpus=74 node=1 flags=0x0 nice=0 stuck for 31s!
> 	 Showing busy workqueues and worker pools:
> 	 workqueue mm_percpu_wq: flags=0x8
> 	   pwq 298: cpus=74 node=1 flags=0x0 nice=0 active=1 refcnt=2
> 	     pending: vmstat_update
> 	 Showing backtraces of running workers in stalled CPU-bound worker pools:	
> 
> # Instance 4
> 	 BUG: workqueue lockup - pool cpus=71 node=0 flags=0x0 nice=0 stuck for 32s!
> 	 Showing busy workqueues and worker pools:
> 	 workqueue events: flags=0x0
> 	   pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=2 refcnt=3
> 	     pending: psi_avgs_work, fuse_check_timeout
> 	 workqueue events_freezable: flags=0x4
> 	   pwq 2: cpus=0 node=0 flags=0x0 nice=0 active=1 refcnt=2
> 	     pending: pci_pme_list_scan
> 	 workqueue mm_percpu_wq: flags=0x8
> 	   pwq 286: cpus=71 node=0 flags=0x0 nice=0 active=1 refcnt=2
> 	     pending: vmstat_update
> 	 Showing backtraces of running workers in stalled CPU-bound worker pools:

In all these cases, some pending work is listed on the stuck
"cpus=XXX". So, it looks more sane than the 1st report.

I agree that it looks ugly that it did not print any backtraces.
But I am not sure if the backtraces would help.

If there is no running worker then wq_worker_sleeping() should wake up
another idle worker. And if this is the last idle worker in the
per-CPU pool then it should create another worker.

Honestly, I think that there is only a small chance that the backtraces
of the sleeping workers will help to solve the problem.

IMHO, the problem is that wq_worker_sleeping() was not able to
guarantee forward progress. Note that there should always be
at least one idle worker in CPU-bound worker pools.

Now, there might be several reasons why it failed:

  1. It did not wake up any idle worker because it thought
     it had already been done, for example because of a messed-up
     worker->sleeping flag, worker->flags & WORKER_NOT_RUNNING flag,
     or pool->nr_running count.

     IMHO, the chance of this bug is small.


  2. The scheduler does not schedule the woken idle worker because:

	+ too big load
	+ soft/hardlockup on the given CPU
	+ the scheduler does not schedule anything, e.g. because of
	  stop_machine()

      It seems that this is not the case in the 1st example where
      the CPU is idle. But I am not sure how exactly the IPIs are
      handled on arm64.


   3. There always must be at least one idle worker in each pool.
      But the last idle worker never processes pending work.
      It has to create another worker instead.

      create_worker() might fail for several reasons:

	+ worker pool limit (is there any?)
	+ PID limit
	+ memory limit

      I have personally seen these problems caused by PID limit.
      Note that containers might have relatively small limits by
      default !!!

   4. ???


I think that it might be interesting to print backtrace and
state of the worker which is supposed to guarantee progress.
Is it "pool->manager" ?

Also create_worker() prints an error when it can't create a worker.
But the error is printed only once. And it might get lost on
huge systems with extensive load and logging.

Maybe we could add some global variable that allows printing
these errors once again when a workqueue stall is detected.

Or store some timestamps when the function tried to create a new worker
and when it succeeded last time. And print them in the stall report.

Best Regards,
Petr

Re: [PATCH v2 0/5] workqueue: Detect stalled in-flight workers

Posted by Breno Leitao 3 weeks, 6 days ago

On Fri, Mar 13, 2026 at 03:38:57PM +0100, Petr Mladek wrote:
> On Fri 2026-03-13 05:24:54, Breno Leitao wrote:

> > Right, but let's look at this case:
> > 
> > 	 BUG: workqueue lockup - pool 55 cpu 13 curr 0 (swapper/13) stack ffff800085640000 cpus=13 node=0 flags=0x0 nice=-20 stuck for 679s!
> > 	  work func=blk_mq_timeout_work data=0xffff0000ad7e3a05
> > 	  Showing busy workqueues and worker pools:
> > 	  workqueue events_unbound: flags=0x2
> > 	    pwq 288: cpus=0-71 flags=0x4 nice=0 active=1 refcnt=2
> > 	      in-flight: 4083734:btrfs_extent_map_shrinker_worker
> > 	  workqueue mm_percpu_wq: flags=0x8
> > 	    pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=1 refcnt=2
> > 	      pending: vmstat_update
> > 	  pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 1715676 4086805 3860852 3587585 4065550 4014041 3944711 3744484
> > 	  Showing backtraces of running workers in stalled CPU-bound worker pools:
> > 		# Nothing in here
> > 
> > It seems CPU 13 is idle (curr = 0) and blk_mq_timeout_work has been pending for
> > 679s ?
> 
> It looks like that progress is not blocked by an overloaded CPU.

Looking at the data address, it seems the low bits are always 0x5,
meaning that WORK_STRUCT_PENDING and WORK_STRUCT_PWQ are set, right?

So, the work has been pending for a huge amount of time (see more examples below).
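
For reference, the decoding of the low bits can be sketched in userspace
as below. The bit positions are an assumption taken from enum work_bits in
include/linux/workqueue.h (PENDING=0, INACTIVE=1, PWQ=2, LINKED=3) and
should be checked against the tree in use; both helper names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions; verify against enum work_bits in the kernel tree. */
#define WORK_STRUCT_PENDING	(1ULL << 0)
#define WORK_STRUCT_INACTIVE	(1ULL << 1)
#define WORK_STRUCT_PWQ		(1ULL << 2)
#define WORK_STRUCT_LINKED	(1ULL << 3)

/* Hypothetical helper: extract the four low flag bits from work->data. */
static unsigned int work_flag_bits(uint64_t data)
{
	return (unsigned int)(data & 0xf);
}

/* Hypothetical helper: true if the item is pending and points at a pwq. */
static bool work_is_pending_on_pwq(uint64_t data)
{
	return (data & WORK_STRUCT_PENDING) && (data & WORK_STRUCT_PWQ);
}
```

Applied to data=0xffff0000ad7e3a05 from the report above, the low nibble
is 0x5, i.e. PENDING and PWQ under these assumed bit positions.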

> One interesting thing is there is no "pwq XXX: cpus=13" in the list
> of busy workqueues and worker pools. IMHO, the watchdog should report
> a stall only when there is a pending work. It does not make much sense
> to me.
> 
> BTW: I look at pr_cont_pool_info() in the mainline and it does not
> not print the name of the current process and its stack address.
> I guess that it is printed by another debugging patch ?

Sorry, this was a simple change we got in initially, that is basically doing:

	void *curr_stack;
	curr_stack = try_get_task_stack(curr);
	pr_emerg("BUG: workqueue lockup - pool %d cpu %d curr %d (%s) stack %px",
		 pool->id, pool->cpu, curr->pid,
		 curr->comm, curr_stack);
> 
> 
> > 	  pool 288: cpus=0-71 flags=0x4 nice=0 hung=0s workers=17 idle: 3800629 3959700 3554824 3706405 3759881 4065549 4041361 4065548 17
> 
> > > I wonder what blocked the idle worker from waking or forking
> > > a new worker. Was it caused by the IPIs?
> > 
> > Not sure, keep in mind that these hosts (arm64) do not have NMI, so,
> > IPIs are just regular interrupts that could take a long time to be handled. The
> > toggle_allocation_gate() was good example, given it was sending IPIs very
> > frequently and I took it as an example for the cover letter, but, this problem
> > also show up with diferent places. (more examples later)
> > 
> > > Did printing the sleeping workers helped to analyze the problem?
> > 
> > That is my hope. I don't have a reproducer other than the one in this
> > patchset.
> 
> Good to know. Note that the reproducer is not "realistic".
> PF_WQ_WORKER is an internal flag and must not be manipulated
> by the queued work callbacks. It is like shooting yourself in the foot.

Ack!

> > I am currently rolling this patchset to production, and I can report once
> > I get more information.
> 
> That would be great. I am really curious what is the root problem here.

In fact, I got some instances of this issue with this new patchset, and
the backtrace is still empty. These are the only 3 instances I got with the new
patches applied. All of them with the "blk_mq_timeout_work" function.

	BUG: workqueue lockup - pool 11 cpu 2 curr 686384 (thrmon_agg) stack ffff8002bd200000 cpus=2 node=0 flags=0x0 nice=-20 stuck for 276s!
	   work func=blk_mq_timeout_work data=0xffff0000b88e3405
	   Showing busy workqueues and worker pools:
	   workqueue kblockd: flags=0x18
	     pwq 11: cpus=2 node=0 flags=0x0 nice=-20 active=1 refcnt=2
	       pending: blk_mq_timeout_work
	   Showing backtraces of busy workers in stalled CPU-bound worker pools:

	BUG: workqueue lockup - pool 7 cpu 1 curr 0 (swapper/1) stack ffff800084f80000 cpus=1 node=0 flags=0x0 nice=-20 stuck for 114s!
           work func=blk_mq_timeout_work data=0xffff0000b88e3205
           Showing busy workqueues and worker pools:
           workqueue events: flags=0x0
             pwq 510: cpus=127 node=1 flags=0x0 nice=0 active=1 refcnt=2
               pending: psi_avgs_work
           Showing backtraces of busy workers in stalled CPU-bound worker pools:

	BUG: workqueue lockup - pool 11 cpu 2 curr 24596 (mcrcfg-fci) stack ffff8002b5a40000 cpus=2 node=0 flags=0x0 nice=-20 stuck for 282s!
           work func=blk_mq_timeout_work data=0xffff0000b8706805
           Showing busy workqueues and worker pools:
           Showing backtraces of busy workers in stalled CPU-bound worker pools:

> In all these cases, there is listed some pending work on the stuck
> "cpus=XXX". So, it looks more sane than the 1st report.
> 
> I agree that it looks ugly that it did not print any backtraces.
> But I am not sure if the backtraces would help.
> 
> If there is no running worker then wq_worker_sleeping() should wake up
> another idle worker. And if this is the last idle worker in the
> per-CPU pool then it should create another worker.
> 
> Honestly, I think that there is only small chance that the backtraces
> of the sleeping workers will help to solve the problem.
> 
> IMHO, the problem is that wq_worker_sleeping() was not able to
> guarantee forward progress. Note that there should always be
> at least one idle worker on CPU-bound worker pools.
> 
> Now, there might be several reasons why it failed:
> 
>   1. It did not wake up any idle worker because it thought
>      it had already been done, for example because of a messed-up
>      worker->sleeping flag, worker->flags & WORKER_NOT_RUNNING flag,
>      or pool->nr_running count.
> 
>      IMHO, the chance of this bug is small.
> 
> 
>   2. The scheduler does not schedule the woken idle worker because:
> 
> 	+ too big load
> 	+ soft/hardlockup on the given CPU
> 	+ the scheduler does not schedule anything, e.g. because of
> 	  stop_machine()
> 
>       It seems that this is not the case in the 1st example where
>       the CPU is idle. But I am not sure how exactly the IPIs are
>       handled on arm64.

I don't have information about the load of those machines when the problem
happens, but in some cases the problem happens when there is no workload
(production job) running on those machines, so it is hard to assume that the
load is high.

>    3. There always must be at least one idle worker in each pool.
>       But the last idle worker never processes pending work.
>       It has to create another worker instead.
> 
>       create_worker() might fail for several reasons:
> 
> 	+ worker pool limit (is there any?)
> 	+ PID limit
> 	+ memory limit
> 
>       I have personally seen these problems caused by PID limit.
>       Note that containers might have relatively small limits by
>       default !!!

Might this justify the fact that WORK_STRUCT_PENDING bit is set for ~200
seconds?


> I think that it might be interesting to print backtrace and
> state of the worker which is supposed to guarantee progress.
> Is it "pool->manager" ?
> 
> Also create_worker() prints an error when it can't create worker.
> But the error is printed only once. And it might get lost on
> huge systems with extensive load and logging.

That is definitely not the case. I've scanned Meta's whole fleet for create_worker
errors, and there is a single instance on an unrelated host.

> Maybe we could add some global variable that allows printing
> these errors once again when a workqueue stall is detected.
> 
> Or store some timestamps when the function tried to create a new worker
> and when it succeeded last time. And print it in the stall report.

Re: [PATCH v2 0/5] workqueue: Detect stalled in-flight workers

Posted by Petr Mladek 3 weeks, 1 day ago

On Fri 2026-03-13 10:36:09, Breno Leitao wrote:
> On Fri, Mar 13, 2026 at 03:38:57PM +0100, Petr Mladek wrote:
> > On Fri 2026-03-13 05:24:54, Breno Leitao wrote:
> > > I am currently rolling this patchset to production, and I can report once
> > > I get more information.
> > 
> > That would be great. I am really curious what is the root problem here.
> 
> In fact, I got some instances of this issue with this new patchset, and
> the backtrace is still empty. These are the only 3 instances I got with the new
> patches applied. All of them with the "blk_mq_timeout_work" function.
> 
> 	BUG: workqueue lockup - pool 11 cpu 2 curr 686384 (thrmon_agg) stack ffff8002bd200000 cpus=2 node=0 flags=0x0 nice=-20 stuck for 276s!
> 	   work func=blk_mq_timeout_work data=0xffff0000b88e3405
> 	   Showing busy workqueues and worker pools:
> 	   workqueue kblockd: flags=0x18
> 	     pwq 11: cpus=2 node=0 flags=0x0 nice=-20 active=1 refcnt=2
> 	       pending: blk_mq_timeout_work

This report is showing the stalled "pool 11" in the list of busy
worker pools.


> 	   Showing backtraces of busy workers in stalled CPU-bound worker pools:
> 
> 	BUG: workqueue lockup - pool 7 cpu 1 curr 0 (swapper/1) stack ffff800084f80000 cpus=1 node=0 flags=0x0 nice=-20 stuck for 114s!
>            work func=blk_mq_timeout_work data=0xffff0000b88e3205
>            Showing busy workqueues and worker pools:
>            workqueue events: flags=0x0
>              pwq 510: cpus=127 node=1 flags=0x0 nice=0 active=1 refcnt=2
>                pending: psi_avgs_work

It is strange that "pwq 7" is not listed here.

>            Showing backtraces of busy workers in stalled CPU-bound worker pools:
> 
> 	BUG: workqueue lockup - pool 11 cpu 2 curr 24596 (mcrcfg-fci) stack ffff8002b5a40000 cpus=2 node=0 flags=0x0 nice=-20 stuck for 282s!
>            work func=blk_mq_timeout_work data=0xffff0000b8706805
>            Showing busy workqueues and worker pools:

And the list of busy worker pools is even empty here.

>            Showing backtraces of busy workers in stalled CPU-bound worker pools:

I would expect that the stalled pool was shown by show_one_workqueue().

show_one_workqueue() checks pwq->nr_active instead of
list_empty(&pool->worklist). But my understanding is that work items
added to pool->worklist should be counted by the related
pwq->nr_active. In fact, pwq->nr_active seems to be decremented
only when the work is processed or removed from the queue. So
it should be counted as nr_active even when it is already in progress.
As a result, show_one_workqueue() should print even pools which have
the last assigned work in-flight.

Maybe I am missing something. For example, the barriers are not counted
as nr_active, ...

Anyway, the backtrace of the last woken worker might give us
some pointers. It might show that the pool is stuck on some
wq_barrier or so.
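
To make the suspected mismatch concrete, here is a toy userspace model of the
two predicates. It is only a sketch of the hypothesis above, not the kernel's
data structures: struct model_pool and all helper names are made up, and the
rule "a barrier lands on the worklist but is not counted in nr_active" is the
assumption being illustrated.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; field names loosely follow the discussion above. */
struct model_pool {
	int worklist_len;	/* items linked on the pool worklist */
	int nr_active;		/* items counted as active by the pwq */
};

/* A regular work item is both linked and counted. */
static inline void model_queue_work(struct model_pool *p)
{
	assert(p);
	p->worklist_len++;
	p->nr_active++;
}

/* ASSUMPTION: a barrier is linked on the worklist but not counted. */
static inline void model_queue_barrier(struct model_pool *p)
{
	assert(p);
	p->worklist_len++;
}

/* Stand-in for the watchdog check, !list_empty(&pool->worklist). */
static inline bool model_watchdog_sees_work(const struct model_pool *p)
{
	return p->worklist_len > 0;
}

/* Stand-in for the report filter that looks at pwq->nr_active. */
static inline bool model_report_lists_pool(const struct model_pool *p)
{
	return p->nr_active > 0;
}
```

In this model, a pool that holds only a barrier makes
model_watchdog_sees_work() return true while model_report_lists_pool()
returns false, which would match a lockup message with an empty list of busy
pools. Whether that is what happens on the real hosts is still open.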

> > In all these cases, there is listed some pending work on the stuck
> > "cpus=XXX". So, it looks more sane than the 1st report.
> > 
> > I agree that it looks ugly that it did not print any backtraces.
> > But I am not sure if the backtraces would help.
> > 
> > If there is no running worker then wq_worker_sleeping() should wake up
> > another idle worker. And if this is the last idle worker in the
> > per-CPU pool then it should create another worker.
> > 
> > Honestly, I think that there is only small chance that the backtraces
> > of the sleeping workers will help to solve the problem.
> > 
> > IMHO, the problem is that wq_worker_sleeping() was not able to
> > guarantee forward progress. Note that there should always be
> > at least one idle worker on CPU-bound worker pools.
> > 
> > Now, there might be several reasons why it failed:
> > 
> >   1. It did not wake up any idle worker because it thought
> >      it had already been done, for example because of a messed-up
> >      worker->sleeping flag, worker->flags & WORKER_NOT_RUNNING flag,
> >      or pool->nr_running count.
> > 
> >      IMHO, the chance of this bug is small.
> > 
> > 
> >   2. The scheduler does not schedule the woken idle worker because:
> > 
> > 	+ too big load
> > 	+ soft/hardlockup on the given CPU
> > 	+ the scheduler does not schedule anything, e.g. because of
> > 	  stop_machine()
> > 
> >       It seems that this is not the case in the 1st example where
> >       the CPU is idle. But I am not sure how exactly the IPIs are
> >       handled on arm64.
> 
> I don't have information about the load of those machines when the problem
> happens, but in some cases the problem happens when there is no workload
> (production job) running on those machines, so it is hard to assume that the
> load is high.
> 
> >    3. There always must be at least one idle worker in each pool.
> >       But the last idle worker never processes pending work.
> >       It has to create another worker instead.
> > 
> >       create_worker() might fail for several reasons:
> > 
> > 	+ worker pool limit (is there any?)
> > 	+ PID limit
> > 	+ memory limit
> > 
> >       I have personally seen these problems caused by PID limit.
> >       Note that containers might have relatively small limits by
> >       default !!!
> 
> Might this justify the fact that WORK_STRUCT_PENDING bit is set for ~200
> seconds?
> 
> 
> > I think that it might be interesting to print backtrace and
> > state of the worker which is supposed to guarantee progress.
> > Is it "pool->manager" ?
> > 
> > Also create_worker() prints an error when it can't create worker.
> > But the error is printed only once. And it might get lost on
> > huge systems with extensive load and logging.
> 
> That is definitely not the case. I've scanned Meta's whole fleet for create_worker
> errors, and there is a single instance on an unrelated host.

Good to know. I am more and more curious what would be the culprit
here.

Best Regards,
Petr

Re: [PATCH v2 0/5] workqueue: Detect stalled in-flight workers

Posted by Breno Leitao 2 weeks, 6 days ago

> >            Showing backtraces of busy workers in stalled CPU-bound worker pools:
> 
> I would expect that the stalled pool was shown by show_one_workqueue().
> 
> show_one_workqueue() checks pwq->nr_active instead of
> list_empty(&pool->worklist). But my understanding is that work items
> added to pool->worklist should be counted by the related
> pwq->nr_active. In fact, pwq->nr_active seems to be decremented
> only when the work is processed or removed from the queue. So
> it should be counted as nr_active even when it is already in progress.
> As a result, show_one_workqueue() should print even pools which have
> the last assigned work in-flight.
> 
> Maybe I am missing something. For example, the barriers are not counted
> as nr_active, ...

After a quick chat with Song, he believes that we need a barrier between
adding to the worklist and updating last_progress_ts. Specifically, the
watchdog can see a non-empty worklist (from a list_add) while reading
a stale last_progress_ts value, causing a false positive stall report.
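
The ordering concern can be sketched in userspace with C11 atomics. This is
only an illustration under stated assumptions, not the proposed kernel change:
struct ts_pool, has_work (standing in for a non-empty worklist) and both
helpers are made up, and the timestamp is refreshed on every queue operation
for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a worker pool; not the kernel's struct worker_pool. */
struct ts_pool {
	unsigned long last_progress_ts;	/* plain field, written by the queuer */
	atomic_bool has_work;		/* models !list_empty(worklist) */
};

/*
 * Queuer side: refresh the timestamp first, then publish the work with
 * release semantics so the timestamp store cannot be observed after the
 * "list is non-empty" store.
 */
static inline void ts_pool_queue(struct ts_pool *p, unsigned long now)
{
	assert(p);
	p->last_progress_ts = now;
	atomic_store_explicit(&p->has_work, true, memory_order_release);
}

/*
 * Watchdog side: the acquire load pairs with the release store above, so a
 * reader that sees has_work == true also sees the refreshed timestamp.
 * Returns true when the pool looks stalled.
 */
static inline bool ts_pool_stalled(struct ts_pool *p, unsigned long now,
				   unsigned long thresh)
{
	assert(p);
	if (!atomic_load_explicit(&p->has_work, memory_order_acquire))
		return false;
	return now - p->last_progress_ts > thresh;
}
```

Without the release/acquire pairing, a weakly ordered CPU could let the
watchdog observe has_work == true together with the old last_progress_ts,
which is the false positive described above. In kernel code the analogous
tools would be smp_store_release()/smp_load_acquire() or an
smp_wmb()/smp_rmb() pair; which one fits the workqueue code is not settled
in this thread.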

Re: [PATCH v2 0/5] workqueue: Improve stall diagnostics

Posted by Tejun Heo 1 month ago

Hello,

> Breno Leitao (5):
>   workqueue: Use POOL_BH instead of WQ_BH when checking pool flags
>   workqueue: Rename pool->watchdog_ts to pool->last_progress_ts
>   workqueue: Show in-flight work item duration in stall diagnostics
>   workqueue: Show all busy workers in stall diagnostics
>   workqueue: Add stall detector sample module

Applied 1-5 to wq/for-7.0-fixes.

One minor note for a future follow-up: show_cpu_pool_hog() and
show_cpu_pools_hogs() function names no longer reflect the broadened
scope after patch 4 - they now dump all busy workers, not just CPU
hogs.

Thanks.

--
tejun