[PATCH] KVM: irqfd: fix shutdown deadlock by moving SRCU sync outside resampler_lock

Posted by Sonam Sanju 3 weeks ago
irqfd_resampler_shutdown() calls synchronize_srcu_expedited() while
holding kvm->irqfds.resampler_lock.  This can deadlock when multiple
irqfd_shutdown workers run concurrently on the kvm-irqfd-cleanup
workqueue during VM teardown (e.g. crosvm shutdown on Android):

  CPU A (mutex holder)               CPU B/C/D (mutex waiters)
  irqfd_shutdown()                   irqfd_shutdown()
   irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
    mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock)  // BLOCKED
    list_del_rcu(...)                     ...blocked...
    synchronize_srcu_expedited()      // Waiters block workqueue,
      // waits for SRCU grace            preventing SRCU grace
      // period which requires            period from completing
      // workqueue progress          --- DEADLOCK ---

The synchronize_srcu_expedited() in the else branch is called directly
while the mutex is held.  In the last-irqfd branch,
kvm_unregister_irq_ack_notifier() also calls synchronize_srcu_expedited()
internally.  Both paths can block indefinitely because:

  1. synchronize_srcu_expedited() waits for an SRCU grace period
  2. SRCU grace period completion needs workqueue workers to run
  3. The blocked mutex waiters occupy workqueue slots, preventing
     progress
  4. The mutex holder never releases the lock -> deadlock

Fix by performing all list manipulations and the last-entry check under
the mutex, then releasing the mutex before the SRCU synchronization.
This is safe because:

  - list_del_rcu() removes the irqfd from resampler->list under the
    mutex, so no concurrent reader or writer can access it.
  - When last==true, list_del_rcu(&resampler->link) has already removed
    the resampler from kvm->irqfds.resampler_list under the mutex, so
    no other worker can find or operate on this resampler.
  - kvm_unregister_irq_ack_notifier() uses its own locking
    (kvm->irq_lock) and is safe to call without resampler_lock.
  - synchronize_srcu_expedited() does not require any KVM mutex.
  - kfree(resampler) is safe after SRCU sync guarantees all readers
    have finished.

Signed-off-by: Sonam Sanju <sonam.sanju@intel.com>
---
 virt/kvm/eventfd.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 0e8b8a2c5b79..27bcf2b1a81d 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -93,6 +93,7 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
 {
 	struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
 	struct kvm *kvm = resampler->kvm;
+	bool last = false;
 
 	mutex_lock(&kvm->irqfds.resampler_lock);
 
@@ -100,19 +101,27 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
 
 	if (list_empty(&resampler->list)) {
 		list_del_rcu(&resampler->link);
+		last = true;
+	}
+
+	mutex_unlock(&kvm->irqfds.resampler_lock);
+
+	/*
+	 * synchronize_srcu_expedited() (called explicitly below, or internally
+	 * by kvm_unregister_irq_ack_notifier()) must not be invoked under
+	 * resampler_lock.  Holding the mutex while waiting for an SRCU grace
+	 * period creates a deadlock: the blocked mutex waiters occupy workqueue
+	 * slots that the SRCU grace period machinery needs to make forward
+	 * progress.
+	 */
+	if (last) {
 		kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
-		/*
-		 * synchronize_srcu_expedited(&kvm->irq_srcu) already called
-		 * in kvm_unregister_irq_ack_notifier().
-		 */
 		kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
 			    resampler->notifier.gsi, 0, false);
 		kfree(resampler);
 	} else {
 		synchronize_srcu_expedited(&kvm->irq_srcu);
 	}
-
-	mutex_unlock(&kvm->irqfds.resampler_lock);
 }
 
 /*
-- 
2.34.1
Re: [PATCH] KVM: irqfd: fix shutdown deadlock by moving SRCU sync outside resampler_lock
Posted by Sonam Sanju 2 weeks, 6 days ago
From: Sonam Sanju <sonam.sanju@intel.com>

On Mon, Mar 16, 2026 at 12:50:26PM +0530, Sonam Sanju wrote:
> From: Sonam Sanju <sonam.sanju@intel.com>
> 
> irqfd_resampler_shutdown() calls synchronize_srcu_expedited() while
> holding kvm->irqfds.resampler_lock.  This can deadlock when multiple

Adding Sean Christopherson to CC (apologies for the omission in the original 
submission).

--
Sonam
Re: [PATCH] KVM: irqfd: fix shutdown deadlock by moving SRCU sync outside resampler_lock
Posted by Vineeth Pillai (Google) 2 weeks, 3 days ago
Hi Sonam,

> irqfd_resampler_shutdown() calls synchronize_srcu_expedited() while
> holding kvm->irqfds.resampler_lock.  This can deadlock when multiple
> irqfd_shutdown workers run concurrently on the kvm-irqfd-cleanup
> workqueue during VM teardown (e.g. crosvm shutdown on Android):
> 
>   CPU A (mutex holder)               CPU B/C/D (mutex waiters)
>   irqfd_shutdown()                   irqfd_shutdown()
>    irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
>     mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock)  // BLOCKED
>     list_del_rcu(...)                     ...blocked...
>     synchronize_srcu_expedited()      // Waiters block workqueue,
>       // waits for SRCU grace            preventing SRCU grace
>       // period which requires            period from completing
>       // workqueue progress          --- DEADLOCK ---

I think we might have this issue in the kvm_irqfd_assign path as well
where synchronize_srcu_expedited is called with the resampler_lock
held.  I saw a similar lockup during a stress test where VMs were created
and destroyed continuously.  I could see one task waiting on an SRCU GP:

[   T93] task:crosvm_security state:D stack:0     pid:8215  tgid:8215  ppid:1      task_flags:0x400000 flags:0x00080002.
[   T93] Call Trace:
[   T93]  <TASK>
[   T93]  __schedule+0x87a/0xd60
[   T93]  schedule+0x5e/0xe0
[   T93]  schedule_timeout+0x2e/0x130
[   T93]  ? queue_delayed_work_on+0x7f/0xd0
[   T93]  wait_for_common+0xf7/0x1f0
[   T93]  synchronize_srcu_expedited+0x109/0x140
[   T93]  ? __cfi_wakeme_after_rcu+0x10/0x10
[   T93]  kvm_irqfd+0x362/0x5e0
[   T93]  kvm_vm_ioctl+0x706/0x780
[   T93]  ? fd_install+0x2c/0xf0
[   T93]  __se_sys_ioctl+0x7a/0xd0
[   T93]  do_syscall_64+0x61/0xf10
[   T93]  ? arch_exit_to_user_mode_prepare+0x9/0xb0
[   T93]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   T93] RIP: 0033:0x79048f9bdd67
[   T93] RSP: 002b:00007ffc3aa82028 EFLAGS: 00000206.

And another task waiting on the mutex:

[    C0] task:kworker/11:2    state:R  running task     stack:0     pid:25180 tgid:25180 ppid:2      task_flags:0x4208060 flags:0x00080000
[    C0] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
[    C0] Call Trace:
[    C0]  <TASK>
[    C0]  __schedule+0x87a/0xd60
[    C0]  schedule+0x5e/0xe0
[    C0]  schedule_preempt_disabled+0x10/0x20
[    C0]  __mutex_lock+0x413/0xe40
[    C0]  irqfd_resampler_shutdown+0x23/0x150
[    C0]  irqfd_shutdown+0x66/0xc0
[    C0]  process_scheduled_works+0x219/0x450
[    C0]  worker_thread+0x30b/0x450
[    C0]  ? __cfi_worker_thread+0x10/0x10
[    C0]  kthread+0x230/0x270
[    C0]  ? __cfi_kthread+0x10/0x10
[    C0]  ret_from_fork+0xf2/0x150
[    C0]  ? __cfi_kthread+0x10/0x10
[    C0]  ret_from_fork_asm+0x1a/0x30
[    C0]  </TASK>

The workqueue was full as well, I think:

[    C0]   pwq 46: cpus=11 node=0 flags=0x0 nice=0 active=1024 refcnt=2062

There were other tasks waiting for SRCU GP completion in the resampler
shutdown path. Also, there were other traces showing lockups (mostly in
mm), but I think that's a secondary effect of this lockup and might not
be relevant. I can provide the full logs if needed.

Please have a look and see if this path needs to be handled to fully fix
this issue.

Thanks,
Vineeth
Re: [PATCH] KVM: irqfd: fix shutdown deadlock by moving SRCU sync outside resampler_lock
Posted by Sonam Sanju 2 weeks ago
On Fri, Mar 20, 2026 at 08:56:33AM -0400, Vineeth Pillai (Google) wrote:
> I think we might have this issue in the kvm_irqfd_assign path as well
> where synchronize_srcu_expedited is called with the resampler_lock
> held.  I saw a similar lockup during a stress test where VMs were created
> and destroyed continuously.  I could see one task waiting on an SRCU GP:
> 
> [   T93] task:crosvm_security state:D stack:0     pid:8215  tgid:8215  ppid:1      task_flags:0x400000 flags:0x00080002.
> [   T93] Call Trace:
> [   T93]  synchronize_srcu_expedited+0x109/0x140
> [   T93]  kvm_irqfd+0x362/0x5e0
> [   T93]  kvm_vm_ioctl+0x706/0x780
> 
> And another task waiting on the mutex:
> 
> [    C0] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> [    C0]  __mutex_lock+0x413/0xe40
> [    C0]  irqfd_resampler_shutdown+0x23/0x150
> [    C0]  irqfd_shutdown+0x66/0xc0
> 
> The work queue was full as well I think:
> 
> [    C0]   pwq 46: cpus=11 node=0 flags=0x0 nice=0 active=1024 refcnt=2062

Yes, you are right.  The kvm_irqfd_assign() path has the same deadlock pattern.

> There were other tasks waiting for SRCU GP completion in the resampler
> shutdown path. Also, there were other traces showing lockups (mostly in
> mm), but I think that's a secondary effect of this lockup and might not
> be relevant.

Yes, that matches what we see on our side as well — the primary deadlock
in the KVM irqfd paths causes cascading failures: workqueue starvation
leads to blocked do_sync_work (superblock sync), fsnotify workers stuck
on __synchronize_srcu, and eventually init (pid 1) blocks in
ext4_put_super -> __flush_work.  The mm lockups you see are almost
certainly secondary effects.

Will send v2 shortly with both paths fixed in a single patch.

-- 
Sonam Sanju
[PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Sonam Sanju 2 weeks ago
irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
This can deadlock when multiple irqfd workers run concurrently on the
kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
created and destroyed:

  CPU A (mutex holder)               CPU B/C/D (mutex waiters)
  irqfd_shutdown()                   irqfd_shutdown() / kvm_irqfd_assign()
   irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
    mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock) //BLOCKED
    list_del_rcu(...)                     ...blocked...
    synchronize_srcu_expedited()      // Waiters block workqueue,
      // waits for SRCU grace            preventing SRCU grace
      // period which requires            period from completing
      // workqueue progress          --- DEADLOCK ---

In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
the else branch is called directly while the mutex is held.  In the
last-irqfd branch, kvm_unregister_irq_ack_notifier() also calls
synchronize_srcu_expedited() internally.  In kvm_irqfd_assign(),
synchronize_srcu_expedited() is called after list_add_rcu() but
before mutex_unlock().  All paths can block indefinitely because:

  1. synchronize_srcu_expedited() waits for an SRCU grace period
  2. SRCU grace period completion needs workqueue workers to run
  3. The blocked mutex waiters occupy workqueue slots preventing progress
  4. The mutex holder never releases the lock -> deadlock

Fix both paths by releasing the mutex before calling
synchronize_srcu_expedited().

In irqfd_resampler_shutdown(), use a bool last flag to track whether
this is the final irqfd for the resampler, then release the mutex
before the SRCU synchronization.  This is safe because list_del_rcu()
already removed the entries under the mutex, and
kvm_unregister_irq_ack_notifier() uses its own locking (kvm->irq_lock).

In kvm_irqfd_assign(), simply move synchronize_srcu_expedited() after
mutex_unlock().  The SRCU grace period still completes before the irqfd
goes live (the subsequent srcu_read_lock() ensures ordering).

Signed-off-by: Sonam Sanju <sonam.sanju@intel.com>
---
v2:
 - Fix the same deadlock in kvm_irqfd_assign() (Vineeth Pillai)

 virt/kvm/eventfd.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 0e8b8a2c5b79..8ae9f81f8bb3 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -93,6 +93,7 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
 {
 	struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
 	struct kvm *kvm = resampler->kvm;
+	bool last = false;
 
 	mutex_lock(&kvm->irqfds.resampler_lock);
 
@@ -100,19 +101,27 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
 
 	if (list_empty(&resampler->list)) {
 		list_del_rcu(&resampler->link);
+		last = true;
+	}
+
+	mutex_unlock(&kvm->irqfds.resampler_lock);
+
+	/*
+	 * synchronize_srcu_expedited() (called explicitly below, or internally
+	 * by kvm_unregister_irq_ack_notifier()) must not be invoked under
+	 * resampler_lock.  Holding the mutex while waiting for an SRCU grace
+	 * period creates a deadlock: the blocked mutex waiters occupy workqueue
+	 * slots that the SRCU grace period machinery needs to make forward
+	 * progress.
+	 */
+	if (last) {
 		kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
-		/*
-		 * synchronize_srcu_expedited(&kvm->irq_srcu) already called
-		 * in kvm_unregister_irq_ack_notifier().
-		 */
 		kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
 			    resampler->notifier.gsi, 0, false);
 		kfree(resampler);
 	} else {
 		synchronize_srcu_expedited(&kvm->irq_srcu);
 	}
-
-	mutex_unlock(&kvm->irqfds.resampler_lock);
 }
 
 /*
@@ -450,9 +459,16 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 		}
 
 		list_add_rcu(&irqfd->resampler_link, &irqfd->resampler->list);
-		synchronize_srcu_expedited(&kvm->irq_srcu);
 
 		mutex_unlock(&kvm->irqfds.resampler_lock);
+
+		/*
+		 * Ensure the resampler_link is SRCU-visible before the irqfd
+		 * itself goes live.  Moving synchronize_srcu_expedited() outside
+		 * the resampler_lock avoids deadlock with shutdown workers waiting
+		 * for the mutex while SRCU waits for workqueue progress.
+		 */
+		synchronize_srcu_expedited(&kvm->irq_srcu);
 	}
 
 	/*
-- 
2.34.1
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Kunwu Chan 5 days, 17 hours ago
On 3/23/26 14:42, Sonam Sanju wrote:
> irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
> synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
> This can deadlock when multiple irqfd workers run concurrently on the
> kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
> created and destroyed:
>
>   CPU A (mutex holder)               CPU B/C/D (mutex waiters)
>   irqfd_shutdown()                   irqfd_shutdown() / kvm_irqfd_assign()
>    irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
>     mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock) //BLOCKED
>     list_del_rcu(...)                     ...blocked...
>     synchronize_srcu_expedited()      // Waiters block workqueue,
>       // waits for SRCU grace            preventing SRCU grace
>       // period which requires            period from completing
>       // workqueue progress          --- DEADLOCK ---
>
> In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
> the else branch is called directly while the mutex is held.  In the
> last-irqfd branch, kvm_unregister_irq_ack_notifier() also calls
> synchronize_srcu_expedited() internally.  In kvm_irqfd_assign(),
> synchronize_srcu_expedited() is called after list_add_rcu() but
> before mutex_unlock().  All paths can block indefinitely because:
>
>   1. synchronize_srcu_expedited() waits for an SRCU grace period
>   2. SRCU grace period completion needs workqueue workers to run
>   3. The blocked mutex waiters occupy workqueue slots preventing progress
>   4. The mutex holder never releases the lock -> deadlock
>
> Fix both paths by releasing the mutex before calling
> synchronize_srcu_expedited().
>
> In irqfd_resampler_shutdown(), use a bool last flag to track whether
> this is the final irqfd for the resampler, then release the mutex
> before the SRCU synchronization.  This is safe because list_del_rcu()
> already removed the entries under the mutex, and
> kvm_unregister_irq_ack_notifier() uses its own locking (kvm->irq_lock).
>
> In kvm_irqfd_assign(), simply move synchronize_srcu_expedited() after
> mutex_unlock().  The SRCU grace period still completes before the irqfd
> goes live (the subsequent srcu_read_lock() ensures ordering).
>
> Signed-off-by: Sonam Sanju <sonam.sanju@intel.com>
> ---
> v2:
>  - Fix the same deadlock in kvm_irqfd_assign() (Vineeth Pillai)
>
>  virt/kvm/eventfd.c | 30 +++++++++++++++++++++++-------
>  1 file changed, 23 insertions(+), 7 deletions(-)
>
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 0e8b8a2c5b79..8ae9f81f8bb3 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -93,6 +93,7 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
>  {
>  	struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
>  	struct kvm *kvm = resampler->kvm;
> +	bool last = false;
>  
>  	mutex_lock(&kvm->irqfds.resampler_lock);
>  
> @@ -100,19 +101,27 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
>  
>  	if (list_empty(&resampler->list)) {
>  		list_del_rcu(&resampler->link);
> +		last = true;
> +	}
> +
> +	mutex_unlock(&kvm->irqfds.resampler_lock);
> +
> +	/*
> +	 * synchronize_srcu_expedited() (called explicitly below, or internally
> +	 * by kvm_unregister_irq_ack_notifier()) must not be invoked under
> +	 * resampler_lock.  Holding the mutex while waiting for an SRCU grace
> +	 * period creates a deadlock: the blocked mutex waiters occupy workqueue
> +	 * slots that the SRCU grace period machinery needs to make forward
> +	 * progress.
> +	 */
> +	if (last) {
>  		kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
> -		/*
> -		 * synchronize_srcu_expedited(&kvm->irq_srcu) already called
> -		 * in kvm_unregister_irq_ack_notifier().
> -		 */
>  		kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
>  			    resampler->notifier.gsi, 0, false);
>  		kfree(resampler);
>  	} else {
>  		synchronize_srcu_expedited(&kvm->irq_srcu);
>  	}
> -
> -	mutex_unlock(&kvm->irqfds.resampler_lock);
>  }
>  
>  /*
> @@ -450,9 +459,16 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
>  		}
>  
>  		list_add_rcu(&irqfd->resampler_link, &irqfd->resampler->list);
> -		synchronize_srcu_expedited(&kvm->irq_srcu);
>  
>  		mutex_unlock(&kvm->irqfds.resampler_lock);
> +
> +		/*
> +		 * Ensure the resampler_link is SRCU-visible before the irqfd
> +		 * itself goes live.  Moving synchronize_srcu_expedited() outside
> +		 * the resampler_lock avoids deadlock with shutdown workers waiting
> +		 * for the mutex while SRCU waits for workqueue progress.
> +		 */
> +		synchronize_srcu_expedited(&kvm->irq_srcu);
>  	}
>  
>  	/*

Building on the discussion so far, it would be helpful from the SRCU
side to gather a bit more evidence to classify the issue.

Calling synchronize_srcu_expedited() while holding a mutex is generally
valid, so the observed behavior may be workload-dependent.

The reported deadlock seems to rely on the assumption that SRCU grace
period progress is indirectly blocked by irqfd workqueue saturation.
It would be good to confirm whether that assumption actually holds.

In particular:

1) Are SRCU GP kthreads/workers still making forward progress when
the system is stuck?

2) How many irqfd workers are active in the reported scenario, and
can they saturate CPU or worker pools?

3) Do we have a concrete wait-for cycle showing that tasks blocked
on resampler_lock are in turn preventing SRCU GP completion?

4) Is the behavior reproducible in both irqfd_resampler_shutdown()
and kvm_irqfd_assign() paths?

If SRCU GP remains independent, it would help distinguish whether
this is a strict deadlock or a form of workqueue starvation / lock
contention.

A timestamp-correlated dump (blocked stacks + workqueue state +
SRCU GP activity) would likely be sufficient to classify this.

Happy to help look at traces if available.

Thanx, Kunwu
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Sonam Sanju 5 days, 12 hours ago
From: Sonam Sanju <sonam.sanju@intel.com>

On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
> Building on the discussion so far, it would be helpful from the SRCU
> side to gather a bit more evidence to classify the issue.
>
> Calling synchronize_srcu_expedited() while holding a mutex is generally
> valid, so the observed behavior may be workload-dependent.

> The reported deadlock seems to rely on the assumption that SRCU grace
> period progress is indirectly blocked by irqfd workqueue saturation.
> It would be good to confirm whether that assumption actually holds.

I went back through our logs from two independent crash instances and
can now provide data for each of your questions.

> 1) Are SRCU GP kthreads/workers still making forward progress when
> the system is stuck?

No.  In both crash instances, process_srcu work items remain permanently
"pending" (never "in-flight") throughout the entire hang.

Instance 1 — kernel 6.18.8, pool 14 (cpus=3):

  [  62.712760] workqueue rcu_gp: flags=0x108
  [  62.717801]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
  [  62.717801]     pending: 2*process_srcu

  [  187.735092] workqueue rcu_gp: flags=0x108           (125 seconds later)
  [  187.735093]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
  [  187.735093]     pending: 2*process_srcu              (still pending)

  9 consecutive dumps from t=62s to t=312s — process_srcu never runs.

Instance 2 — kernel 6.18.2, pool 22 (cpus=5):

  [  93.280711] workqueue rcu_gp: flags=0x108
  [  93.280713]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
  [  93.280716]     pending: process_srcu

  [  309.040801] workqueue rcu_gp: flags=0x108           (216 seconds later)
  [  309.040806]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
  [  309.040806]     pending: process_srcu               (still pending)

  8 consecutive dumps from t=93s to t=341s — process_srcu never runs.

In both cases, rcu_gp's process_srcu is bound to the SAME per-CPU pool
where the kvm-irqfd-cleanup workers are blocked.  Both pools have idle
workers but are marked as hung/stalled:

  Instance 1: pool 14: cpus=3 hung=174s workers=11 idle: 4046 4038 4045 4039 4043 156 77 (7 idle)
  Instance 2: pool 22: cpus=5 hung=297s workers=12 idle: 4242 51 4248 4247 4245 435 4244 4239 (8 idle)

> 2) How many irqfd workers are active in the reported scenario, and
> can they saturate CPU or worker pools?

4 kvm-irqfd-cleanup workers in both instances, consistently across all
dumps:

Instance 1 (pool 14 / cpus=3):

  [  62.831877] workqueue kvm-irqfd-cleanup: flags=0x0
  [  62.837838]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=4 refcnt=5
  [  62.837838]     in-flight: 157:irqfd_shutdown ,4044:irqfd_shutdown ,
                               102:irqfd_shutdown ,39:irqfd_shutdown

Instance 2 (pool 22 / cpus=5):

  [  93.280894] workqueue kvm-irqfd-cleanup: flags=0x0
  [  93.280896]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=4 refcnt=5
  [  93.280900]     in-flight: 151:irqfd_shutdown ,4246:irqfd_shutdown ,
                               4241:irqfd_shutdown ,4243:irqfd_shutdown

These are from crosvm instances with multiple virtio devices
(virtio-blk, virtio-net, virtio-input, etc.), each registering an irqfd
with a resampler.  During VM shutdown, all irqfds are detached
concurrently, queueing that many irqfd_shutdown work items.

The 4 workers are not saturating CPU — they're all in D state.  But they
ARE all bound to the same per-CPU pool as rcu_gp's process_srcu work.

> 3) Do we have a concrete wait-for cycle showing that tasks blocked
> on resampler_lock are in turn preventing SRCU GP completion?

Yes, in both instances the hung task dump identifies the mutex holder
stuck in synchronize_srcu, with the other workers waiting on the mutex.

Instance 1 (t=314s):

  Worker pid 4044 — MUTEX HOLDER, stuck in synchronize_srcu:

    [  315.963979] task:kworker/3:8     state:D  pid:4044
    [  315.977125] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
    [  316.012504]  __synchronize_srcu+0x100/0x130
    [  316.023157]  irqfd_resampler_shutdown+0xf0/0x150  <-- offset 0xf0 (synchronize_srcu)

  Workers pid 39, 102, 157 — MUTEX WAITERS:

    [  314.793025] task:kworker/3:4     state:D  pid:157
    [  314.837472]  __mutex_lock+0x409/0xd90
    [  314.843100]  irqfd_resampler_shutdown+0x23/0x150  <-- offset 0x23 (mutex_lock)

Instance 2 (t=343s):

  Worker pid 4241 — MUTEX HOLDER, stuck in synchronize_srcu:

    [  343.193294] task:kworker/5:4     state:D  pid:4241
    [  343.193299] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
    [  343.193328]  __synchronize_srcu+0x100/0x130
    [  343.193335]  irqfd_resampler_shutdown+0xf0/0x150  <-- offset 0xf0 (synchronize_srcu)

  Workers pid 151, 4243, 4246 — MUTEX WAITERS:

    [  343.193369] task:kworker/5:6     state:D  pid:4243
    [  343.193397]  __mutex_lock+0x37d/0xbb0
    [  343.193397]  irqfd_resampler_shutdown+0x23/0x150  <-- offset 0x23 (mutex_lock)

Both instances show the identical wait-for cycle:

  1. One worker holds resampler_lock, blocks in __synchronize_srcu
     (waiting for SRCU grace period)
  2. SRCU GP needs process_srcu to run — but it stays "pending"
     on the same pool
  3. Other irqfd workers block on __mutex_lock in the same pool
  4. The pool is marked "hung" and no pending work makes progress
     for 250-300 seconds until kernel panic

> 4) Is the behavior reproducible in both irqfd_resampler_shutdown()
> and kvm_irqfd_assign() paths?

In our 4 crash instances, the stuck mutex holder is always in
irqfd_resampler_shutdown() at offset 0xf0 (synchronize_srcu).  This is
consistent: these are all VM shutdown scenarios where only
irqfd_shutdown workqueue items run.

The kvm_irqfd_assign() path was identified by Vineeth Pillai (Google)
during a VM create/destroy stress test where assign and shutdown race.
His traces showed kvm_irqfd (the assign path) stuck in
synchronize_srcu_expedited with irqfd_resampler_shutdown blocked on
the mutex, and workqueue pwq 46 at active=1024 refcnt=2062.

> If SRCU GP remains independent, it would help distinguish whether
> this is a strict deadlock or a form of workqueue starvation / lock
> contention.

Based on the data from both instances, SRCU GP is NOT remaining
independent.  process_srcu stays permanently pending on the affected
per-CPU pool for 250-300 seconds.  But it's not just process_srcu —
ALL pending work on the pool is stuck, including items from events,
cgroup, mm, slub, and other workqueues.


> A timestamp-correlated dump (blocked stacks + workqueue state +
> SRCU GP activity) would likely be sufficient to classify this.

I hope the correlated dumps above from both instances are helpful.
To summarize the timeline (consistent across both):

  t=0:   VM shutdown begins, crosvm detaches irqfds
  t=~14: 4 irqfd_shutdown work items queued on WQ_PERCPU pool
         One worker acquires resampler_lock, enters synchronize_srcu
         Other 3 workers block on __mutex_lock
  t=~43: First "BUG: workqueue lockup" — pool detected stuck
         rcu_gp: process_srcu shown as "pending" on same pool
  t=~93 through t=~312: Repeated dumps every ~30s
         process_srcu remains permanently "pending"
         Pool has idle workers but no pending work executes
  t=~314: Hung task dump confirms mutex holder in __synchronize_srcu
  t=~316: init triggers sysrq crash → kernel panic

> Happy to help look at traces if available.

I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
instances.  Shall I post them or send them off-list?

Thanks,
Sonam
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Kunwu Chan 12 hours ago
April 1, 2026 at 10:24 PM, "Sonam Sanju" <sonam.sanju@intel.corp-partner.google.com> wrote:

> 
> From: Sonam Sanju <sonam.sanju@intel.com>
> 
> On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
> 
> > 
> > Building on the discussion so far, it would be helpful from the SRCU
> >  side to gather a bit more evidence to classify the issue.
> > 
> >  Calling synchronize_srcu_expedited() while holding a mutex is generally
> >  valid, so the observed behavior may be workload-dependent.
> > 
> >  The reported deadlock seems to rely on the assumption that SRCU grace
> >  period progress is indirectly blocked by irqfd workqueue saturation.
> >  It would be good to confirm whether that assumption actually holds.
> > 
> I went back through our logs from two independent crash instances and
> can now provide data for each of your questions.
> 
> > 
> > 1) Are SRCU GP kthreads/workers still making forward progress when
> >  the system is stuck?
> > 
> No. In both crash instances, process_srcu work items remain permanently
> "pending" (never "in-flight") throughout the entire hang.
> 
> Instance 1 — kernel 6.18.8, pool 14 (cpus=3):
> 
>  [ 62.712760] workqueue rcu_gp: flags=0x108
>  [ 62.717801] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
>  [ 62.717801] pending: 2*process_srcu
> 
>  [ 187.735092] workqueue rcu_gp: flags=0x108 (125 seconds later)
>  [ 187.735093] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
>  [ 187.735093] pending: 2*process_srcu (still pending)
> 
>  9 consecutive dumps from t=62s to t=312s — process_srcu never runs.
> 
> Instance 2 — kernel 6.18.2, pool 22 (cpus=5):
> 
>  [ 93.280711] workqueue rcu_gp: flags=0x108
>  [ 93.280713] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
>  [ 93.280716] pending: process_srcu
> 
>  [ 309.040801] workqueue rcu_gp: flags=0x108 (216 seconds later)
>  [ 309.040806] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
>  [ 309.040806] pending: process_srcu (still pending)
> 
>  8 consecutive dumps from t=93s to t=341s — process_srcu never runs.
> 
> In both cases, rcu_gp's process_srcu is bound to the SAME per-CPU pool
> where the kvm-irqfd-cleanup workers are blocked. Both pools have idle
> workers but are marked as hung/stalled:
> 
>  Instance 1: pool 14: cpus=3 hung=174s workers=11 idle: 4046 4038 4045 4039 4043 156 77 (7 idle)
>  Instance 2: pool 22: cpus=5 hung=297s workers=12 idle: 4242 51 4248 4247 4245 435 4244 4239 (8 idle)
> 
> > 
> > 2) How many irqfd workers are active in the reported scenario, and
> >  can they saturate CPU or worker pools?
> > 
> 4 kvm-irqfd-cleanup workers in both instances, consistently across all
> dumps:
> 
> Instance 1 ( pool 14 / cpus=3):
> 
>  [ 62.831877] workqueue kvm-irqfd-cleanup: flags=0x0
>  [ 62.837838] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=4 refcnt=5
>  [ 62.837838] in-flight: 157:irqfd_shutdown ,4044:irqfd_shutdown ,
>  102:irqfd_shutdown ,39:irqfd_shutdown
> 
> Instance 2 ( pool 22 / cpus=5):
> 
>  [ 93.280894] workqueue kvm-irqfd-cleanup: flags=0x0
>  [ 93.280896] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=4 refcnt=5
>  [ 93.280900] in-flight: 151:irqfd_shutdown ,4246:irqfd_shutdown ,
>  4241:irqfd_shutdown ,4243:irqfd_shutdown
> 
> These are from crosvm instances with multiple virtio devices
> (virtio-blk, virtio-net, virtio-input, etc.), each registering an irqfd
> with a resampler. During VM shutdown, all irqfds are detached
> concurrently, queueing that many irqfd_shutdown work items.
> 
> The 4 workers are not saturating CPU — they're all in D state. But they
> ARE all bound to the same per-CPU pool as rcu_gp's process_srcu work.
> 
> > 
> > 3) Do we have a concrete wait-for cycle showing that tasks blocked
> >  on resampler_lock are in turn preventing SRCU GP completion?
> > 
> Yes, in both instances the hung task dump identifies the mutex holder
> stuck in synchronize_srcu, with the other workers waiting on the mutex.
> 
> Instance 1 (t=314s):
> 
>  Worker pid 4044 — MUTEX HOLDER, stuck in synchronize_srcu:
> 
>  [ 315.963979] task:kworker/3:8 state:D pid:4044
>  [ 315.977125] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
>  [ 316.012504] __synchronize_srcu+0x100/0x130
>  [ 316.023157] irqfd_resampler_shutdown+0xf0/0x150 <-- offset 0xf0 (synchronize_srcu)
> 
>  Workers pid 39, 102, 157 — MUTEX WAITERS:
> 
>  [ 314.793025] task:kworker/3:4 state:D pid:157
>  [ 314.837472] __mutex_lock+0x409/0xd90
>  [ 314.843100] irqfd_resampler_shutdown+0x23/0x150 <-- offset 0x23 (mutex_lock)
> 
> Instance 2 (t=343s):
> 
>  Worker pid 4241 — MUTEX HOLDER, stuck in synchronize_srcu:
> 
>  [ 343.193294] task:kworker/5:4 state:D pid:4241
>  [ 343.193299] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
>  [ 343.193328] __synchronize_srcu+0x100/0x130
>  [ 343.193335] irqfd_resampler_shutdown+0xf0/0x150 <-- offset 0xf0 (synchronize_srcu)
> 
>  Workers pid 151, 4243, 4246 — MUTEX WAITERS:
> 
>  [ 343.193369] task:kworker/5:6 state:D pid:4243
>  [ 343.193397] __mutex_lock+0x37d/0xbb0
>  [ 343.193397] irqfd_resampler_shutdown+0x23/0x150 <-- offset 0x23 (mutex_lock)
> 
> Both instances show the identical wait-for cycle:
> 
>  1. One worker holds resampler_lock, blocks in __synchronize_srcu
>  (waiting for SRCU grace period)
>  2. SRCU GP needs process_srcu to run — but it stays "pending"
>  on the same pool
>  3. Other irqfd workers block on __mutex_lock in the same pool
>  4. The pool is marked "hung" and no pending work makes progress
>  for 250-300 seconds until kernel panic
> 
> > 
> > 4) Is the behavior reproducible in both irqfd_resampler_shutdown()
> >  and kvm_irqfd_assign() paths?
> > 
In our 4 crash instances the stuck mutex holder is always in
irqfd_resampler_shutdown() at offset 0xf0 (synchronize_srcu). This is
consistent with these all being VM shutdown scenarios, where only
irqfd_shutdown workqueue items run.
> 
> The kvm_irqfd_assign() path was identified by Vineeth Pillai (Google)
> during a VM create/destroy stress test where assign and shutdown race.
> His traces showed kvm_irqfd (the assign path) stuck in
> synchronize_srcu_expedited with irqfd_resampler_shutdown blocked on
> the mutex, and workqueue pwq 46 at active=1024 refcnt=2062.
> 
> > 
> > If SRCU GP remains independent, it would help distinguish whether
> >  this is a strict deadlock or a form of workqueue starvation / lock
> >  contention.
> > 
> Based on the data from both instances, SRCU GP is NOT remaining
> independent. process_srcu stays permanently pending on the affected
> per-CPU pool for 250-300 seconds. But it's not just process_srcu —
> ALL pending work on the pool is stuck, including items from events,
> cgroup, mm, slub, and other workqueues.
> 
> > 
> > A timestamp-correlated dump (blocked stacks + workqueue state +
> >  SRCU GP activity) would likely be sufficient to classify this.
> > 
> I hope the correlated dumps above from both instances are helpful.
> To summarize the timeline (consistent across both):
> 
>  t=0: VM shutdown begins, crosvm detaches irqfds
>  t=~14: 4 irqfd_shutdown work items queued on WQ_PERCPU pool
>  One worker acquires resampler_lock, enters synchronize_srcu
>  Other 3 workers block on __mutex_lock
>  t=~43: First "BUG: workqueue lockup" — pool detected stuck
>  rcu_gp: process_srcu shown as "pending" on same pool
>  t=~93 through t=~312: repeated dumps every ~30s
>  process_srcu remains permanently "pending"
>  Pool has idle workers but no pending work executes
>  t=~314: Hung task dump confirms mutex holder in __synchronize_srcu
>  t=~316: init triggers sysrq crash → kernel panic
> 

Thanks, this is useful and much clearer.

One thing that is still unclear is dispatch behavior:
`process_srcu` stays pending for a long time, while the same pwq dump shows idle workers.

So the key question is: what prevents pending work from being dispatched on that pwq?
Is it due to:
  1) pwq stalled/hung state,
  2) worker availability/affinity constraints,
  3) or another dispatch-side condition?

Also, for scope:
- your crash instances consistently show the shutdown path
  (irqfd_resampler_shutdown + synchronize_srcu),
- while the assign-path evidence, as far as this thread's data goes,
  comes from a separate stress case.

A time-aligned dump with pwq state, pending/in-flight lists, and worker states
should help clarify this.


> > 
> > Happy to help look at traces if available.
> > 
> I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
> instances. Shall I post them or send them off-list?
> 

If possible, please post sanitized ramoops/dmesg logs on-list so others can validate.

Thanx, Kunwu

> Thanks,
> Sonam
>
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Sean Christopherson 6 days, 8 hours ago
+srcu folks

Please don't post subsequent versions In-Reply-To previous versions, it tends to
muck up tooling.

On Mon, Mar 23, 2026, Sonam Sanju wrote:
> irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
> synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
> This can deadlock when multiple irqfd workers run concurrently on the
> kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
> created and destroyed:
> 
>   CPU A (mutex holder)               CPU B/C/D (mutex waiters)
>   irqfd_shutdown()                   irqfd_shutdown() / kvm_irqfd_assign()
>    irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
>     mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock) //BLOCKED
>     list_del_rcu(...)                     ...blocked...
>     synchronize_srcu_expedited()      // Waiters block workqueue,
>       // waits for SRCU grace            preventing SRCU grace
>       // period which requires            period from completing
>       // workqueue progress          --- DEADLOCK ---
> 
> In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
> the else branch is called directly within the mutex.  In the if-last
> branch, kvm_unregister_irq_ack_notifier() also calls
> synchronize_srcu_expedited() internally.  In kvm_irqfd_assign(),
> synchronize_srcu_expedited() is called after list_add_rcu() but
> before mutex_unlock().  All paths can block indefinitely because:
> 
>   1. synchronize_srcu_expedited() waits for an SRCU grace period
>   2. SRCU grace period completion needs workqueue workers to run
>   3. The blocked mutex waiters occupy workqueue slots preventing progress

Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
underlying flaw.  Essentially, this would be establishing a rule that
synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
not viable.

>   4. The mutex holder never releases the lock -> deadlock
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Paul E. McKenney 6 days, 6 hours ago
On Tue, Mar 31, 2026 at 11:17:19AM -0700, Sean Christopherson wrote:
> +srcu folks
> 
> Please don't post subsequent versions In-Reply-To previous versions, it tends to
> muck up tooling.
> 
> On Mon, Mar 23, 2026, Sonam Sanju wrote:
> > irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
> > synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
> > This can deadlock when multiple irqfd workers run concurrently on the
> > kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
> > created and destroyed:
> > 
> >   CPU A (mutex holder)               CPU B/C/D (mutex waiters)
> >   irqfd_shutdown()                   irqfd_shutdown() / kvm_irqfd_assign()
> >    irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
> >     mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock) //BLOCKED
> >     list_del_rcu(...)                     ...blocked...
> >     synchronize_srcu_expedited()      // Waiters block workqueue,
> >       // waits for SRCU grace            preventing SRCU grace
> >       // period which requires            period from completing
> >       // workqueue progress          --- DEADLOCK ---
> > 
> > In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
> > the else branch is called directly within the mutex.  In the if-last
> > branch, kvm_unregister_irq_ack_notifier() also calls
> > synchronize_srcu_expedited() internally.  In kvm_irqfd_assign(),
> > synchronize_srcu_expedited() is called after list_add_rcu() but
> > before mutex_unlock().  All paths can block indefinitely because:
> > 
> >   1. synchronize_srcu_expedited() waits for an SRCU grace period
> >   2. SRCU grace period completion needs workqueue workers to run
> >   3. The blocked mutex waiters occupy workqueue slots preventing progress
> 
> Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
> underlying flaw.  Essentially, this would be establishing a rule that
> synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
> not viable.

First, it is OK to invoke synchronize_srcu_expedited() while holding
a mutex.  Second, the synchronize_srcu_expedited() function's use of
workqueues is the same as that of synchronize_srcu(), so in an alternate
universe where it was not OK to invoke synchronize_srcu_expedited() while
holding a mutex, it would also not be OK to invoke synchronize_srcu()
while holding that same mutex.  Third, it is also OK to acquire that
same mutex within a workqueue handler.  Fourth, SRCU and RCU use their
own workqueue, which no one else should be using (and that prohibition
most definitely includes the irqfd workers).

As a result, I do have to ask...  When you say "multiple irqfd workers",
exactly how many such workers are you running?

							Thanx, Paul

> >   4. The mutex holder never releases the lock -> deadlock
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Paul E. McKenney 4 hours ago
On Tue, Mar 31, 2026 at 01:51:11PM -0700, Paul E. McKenney wrote:
> On Tue, Mar 31, 2026 at 11:17:19AM -0700, Sean Christopherson wrote:
> > +srcu folks

[ . . . ]

> > Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
> > underlying flaw.  Essentially, this would be establishing a rule that
> > synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
> > not viable.
> 
> First, it is OK to invoke synchronize_srcu_expedited() while holding
> a mutex.  Second, the synchronize_srcu_expedited() function's use of
> workqueues is the same as that of synchronize_srcu(), so in an alternate
> universe where it was not OK to invoke synchronize_srcu_expedited() while
> holding a mutex, it would also not be OK to invoke synchronize_srcu()
> while holding that same mutex.  Third, it is also OK to acquire that
> same mutex within a workqueue handler.  Fourth, SRCU and RCU use their
> own workqueue, which no one else should be using (and that prohibition
> most definitely includes the irqfd workers).
> 
> As a result, I do have to ask...  When you say "multiple irqfd workers",
> exactly how many such workers are you running?

Just to be clear, I am guessing that you have the workqueues counterpart
to a fork bomb.  However, if you are using a small finite number of
workqueue handlers, then we need to make adjustments in SRCU, workqueues,
or maybe SRCU's use of workqueues.

So if my fork-bomb guess is incorrect, please let me know.

							Thanx, Paul

> > >   4. The mutex holder never releases the lock -> deadlock
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Sonam Sanju 5 days, 17 hours ago
From: Sonam Sanju <sonam.sanju@intel.com>

On Tue, Mar 31, 2026 at 01:51:00PM -0700, Paul E. McKenney wrote:
> On Tue, Mar 31, 2026 at 11:17:19AM -0700, Sean Christopherson wrote:
> > Please don't post subsequent versions In-Reply-To previous versions, it tends to
> > muck up tooling.

Noted, will send future versions as new top-level threads. Sorry about
that.

> > Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
> > underlying flaw.  Essentially, this would be establishing a rule that
> > synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
> > not viable.
>
> First, it is OK to invoke synchronize_srcu_expedited() while holding
> a mutex.  Second, the synchronize_srcu_expedited() function's use of
> workqueues is the same as that of synchronize_srcu(), so in an alternate
> universe where it was not OK to invoke synchronize_srcu_expedited() while
> holding a mutex, it would also not be OK to invoke synchronize_srcu()
> while holding that same mutex.  Third, it is also OK to acquire that
> same mutex within a workqueue handler.  Fourth, SRCU and RCU use their
> own workqueue, which no one else should be using (and that prohibition
> most definitely includes the irqfd workers).

Thank you for clarifying this. 

> As a result, I do have to ask...  When you say "multiple irqfd workers",
> exactly how many such workers are you running?

While running cold/warm reboot cycling on our Android platforms with a
6.18 kernel, the hung_task traces consistently show 8-15
kvm-irqfd-cleanup workers in D state.  These are crosvm instances with
roughly 10-16 irqfd lines per VM (virtio-blk, virtio-net, virtio-input,
virtio-snd, etc., each with a resampler).

Vineeth Pillai (Google) reproduced a related scenario under a VM
create/destroy stress test where the workqueue reached active=1024
refcnt=2062, though that is a much more extreme case than what we see
during normal shutdown.

The first half of the deadlock cycle is definitely present: one worker
holds resampler_lock and blocks in synchronize_srcu_expedited(), while
the remaining 8-15 workers block in __mutex_lock() at
irqfd_resampler_shutdown().

Thanks,
Sonam