[PATCH] KVM: irqfd: fix shutdown deadlock by moving SRCU sync outside resampler_lock

Posted by Sonam Sanju 3 weeks ago
irqfd_resampler_shutdown() calls synchronize_srcu_expedited() while
holding kvm->irqfds.resampler_lock.  This can deadlock when multiple
irqfd_shutdown workers run concurrently on the kvm-irqfd-cleanup
workqueue during VM teardown (e.g. crosvm shutdown on Android):

  CPU A (mutex holder)               CPU B/C/D (mutex waiters)
  irqfd_shutdown()                   irqfd_shutdown()
   irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
    mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock)  // BLOCKED
    list_del_rcu(...)                     ...blocked...
    synchronize_srcu_expedited()      // Waiters block workqueue,
      // waits for SRCU grace            preventing SRCU grace
      // period which requires            period from completing
      // workqueue progress          --- DEADLOCK ---

The synchronize_srcu_expedited() in the else branch is called directly
while the mutex is held.  In the last-irqfd branch,
kvm_unregister_irq_ack_notifier() also calls synchronize_srcu_expedited()
internally.  Both paths can block indefinitely because:

  1. synchronize_srcu_expedited() waits for an SRCU grace period
  2. SRCU grace period completion needs workqueue workers to run
  3. The blocked mutex waiters occupy workqueue slots, preventing
     progress
  4. The mutex holder never releases the lock -> deadlock

Fix by performing all list manipulations and the last-entry check under
the mutex, then releasing the mutex before the SRCU synchronization.
This is safe because:

  - list_del_rcu() removes the irqfd from resampler->list under the
    mutex, so no concurrent reader or writer can access it.
  - When last==true, list_del_rcu(&resampler->link) has already removed
    the resampler from kvm->irqfds.resampler_list under the mutex, so
    no other worker can find or operate on this resampler.
  - kvm_unregister_irq_ack_notifier() uses its own locking
    (kvm->irq_lock) and is safe to call without resampler_lock.
  - synchronize_srcu_expedited() does not require any KVM mutex.
  - kfree(resampler) is safe after SRCU sync guarantees all readers
    have finished.

Signed-off-by: Sonam Sanju <sonam.sanju@intel.com>
---
 virt/kvm/eventfd.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 0e8b8a2c5b79..27bcf2b1a81d 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -93,6 +93,7 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
 {
 	struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
 	struct kvm *kvm = resampler->kvm;
+	bool last = false;
 
 	mutex_lock(&kvm->irqfds.resampler_lock);
 
@@ -100,19 +101,27 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
 
 	if (list_empty(&resampler->list)) {
 		list_del_rcu(&resampler->link);
+		last = true;
+	}
+
+	mutex_unlock(&kvm->irqfds.resampler_lock);
+
+	/*
+	 * synchronize_srcu_expedited() (called explicitly below, or internally
+	 * by kvm_unregister_irq_ack_notifier()) must not be invoked under
+	 * resampler_lock.  Holding the mutex while waiting for an SRCU grace
+	 * period creates a deadlock: the blocked mutex waiters occupy workqueue
+	 * slots that the SRCU grace period machinery needs to make forward
+	 * progress.
+	 */
+	if (last) {
 		kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
-		/*
-		 * synchronize_srcu_expedited(&kvm->irq_srcu) already called
-		 * in kvm_unregister_irq_ack_notifier().
-		 */
 		kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
 			    resampler->notifier.gsi, 0, false);
 		kfree(resampler);
 	} else {
 		synchronize_srcu_expedited(&kvm->irq_srcu);
 	}
-
-	mutex_unlock(&kvm->irqfds.resampler_lock);
 }
 
 /*
-- 
2.34.1
Re: [PATCH] KVM: irqfd: fix shutdown deadlock by moving SRCU sync outside resampler_lock
Posted by Sonam Sanju 2 weeks, 6 days ago
From: Sonam Sanju <sonam.sanju@intel.com>

On Mon, Mar 16, 2026 at 12:50:26PM +0530, Sonam Sanju wrote:
> From: Sonam Sanju <sonam.sanju@intel.com>
> 
> irqfd_resampler_shutdown() calls synchronize_srcu_expedited() while
> holding kvm->irqfds.resampler_lock.  This can deadlock when multiple

Adding Sean Christopherson to CC (apologies for the omission in the original 
submission).

--
Sonam
Re: [PATCH] KVM: irqfd: fix shutdown deadlock by moving SRCU sync outside resampler_lock
Posted by Vineeth Pillai (Google) 2 weeks, 3 days ago
Hi Sonam,

> irqfd_resampler_shutdown() calls synchronize_srcu_expedited() while
> holding kvm->irqfds.resampler_lock.  This can deadlock when multiple
> irqfd_shutdown workers run concurrently on the kvm-irqfd-cleanup
> workqueue during VM teardown (e.g. crosvm shutdown on Android):
> 
>   CPU A (mutex holder)               CPU B/C/D (mutex waiters)
>   irqfd_shutdown()                   irqfd_shutdown()
>    irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
>     mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock)  // BLOCKED
>     list_del_rcu(...)                     ...blocked...
>     synchronize_srcu_expedited()      // Waiters block workqueue,
>       // waits for SRCU grace            preventing SRCU grace
>       // period which requires            period from completing
>       // workqueue progress          --- DEADLOCK ---

I think we might have this issue in the kvm_irqfd_assign path as well
where synchronize_srcu_expedited is called with the resampler_lock
held.  I saw a similar lockup during a stress test where VMs were created
and destroyed continuously.  I could see one task waiting on an SRCU GP:

[   T93] task:crosvm_security state:D stack:0     pid:8215  tgid:8215  ppid:1      task_flags:0x400000 flags:0x00080002.
[   T93] Call Trace:
[   T93]  <TASK>
[   T93]  __schedule+0x87a/0xd60
[   T93]  schedule+0x5e/0xe0
[   T93]  schedule_timeout+0x2e/0x130
[   T93]  ? queue_delayed_work_on+0x7f/0xd0
[   T93]  wait_for_common+0xf7/0x1f0
[   T93]  synchronize_srcu_expedited+0x109/0x140
[   T93]  ? __cfi_wakeme_after_rcu+0x10/0x10
[   T93]  kvm_irqfd+0x362/0x5e0
[   T93]  kvm_vm_ioctl+0x706/0x780
[   T93]  ? fd_install+0x2c/0xf0
[   T93]  __se_sys_ioctl+0x7a/0xd0
[   T93]  do_syscall_64+0x61/0xf10
[   T93]  ? arch_exit_to_user_mode_prepare+0x9/0xb0
[   T93]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   T93] RIP: 0033:0x79048f9bdd67
[   T93] RSP: 002b:00007ffc3aa82028 EFLAGS: 00000206.

And another task waiting on the mutex:

[    C0] task:kworker/11:2    state:R  running task     stack:0     pid:25180 tgid:25180 ppid:2      task_flags:0x4208060 flags:0x00080000
[    C0] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
[    C0] Call Trace:
[    C0]  <TASK>
[    C0]  __schedule+0x87a/0xd60
[    C0]  schedule+0x5e/0xe0
[    C0]  schedule_preempt_disabled+0x10/0x20
[    C0]  __mutex_lock+0x413/0xe40
[    C0]  irqfd_resampler_shutdown+0x23/0x150
[    C0]  irqfd_shutdown+0x66/0xc0
[    C0]  process_scheduled_works+0x219/0x450
[    C0]  worker_thread+0x30b/0x450
[    C0]  ? __cfi_worker_thread+0x10/0x10
[    C0]  kthread+0x230/0x270
[    C0]  ? __cfi_kthread+0x10/0x10
[    C0]  ret_from_fork+0xf2/0x150
[    C0]  ? __cfi_kthread+0x10/0x10
[    C0]  ret_from_fork_asm+0x1a/0x30
[    C0]  </TASK>

The workqueue was full as well, I think:

[    C0]   pwq 46: cpus=11 node=0 flags=0x0 nice=0 active=1024 refcnt=2062

There were other tasks waiting for SRCU GP completion in the resampler
shutdown path. Also, there were other traces showing lockups (mostly in
mm), but I think that's a secondary effect of this lockup and might not
be relevant. I can provide the full logs if needed.

Please have a look and see if this path needs to be handled to fully fix
this issue.

Thanks,
Vineeth
Re: [PATCH] KVM: irqfd: fix shutdown deadlock by moving SRCU sync outside resampler_lock
Posted by Sonam Sanju 2 weeks ago
On Fri, Mar 20, 2026 at 08:56:33AM -0400, Vineeth Pillai (Google) wrote:
> I think we might have this issue in the kvm_irqfd_assign path as well
> where synchronize_srcu_expedited is called with the resampler_lock
> held.  I saw a similar lockup during a stress test where VMs were created
> and destroyed continuously.  I could see one task waiting on an SRCU GP:
> 
> [   T93] task:crosvm_security state:D stack:0     pid:8215  tgid:8215  ppid:1      task_flags:0x400000 flags:0x00080002.
> [   T93] Call Trace:
> [   T93]  synchronize_srcu_expedited+0x109/0x140
> [   T93]  kvm_irqfd+0x362/0x5e0
> [   T93]  kvm_vm_ioctl+0x706/0x780
> 
> And another task waiting on the mutex:
> 
> [    C0] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
> [    C0]  __mutex_lock+0x413/0xe40
> [    C0]  irqfd_resampler_shutdown+0x23/0x150
> [    C0]  irqfd_shutdown+0x66/0xc0
> 
> The work queue was full as well I think:
> 
> [    C0]   pwq 46: cpus=11 node=0 flags=0x0 nice=0 active=1024 refcnt=2062

Yes, you are right.  The kvm_irqfd_assign() path has the same deadlock pattern.

> There were other tasks waiting for SRCU GP completion in the resampler
> shutdown path. Also, there were other traces showing lockups (mostly in
> mm), but I think that's a secondary effect of this lockup and might not
> be relevant.

Yes, that matches what we see on our side as well — the primary deadlock
in the KVM irqfd paths causes cascading failures: workqueue starvation
leads to blocked do_sync_work (superblock sync), fsnotify workers stuck
on __synchronize_srcu, and eventually init (pid 1) blocks in
ext4_put_super -> __flush_work.  The mm lockups you see are almost
certainly secondary effects.

Will send v2 shortly with both paths fixed in a single patch.

-- 
Sonam Sanju
[PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Sonam Sanju 2 weeks ago
irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
This can deadlock when multiple irqfd workers run concurrently on the
kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
created and destroyed:

  CPU A (mutex holder)               CPU B/C/D (mutex waiters)
  irqfd_shutdown()                   irqfd_shutdown() / kvm_irqfd_assign()
   irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
    mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock) //BLOCKED
    list_del_rcu(...)                     ...blocked...
    synchronize_srcu_expedited()      // Waiters block workqueue,
      // waits for SRCU grace            preventing SRCU grace
      // period which requires            period from completing
      // workqueue progress          --- DEADLOCK ---

In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
the else branch is called directly while the mutex is held.  In the
last-irqfd branch, kvm_unregister_irq_ack_notifier() also calls
synchronize_srcu_expedited() internally.  In kvm_irqfd_assign(),
synchronize_srcu_expedited() is called after list_add_rcu() but
before mutex_unlock().  All paths can block indefinitely because:

  1. synchronize_srcu_expedited() waits for an SRCU grace period
  2. SRCU grace period completion needs workqueue workers to run
  3. The blocked mutex waiters occupy workqueue slots preventing progress
  4. The mutex holder never releases the lock -> deadlock

Fix both paths by releasing the mutex before calling
synchronize_srcu_expedited().

In irqfd_resampler_shutdown(), use a bool last flag to track whether
this is the final irqfd for the resampler, then release the mutex
before the SRCU synchronization.  This is safe because list_del_rcu()
already removed the entries under the mutex, and
kvm_unregister_irq_ack_notifier() uses its own locking (kvm->irq_lock).

In kvm_irqfd_assign(), simply move synchronize_srcu_expedited() after
mutex_unlock().  The SRCU grace period still completes before the irqfd
goes live (the subsequent srcu_read_lock() ensures ordering).

Signed-off-by: Sonam Sanju <sonam.sanju@intel.com>
---
v2:
 - Fix the same deadlock in kvm_irqfd_assign() (Vineeth Pillai)

 virt/kvm/eventfd.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 0e8b8a2c5b79..8ae9f81f8bb3 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -93,6 +93,7 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
 {
 	struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
 	struct kvm *kvm = resampler->kvm;
+	bool last = false;
 
 	mutex_lock(&kvm->irqfds.resampler_lock);
 
@@ -100,19 +101,27 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
 
 	if (list_empty(&resampler->list)) {
 		list_del_rcu(&resampler->link);
+		last = true;
+	}
+
+	mutex_unlock(&kvm->irqfds.resampler_lock);
+
+	/*
+	 * synchronize_srcu_expedited() (called explicitly below, or internally
+	 * by kvm_unregister_irq_ack_notifier()) must not be invoked under
+	 * resampler_lock.  Holding the mutex while waiting for an SRCU grace
+	 * period creates a deadlock: the blocked mutex waiters occupy workqueue
+	 * slots that the SRCU grace period machinery needs to make forward
+	 * progress.
+	 */
+	if (last) {
 		kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
-		/*
-		 * synchronize_srcu_expedited(&kvm->irq_srcu) already called
-		 * in kvm_unregister_irq_ack_notifier().
-		 */
 		kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
 			    resampler->notifier.gsi, 0, false);
 		kfree(resampler);
 	} else {
 		synchronize_srcu_expedited(&kvm->irq_srcu);
 	}
-
-	mutex_unlock(&kvm->irqfds.resampler_lock);
 }
 
 /*
@@ -450,9 +459,16 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 		}
 
 		list_add_rcu(&irqfd->resampler_link, &irqfd->resampler->list);
-		synchronize_srcu_expedited(&kvm->irq_srcu);
 
 		mutex_unlock(&kvm->irqfds.resampler_lock);
+
+		/*
+		 * Ensure the resampler_link is SRCU-visible before the irqfd
+		 * itself goes live.  Moving synchronize_srcu_expedited() outside
+		 * the resampler_lock avoids deadlock with shutdown workers waiting
+		 * for the mutex while SRCU waits for workqueue progress.
+		 */
+		synchronize_srcu_expedited(&kvm->irq_srcu);
 	}
 
 	/*
-- 
2.34.1
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Kunwu Chan 5 days, 17 hours ago
On 3/23/26 14:42, Sonam Sanju wrote:
> irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
> synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
> This can deadlock when multiple irqfd workers run concurrently on the
> kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
> created and destroyed:
>
>   CPU A (mutex holder)               CPU B/C/D (mutex waiters)
>   irqfd_shutdown()                   irqfd_shutdown() / kvm_irqfd_assign()
>    irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
>     mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock) //BLOCKED
>     list_del_rcu(...)                     ...blocked...
>     synchronize_srcu_expedited()      // Waiters block workqueue,
>       // waits for SRCU grace            preventing SRCU grace
>       // period which requires            period from completing
>       // workqueue progress          --- DEADLOCK ---
>
> In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
> the else branch is called directly while the mutex is held.  In the
> last-irqfd branch, kvm_unregister_irq_ack_notifier() also calls
> synchronize_srcu_expedited() internally.  In kvm_irqfd_assign(),
> synchronize_srcu_expedited() is called after list_add_rcu() but
> before mutex_unlock().  All paths can block indefinitely because:
>
>   1. synchronize_srcu_expedited() waits for an SRCU grace period
>   2. SRCU grace period completion needs workqueue workers to run
>   3. The blocked mutex waiters occupy workqueue slots preventing progress
>   4. The mutex holder never releases the lock -> deadlock
>
> Fix both paths by releasing the mutex before calling
> synchronize_srcu_expedited().
>
> In irqfd_resampler_shutdown(), use a bool last flag to track whether
> this is the final irqfd for the resampler, then release the mutex
> before the SRCU synchronization.  This is safe because list_del_rcu()
> already removed the entries under the mutex, and
> kvm_unregister_irq_ack_notifier() uses its own locking (kvm->irq_lock).
>
> In kvm_irqfd_assign(), simply move synchronize_srcu_expedited() after
> mutex_unlock().  The SRCU grace period still completes before the irqfd
> goes live (the subsequent srcu_read_lock() ensures ordering).
>
> Signed-off-by: Sonam Sanju <sonam.sanju@intel.com>
> ---
> v2:
>  - Fix the same deadlock in kvm_irqfd_assign() (Vineeth Pillai)
>
>  virt/kvm/eventfd.c | 30 +++++++++++++++++++++++-------
>  1 file changed, 23 insertions(+), 7 deletions(-)
>
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 0e8b8a2c5b79..8ae9f81f8bb3 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -93,6 +93,7 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
>  {
>  	struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
>  	struct kvm *kvm = resampler->kvm;
> +	bool last = false;
>  
>  	mutex_lock(&kvm->irqfds.resampler_lock);
>  
> @@ -100,19 +101,27 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
>  
>  	if (list_empty(&resampler->list)) {
>  		list_del_rcu(&resampler->link);
> +		last = true;
> +	}
> +
> +	mutex_unlock(&kvm->irqfds.resampler_lock);
> +
> +	/*
> +	 * synchronize_srcu_expedited() (called explicitly below, or internally
> +	 * by kvm_unregister_irq_ack_notifier()) must not be invoked under
> +	 * resampler_lock.  Holding the mutex while waiting for an SRCU grace
> +	 * period creates a deadlock: the blocked mutex waiters occupy workqueue
> +	 * slots that the SRCU grace period machinery needs to make forward
> +	 * progress.
> +	 */
> +	if (last) {
>  		kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
> -		/*
> -		 * synchronize_srcu_expedited(&kvm->irq_srcu) already called
> -		 * in kvm_unregister_irq_ack_notifier().
> -		 */
>  		kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
>  			    resampler->notifier.gsi, 0, false);
>  		kfree(resampler);
>  	} else {
>  		synchronize_srcu_expedited(&kvm->irq_srcu);
>  	}
> -
> -	mutex_unlock(&kvm->irqfds.resampler_lock);
>  }
>  
>  /*
> @@ -450,9 +459,16 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
>  		}
>  
>  		list_add_rcu(&irqfd->resampler_link, &irqfd->resampler->list);
> -		synchronize_srcu_expedited(&kvm->irq_srcu);
>  
>  		mutex_unlock(&kvm->irqfds.resampler_lock);
> +
> +		/*
> +		 * Ensure the resampler_link is SRCU-visible before the irqfd
> +		 * itself goes live.  Moving synchronize_srcu_expedited() outside
> +		 * the resampler_lock avoids deadlock with shutdown workers waiting
> +		 * for the mutex while SRCU waits for workqueue progress.
> +		 */
> +		synchronize_srcu_expedited(&kvm->irq_srcu);
>  	}
>  
>  	/*

Building on the discussion so far, it would be helpful from the SRCU
side to gather a bit more evidence to classify the issue.

Calling synchronize_srcu_expedited() while holding a mutex is generally
valid, so the observed behavior may be workload-dependent.

The reported deadlock seems to rely on the assumption that SRCU grace
period progress is indirectly blocked by irqfd workqueue saturation.
It would be good to confirm whether that assumption actually holds.

In particular:

1) Are SRCU GP kthreads/workers still making forward progress when
the system is stuck?

2) How many irqfd workers are active in the reported scenario, and
can they saturate CPU or worker pools?

3) Do we have a concrete wait-for cycle showing that tasks blocked
on resampler_lock are in turn preventing SRCU GP completion?

4) Is the behavior reproducible in both irqfd_resampler_shutdown()
and kvm_irqfd_assign() paths?

If SRCU GP remains independent, it would help distinguish whether
this is a strict deadlock or a form of workqueue starvation / lock
contention.

A timestamp-correlated dump (blocked stacks + workqueue state +
SRCU GP activity) would likely be sufficient to classify this.

Happy to help look at traces if available.

Thanx, Kunwu
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Sonam Sanju 5 days, 12 hours ago
From: Sonam Sanju <sonam.sanju@intel.com>

On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
> Building on the discussion so far, it would be helpful from the SRCU
> side to gather a bit more evidence to classify the issue.
>
> Calling synchronize_srcu_expedited() while holding a mutex is generally
> valid, so the observed behavior may be workload-dependent.

> The reported deadlock seems to rely on the assumption that SRCU grace
> period progress is indirectly blocked by irqfd workqueue saturation.
> It would be good to confirm whether that assumption actually holds.

I went back through our logs from two independent crash instances and
can now provide data for each of your questions.

> 1) Are SRCU GP kthreads/workers still making forward progress when
> the system is stuck?

No.  In both crash instances, process_srcu work items remain permanently
"pending" (never "in-flight") throughout the entire hang.

Instance 1 — kernel 6.18.8, pool 14 (cpus=3):

  [  62.712760] workqueue rcu_gp: flags=0x108
  [  62.717801]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
  [  62.717801]     pending: 2*process_srcu

  [  187.735092] workqueue rcu_gp: flags=0x108           (125 seconds later)
  [  187.735093]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
  [  187.735093]     pending: 2*process_srcu              (still pending)

  9 consecutive dumps from t=62s to t=312s — process_srcu never runs.

Instance 2 — kernel 6.18.2, pool 22 (cpus=5):

  [  93.280711] workqueue rcu_gp: flags=0x108
  [  93.280713]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
  [  93.280716]     pending: process_srcu

  [  309.040801] workqueue rcu_gp: flags=0x108           (216 seconds later)
  [  309.040806]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
  [  309.040806]     pending: process_srcu               (still pending)

  8 consecutive dumps from t=93s to t=341s — process_srcu never runs.

In both cases, rcu_gp's process_srcu is bound to the SAME per-CPU pool
where the kvm-irqfd-cleanup workers are blocked.  Both pools have idle
workers but are marked as hung/stalled:

  Instance 1: pool 14: cpus=3 hung=174s workers=11 idle: 4046 4038 4045 4039 4043 156 77 (7 idle)
  Instance 2: pool 22: cpus=5 hung=297s workers=12 idle: 4242 51 4248 4247 4245 435 4244 4239 (8 idle)

> 2) How many irqfd workers are active in the reported scenario, and
> can they saturate CPU or worker pools?

4 kvm-irqfd-cleanup workers in both instances, consistently across all
dumps:

Instance 1 (pool 14 / cpus=3):

  [  62.831877] workqueue kvm-irqfd-cleanup: flags=0x0
  [  62.837838]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=4 refcnt=5
  [  62.837838]     in-flight: 157:irqfd_shutdown ,4044:irqfd_shutdown ,
                               102:irqfd_shutdown ,39:irqfd_shutdown

Instance 2 (pool 22 / cpus=5):

  [  93.280894] workqueue kvm-irqfd-cleanup: flags=0x0
  [  93.280896]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=4 refcnt=5
  [  93.280900]     in-flight: 151:irqfd_shutdown ,4246:irqfd_shutdown ,
                               4241:irqfd_shutdown ,4243:irqfd_shutdown

These are from crosvm instances with multiple virtio devices
(virtio-blk, virtio-net, virtio-input, etc.), each registering an irqfd
with a resampler.  During VM shutdown, all irqfds are detached
concurrently, queueing that many irqfd_shutdown work items.

The 4 workers are not saturating CPU — they're all in D state.  But they
ARE all bound to the same per-CPU pool as rcu_gp's process_srcu work.

> 3) Do we have a concrete wait-for cycle showing that tasks blocked
> on resampler_lock are in turn preventing SRCU GP completion?

Yes, in both instances the hung task dump identifies the mutex holder
stuck in synchronize_srcu, with the other workers waiting on the mutex.

Instance 1 (t=314s):

  Worker pid 4044 — MUTEX HOLDER, stuck in synchronize_srcu:

    [  315.963979] task:kworker/3:8     state:D  pid:4044
    [  315.977125] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
    [  316.012504]  __synchronize_srcu+0x100/0x130
    [  316.023157]  irqfd_resampler_shutdown+0xf0/0x150  <-- offset 0xf0 (synchronize_srcu)

  Workers pid 39, 102, 157 — MUTEX WAITERS:

    [  314.793025] task:kworker/3:4     state:D  pid:157
    [  314.837472]  __mutex_lock+0x409/0xd90
    [  314.843100]  irqfd_resampler_shutdown+0x23/0x150  <-- offset 0x23 (mutex_lock)

Instance 2 (t=343s):

  Worker pid 4241 — MUTEX HOLDER, stuck in synchronize_srcu:

    [  343.193294] task:kworker/5:4     state:D  pid:4241
    [  343.193299] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
    [  343.193328]  __synchronize_srcu+0x100/0x130
    [  343.193335]  irqfd_resampler_shutdown+0xf0/0x150  <-- offset 0xf0 (synchronize_srcu)

  Workers pid 151, 4243, 4246 — MUTEX WAITERS:

    [  343.193369] task:kworker/5:6     state:D  pid:4243
    [  343.193397]  __mutex_lock+0x37d/0xbb0
    [  343.193397]  irqfd_resampler_shutdown+0x23/0x150  <-- offset 0x23 (mutex_lock)

Both instances show the identical wait-for cycle:

  1. One worker holds resampler_lock, blocks in __synchronize_srcu
     (waiting for SRCU grace period)
  2. SRCU GP needs process_srcu to run — but it stays "pending"
     on the same pool
  3. Other irqfd workers block on __mutex_lock in the same pool
  4. The pool is marked "hung" and no pending work makes progress
     for 250-300 seconds until kernel panic

> 4) Is the behavior reproducible in both irqfd_resampler_shutdown()
> and kvm_irqfd_assign() paths?

In our 4 crash instances, the stuck mutex holder is always in
irqfd_resampler_shutdown() at offset 0xf0 (synchronize_srcu).  This is
consistent: these are all VM shutdown scenarios where only
irqfd_shutdown workqueue items run.

The kvm_irqfd_assign() path was identified by Vineeth Pillai (Google)
during a VM create/destroy stress test where assign and shutdown race.
His traces showed kvm_irqfd (the assign path) stuck in
synchronize_srcu_expedited with irqfd_resampler_shutdown blocked on
the mutex, and workqueue pwq 46 at active=1024 refcnt=2062.

> If SRCU GP remains independent, it would help distinguish whether
> this is a strict deadlock or a form of workqueue starvation / lock
> contention.

Based on the data from both instances, SRCU GP is NOT remaining
independent.  process_srcu stays permanently pending on the affected
per-CPU pool for 250-300 seconds.  But it's not just process_srcu —
ALL pending work on the pool is stuck, including items from events,
cgroup, mm, slub, and other workqueues.


> A timestamp-correlated dump (blocked stacks + workqueue state +
> SRCU GP activity) would likely be sufficient to classify this.

I hope the correlated dumps above from both instances are helpful.
To summarize the timeline (consistent across both):

  t=0:   VM shutdown begins, crosvm detaches irqfds
  t=~14: 4 irqfd_shutdown work items queued on WQ_PERCPU pool
         One worker acquires resampler_lock, enters synchronize_srcu
         Other 3 workers block on __mutex_lock
  t=~43: First "BUG: workqueue lockup" — pool detected stuck
         rcu_gp: process_srcu shown as "pending" on same pool
  t=~93 through t=~312: Repeated dumps every ~30s
         process_srcu remains permanently "pending"
         Pool has idle workers but no pending work executes
  t=~314: Hung task dump confirms mutex holder in __synchronize_srcu
  t=~316: init triggers sysrq crash → kernel panic

> Happy to help look at traces if available.

I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
instances.  Shall I post them or send them off-list?

Thanks,
Sonam
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Kunwu Chan 12 hours ago
April 1, 2026 at 10:24 PM, "Sonam Sanju" <sonam.sanju@intel.corp-partner.google.com> wrote:

> 
> From: Sonam Sanju <sonam.sanju@intel.com>
> 
> On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
> 
> > 
> > Building on the discussion so far, it would be helpful from the SRCU
> >  side to gather a bit more evidence to classify the issue.
> > 
> >  Calling synchronize_srcu_expedited() while holding a mutex is generally
> >  valid, so the observed behavior may be workload-dependent.
> > 
> >  The reported deadlock seems to rely on the assumption that SRCU grace
> >  period progress is indirectly blocked by irqfd workqueue saturation.
> >  It would be good to confirm whether that assumption actually holds.
> > 
> I went back through our logs from two independent crash instances and
> can now provide data for each of your questions.
> 
> > 
> > 1) Are SRCU GP kthreads/workers still making forward progress when
> >  the system is stuck?
> > 
> No. In both crash instances, process_srcu work items remain permanently
> "pending" (never "in-flight") throughout the entire hang.
> 
> Instance 1 — kernel 6.18.8, pool 14 (cpus=3):
> 
>  [ 62.712760] workqueue rcu_gp: flags=0x108
>  [ 62.717801] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
>  [ 62.717801] pending: 2*process_srcu
> 
>  [ 187.735092] workqueue rcu_gp: flags=0x108 (125 seconds later)
>  [ 187.735093] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
>  [ 187.735093] pending: 2*process_srcu (still pending)
> 
>  9 consecutive dumps from t=62s to t=312s — process_srcu never runs.
> 
> Instance 2 — kernel 6.18.2, pool 22 (cpus=5):
> 
>  [ 93.280711] workqueue rcu_gp: flags=0x108
>  [ 93.280713] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
>  [ 93.280716] pending: process_srcu
> 
>  [ 309.040801] workqueue rcu_gp: flags=0x108 (216 seconds later)
>  [ 309.040806] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
>  [ 309.040806] pending: process_srcu (still pending)
> 
>  8 consecutive dumps from t=93s to t=341s — process_srcu never runs.
> 
> In both cases, rcu_gp's process_srcu is bound to the SAME per-CPU pool
> where the kvm-irqfd-cleanup workers are blocked. Both pools have idle
> workers but are marked as hung/stalled:
> 
>  Instance 1: pool 14: cpus=3 hung=174s workers=11 idle: 4046 4038 4045 4039 4043 156 77 (7 idle)
>  Instance 2: pool 22: cpus=5 hung=297s workers=12 idle: 4242 51 4248 4247 4245 435 4244 4239 (8 idle)
> 
> > 
> > 2) How many irqfd workers are active in the reported scenario, and
> >  can they saturate CPU or worker pools?
> > 
> 4 kvm-irqfd-cleanup workers in both instances, consistently across all
> dumps:
> 
> Instance 1 ( pool 14 / cpus=3):
> 
>  [ 62.831877] workqueue kvm-irqfd-cleanup: flags=0x0
>  [ 62.837838] pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=4 refcnt=5
>  [ 62.837838] in-flight: 157:irqfd_shutdown ,4044:irqfd_shutdown ,
>  102:irqfd_shutdown ,39:irqfd_shutdown
> 
> Instance 2 ( pool 22 / cpus=5):
> 
>  [ 93.280894] workqueue kvm-irqfd-cleanup: flags=0x0
>  [ 93.280896] pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=4 refcnt=5
>  [ 93.280900] in-flight: 151:irqfd_shutdown ,4246:irqfd_shutdown ,
>  4241:irqfd_shutdown ,4243:irqfd_shutdown
> 
> These are from crosvm instances with multiple virtio devices
> (virtio-blk, virtio-net, virtio-input, etc.), each registering an irqfd
> with a resampler. During VM shutdown, all irqfds are detached
> concurrently, queueing that many irqfd_shutdown work items.
> 
> The 4 workers are not saturating CPU — they're all in D state. But they
> ARE all bound to the same per-CPU pool as rcu_gp's process_srcu work.
> 
> > 
> > 3) Do we have a concrete wait-for cycle showing that tasks blocked
> >  on resampler_lock are in turn preventing SRCU GP completion?
> > 
> Yes, in both instances the hung task dump identifies the mutex holder
> stuck in synchronize_srcu, with the other workers waiting on the mutex.
> 
> Instance 1 (t=314s):
> 
>  Worker pid 4044 — MUTEX HOLDER, stuck in synchronize_srcu:
> 
>  [ 315.963979] task:kworker/3:8 state:D pid:4044
>  [ 315.977125] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
>  [ 316.012504] __synchronize_srcu+0x100/0x130
>  [ 316.023157] irqfd_resampler_shutdown+0xf0/0x150 <-- offset 0xf0 (synchronize_srcu)
> 
>  Workers pid 39, 102, 157 — MUTEX WAITERS:
> 
>  [ 314.793025] task:kworker/3:4 state:D pid:157
>  [ 314.837472] __mutex_lock+0x409/0xd90
>  [ 314.843100] irqfd_resampler_shutdown+0x23/0x150 <-- offset 0x23 (mutex_lock)
> 
> Instance 2 (t=343s):
> 
>  Worker pid 4241 — MUTEX HOLDER, stuck in synchronize_srcu:
> 
>  [ 343.193294] task:kworker/5:4 state:D pid:4241
>  [ 343.193299] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
>  [ 343.193328] __synchronize_srcu+0x100/0x130
>  [ 343.193335] irqfd_resampler_shutdown+0xf0/0x150 <-- offset 0xf0 (synchronize_srcu)
> 
>  Workers pid 151, 4243, 4246 — MUTEX WAITERS:
> 
>  [ 343.193369] task:kworker/5:6 state:D pid:4243
>  [ 343.193397] __mutex_lock+0x37d/0xbb0
>  [ 343.193397] irqfd_resampler_shutdown+0x23/0x150 <-- offset 0x23 (mutex_lock)
> 
> Both instances show the identical wait-for cycle:
> 
>  1. One worker holds resampler_lock, blocks in __synchronize_srcu
>  (waiting for SRCU grace period)
>  2. SRCU GP needs process_srcu to run — but it stays "pending"
>  on the same pool
>  3. Other irqfd workers block on __mutex_lock in the same pool
>  4. The pool is marked "hung" and no pending work makes progress
>  for 250-300 seconds until kernel panic
> 
> > 
> > 4) Is the behavior reproducible in both irqfd_resampler_shutdown()
> >  and kvm_irqfd_assign() paths?
> > 
In our 4 crash instances the stuck mutex holder is always in
irqfd_resampler_shutdown() at offset 0xf0 (synchronize_srcu). This is
consistent with these all being VM shutdown scenarios, where only
irqfd_shutdown workqueue items run.
> 
> The kvm_irqfd_assign() path was identified by Vineeth Pillai (Google)
> during a VM create/destroy stress test where assign and shutdown race.
> His traces showed kvm_irqfd (the assign path) stuck in
> synchronize_srcu_expedited with irqfd_resampler_shutdown blocked on
> the mutex, and workqueue pwq 46 at active=1024 refcnt=2062.
> 
> > 
> > If SRCU GP remains independent, it would help distinguish whether
> >  this is a strict deadlock or a form of workqueue starvation / lock
> >  contention.
> > 
> Based on the data from both instances, SRCU GP is NOT remaining
> independent. process_srcu stays permanently pending on the affected
> per-CPU pool for 250-300 seconds. But it's not just process_srcu —
> ALL pending work on the pool is stuck, including items from events,
> cgroup, mm, slub, and other workqueues.
> 
> > 
> > A timestamp-correlated dump (blocked stacks + workqueue state +
> >  SRCU GP activity) would likely be sufficient to classify this.
> > 
> I hope the correlated dumps above from both instances are helpful.
> To summarize the timeline (consistent across both):
> 
>  t=0: VM shutdown begins, crosvm detaches irqfds
>  t=~14: 4 irqfd_shutdown work items queued on WQ_PERCPU pool
>  One worker acquires resampler_lock, enters synchronize_srcu
>  Other 3 workers block on __mutex_lock
>  t=~43: First "BUG: workqueue lockup" — pool detected stuck
>  rcu_gp: process_srcu shown as "pending" on same pool
>  t=~93 through t=~312: repeated dumps every ~30s
>  process_srcu remains permanently "pending"
>  Pool has idle workers but no pending work executes
>  t=~314: Hung task dump confirms mutex holder in __synchronize_srcu
>  t=~316: init triggers sysrq crash → kernel panic
> 

Thanks, this is useful and much clearer.

One thing that is still unclear is dispatch behavior:
`process_srcu` stays pending for a long time, while the same pwq dump shows idle workers.

So the key question is: what prevents pending work from being dispatched on that pwq?
Is it due to:
  1) pwq stalled/hung state,
  2) worker availability/affinity constraints,
  3) or another dispatch-side condition?

Also, for scope:
- your crash instances consistently show the shutdown path
  (irqfd_resampler_shutdown + synchronize_srcu),
- while the assign-path evidence, as far as this thread's data goes,
  comes from a separate stress case.

A time-aligned dump with pwq state, pending/in-flight lists, and worker states
should help clarify this.


> > 
> > Happy to help look at traces if available.
> > 
> I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
> instances. Shall I post them or send them off-list?
> 

If possible, please post sanitized ramoops/dmesg logs on-list so others can validate.

Thanx, Kunwu

> Thanks,
> Sonam
>
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Sean Christopherson 6 days, 8 hours ago
+srcu folks

Please don't post subsequent versions In-Reply-To previous versions, it tends to
muck up tooling.

On Mon, Mar 23, 2026, Sonam Sanju wrote:
> irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
> synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
> This can deadlock when multiple irqfd workers run concurrently on the
> kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
> created and destroyed:
> 
>   CPU A (mutex holder)               CPU B/C/D (mutex waiters)
>   irqfd_shutdown()                   irqfd_shutdown() / kvm_irqfd_assign()
>    irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
>     mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock) //BLOCKED
>     list_del_rcu(...)                     ...blocked...
>     synchronize_srcu_expedited()      // Waiters block workqueue,
>       // waits for SRCU grace            preventing SRCU grace
>       // period which requires            period from completing
>       // workqueue progress          --- DEADLOCK ---
> 
> In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
> the else branch is called directly within the mutex.  In the if-last
> branch, kvm_unregister_irq_ack_notifier() also calls
> synchronize_srcu_expedited() internally.  In kvm_irqfd_assign(),
> synchronize_srcu_expedited() is called after list_add_rcu() but
> before mutex_unlock().  All paths can block indefinitely because:
> 
>   1. synchronize_srcu_expedited() waits for an SRCU grace period
>   2. SRCU grace period completion needs workqueue workers to run
>   3. The blocked mutex waiters occupy workqueue slots preventing progress

Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
underlying flaw.  Essentially, this would be establishing a rule that
synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
not viable.

>   4. The mutex holder never releases the lock -> deadlock
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Paul E. McKenney 6 days, 6 hours ago
On Tue, Mar 31, 2026 at 11:17:19AM -0700, Sean Christopherson wrote:
> +srcu folks
> 
> Please don't post subsequent versions In-Reply-To previous versions, it tends to
> muck up tooling.
> 
> On Mon, Mar 23, 2026, Sonam Sanju wrote:
> > irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
> > synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
> > This can deadlock when multiple irqfd workers run concurrently on the
> > kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
> > created and destroyed:
> > 
> >   CPU A (mutex holder)               CPU B/C/D (mutex waiters)
> >   irqfd_shutdown()                   irqfd_shutdown() / kvm_irqfd_assign()
> >    irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
> >     mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock) //BLOCKED
> >     list_del_rcu(...)                     ...blocked...
> >     synchronize_srcu_expedited()      // Waiters block workqueue,
> >       // waits for SRCU grace            preventing SRCU grace
> >       // period which requires            period from completing
> >       // workqueue progress          --- DEADLOCK ---
> > 
> > In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
> > the else branch is called directly within the mutex.  In the if-last
> > branch, kvm_unregister_irq_ack_notifier() also calls
> > synchronize_srcu_expedited() internally.  In kvm_irqfd_assign(),
> > synchronize_srcu_expedited() is called after list_add_rcu() but
> > before mutex_unlock().  All paths can block indefinitely because:
> > 
> >   1. synchronize_srcu_expedited() waits for an SRCU grace period
> >   2. SRCU grace period completion needs workqueue workers to run
> >   3. The blocked mutex waiters occupy workqueue slots preventing progress
> 
> Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
> underlying flaw.  Essentially, this would be establishing a rule that
> synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
> not viable.

First, it is OK to invoke synchronize_srcu_expedited() while holding
a mutex.  Second, the synchronize_srcu_expedited() function's use of
workqueues is the same as that of synchronize_srcu(), so in an alternate
universe where it was not OK to invoke synchronize_srcu_expedited() while
holding a mutex, it would also not be OK to invoke synchronize_srcu()
while holding that same mutex.  Third, it is also OK to acquire that
same mutex within a workqueue handler.  Fourth, SRCU and RCU use their
own workqueue, which no one else should be using (and that prohibition
most definitely includes the irqfd workers).

As a result, I do have to ask...  When you say "multiple irqfd workers",
exactly how many such workers are you running?

							Thanx, Paul

> >   4. The mutex holder never releases the lock -> deadlock
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Paul E. McKenney 4 hours ago
On Tue, Mar 31, 2026 at 01:51:11PM -0700, Paul E. McKenney wrote:
> On Tue, Mar 31, 2026 at 11:17:19AM -0700, Sean Christopherson wrote:
> > +srcu folks

[ . . . ]

> > Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
> > underlying flaw.  Essentially, this would be establishing a rule that
> > synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
> > not viable.
> 
> First, it is OK to invoke synchronize_srcu_expedited() while holding
> a mutex.  Second, the synchronize_srcu_expedited() function's use of
> workqueues is the same as that of synchronize_srcu(), so in an alternate
> universe where it was not OK to invoke synchronize_srcu_expedited() while
> holding a mutex, it would also not be OK to invoke synchronize_srcu()
> while holding that same mutex.  Third, it is also OK to acquire that
> same mutex within a workqueue handler.  Fourth, SRCU and RCU use their
> own workqueue, which no one else should be using (and that prohibition
> most definitely includes the irqfd workers).
> 
> As a result, I do have to ask...  When you say "multiple irqfd workers",
> exactly how many such workers are you running?

Just to be clear, I am guessing that you have the workqueues counterpart
to a fork bomb.  However, if you are using a small finite number of
workqueue handlers, then we need to make adjustments in SRCU, workqueues,
or maybe SRCU's use of workqueues.

So if my fork-bomb guess is incorrect, please let me know.

							Thanx, Paul

> > >   4. The mutex holder never releases the lock -> deadlock
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Sonam Sanju 5 days, 17 hours ago
From: Sonam Sanju <sonam.sanju@intel.com>

On Tue, Mar 31, 2026 at 01:51:00PM -0700, Paul E. McKenney wrote:
> On Tue, Mar 31, 2026 at 11:17:19AM -0700, Sean Christopherson wrote:
> > Please don't post subsequent versions In-Reply-To previous versions, it tends to
> > muck up tooling.

Noted, will send future versions as new top-level threads. Sorry about
that.

> > Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
> > underlying flaw.  Essentially, this would be establishing a rule that
> > synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
> > not viable.
>
> First, it is OK to invoke synchronize_srcu_expedited() while holding
> a mutex.  Second, the synchronize_srcu_expedited() function's use of
> workqueues is the same as that of synchronize_srcu(), so in an alternate
> universe where it was not OK to invoke synchronize_srcu_expedited() while
> holding a mutex, it would also not be OK to invoke synchronize_srcu()
> while holding that same mutex.  Third, it is also OK to acquire that
> same mutex within a workqueue handler.  Fourth, SRCU and RCU use their
> own workqueue, which no one else should be using (and that prohibition
> most definitely includes the irqfd workers).

Thank you for clarifying this. 

> As a result, I do have to ask...  When you say "multiple irqfd workers",
> exactly how many such workers are you running?

While running cold/warm reboot cycling on our Android platforms with a
6.18 kernel, the hung_task traces consistently show 8-15
kvm-irqfd-cleanup workers in D state.  These are crosvm instances with
roughly 10-16 irqfd lines per VM (virtio-blk, virtio-net, virtio-input,
virtio-snd, etc., each with a resampler).

Vineeth Pillai (Google) reproduced a related scenario under a VM
create/destroy stress test where the workqueue reached active=1024
refcnt=2062, though that is a much more extreme case than what we see
during normal shutdown.

The first half of the deadlock cycle is definitely present: one worker
holds resampler_lock and blocks in synchronize_srcu_expedited(), while
the remaining 8-15 workers block in __mutex_lock() at
irqfd_resampler_shutdown().

Thanks,
Sonam