[PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock

Posted by Sonam Sanju 1 week, 4 days ago
irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
This can deadlock when multiple irqfd workers run concurrently on the
kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
created and destroyed:

  CPU A (mutex holder)               CPU B/C/D (mutex waiters)
  irqfd_shutdown()                   irqfd_shutdown() / kvm_irqfd_assign()
   irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
    mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock) //BLOCKED
    list_del_rcu(...)                     ...blocked...
    synchronize_srcu_expedited()      // Waiters block workqueue,
      // waits for SRCU grace            preventing SRCU grace
      // period which requires            period from completing
      // workqueue progress          --- DEADLOCK ---

In irqfd_resampler_shutdown(), the else branch calls
synchronize_srcu_expedited() directly while holding the mutex, and in
the branch that removes the last irqfd,
kvm_unregister_irq_ack_notifier() calls it internally.  In
kvm_irqfd_assign(), synchronize_srcu_expedited() is called after
list_add_rcu() but before mutex_unlock().  All paths can block
indefinitely because:

  1. synchronize_srcu_expedited() waits for an SRCU grace period
  2. SRCU grace period completion needs workqueue workers to run
  3. The blocked mutex waiters occupy workqueue slots preventing progress
  4. The mutex holder never releases the lock -> deadlock

Fix both paths by releasing the mutex before calling
synchronize_srcu_expedited().

In irqfd_resampler_shutdown(), use a bool last flag to track whether
this is the final irqfd for the resampler, then release the mutex
before the SRCU synchronization.  This is safe because list_del_rcu()
already removed the entries under the mutex, and
kvm_unregister_irq_ack_notifier() uses its own locking (kvm->irq_lock).

In kvm_irqfd_assign(), simply move synchronize_srcu_expedited() after
mutex_unlock().  The SRCU grace period still completes before the irqfd
goes live (the subsequent srcu_read_lock() ensures ordering).

Signed-off-by: Sonam Sanju <sonam.sanju@intel.com>
---
v2:
 - Fix the same deadlock in kvm_irqfd_assign() (Vineeth Pillai)

 virt/kvm/eventfd.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 0e8b8a2c5b79..8ae9f81f8bb3 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -93,6 +93,7 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
 {
 	struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
 	struct kvm *kvm = resampler->kvm;
+	bool last = false;
 
 	mutex_lock(&kvm->irqfds.resampler_lock);
 
@@ -100,19 +101,27 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
 
 	if (list_empty(&resampler->list)) {
 		list_del_rcu(&resampler->link);
+		last = true;
+	}
+
+	mutex_unlock(&kvm->irqfds.resampler_lock);
+
+	/*
+	 * synchronize_srcu_expedited() (called explicitly below, or internally
+	 * by kvm_unregister_irq_ack_notifier()) must not be invoked under
+	 * resampler_lock.  Holding the mutex while waiting for an SRCU grace
+	 * period creates a deadlock: the blocked mutex waiters occupy workqueue
+	 * slots that the SRCU grace period machinery needs to make forward
+	 * progress.
+	 */
+	if (last) {
 		kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
-		/*
-		 * synchronize_srcu_expedited(&kvm->irq_srcu) already called
-		 * in kvm_unregister_irq_ack_notifier().
-		 */
 		kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
 			    resampler->notifier.gsi, 0, false);
 		kfree(resampler);
 	} else {
 		synchronize_srcu_expedited(&kvm->irq_srcu);
 	}
-
-	mutex_unlock(&kvm->irqfds.resampler_lock);
 }
 
 /*
@@ -450,9 +459,16 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
 		}
 
 		list_add_rcu(&irqfd->resampler_link, &irqfd->resampler->list);
-		synchronize_srcu_expedited(&kvm->irq_srcu);
 
 		mutex_unlock(&kvm->irqfds.resampler_lock);
+
+		/*
+		 * Ensure the resampler_link is SRCU-visible before the irqfd
+		 * itself goes live.  Moving synchronize_srcu_expedited() outside
+		 * the resampler_lock avoids deadlock with shutdown workers waiting
+		 * for the mutex while SRCU waits for workqueue progress.
+		 */
+		synchronize_srcu_expedited(&kvm->irq_srcu);
 	}
 
 	/*
-- 
2.34.1
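As an illustration for readers, the ordering change in
irqfd_resampler_shutdown() boils down to a "decide under the lock, wait
outside it" pattern.  The sketch below is a hypothetical userspace
pthread analogue, not the kernel code: the resampler list is reduced to
a counter, and synchronize_srcu_expedited() is a no-op stand-in.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace analogue of the fixed irqfd_resampler_shutdown():
 * the "is this the last irqfd?" decision is recorded under the
 * lock, but the grace-period wait happens after unlock. */

static pthread_mutex_t resampler_lock = PTHREAD_MUTEX_INITIALIZER;
static int resampler_users = 2;   /* stand-in for resampler->list */
static int resampler_freed;

/* Stand-in for synchronize_srcu_expedited(); in the kernel this
 * blocks for an SRCU grace period, which is exactly the wait that
 * must not happen while resampler_lock is held. */
static void fake_synchronize_srcu(void)
{
}

static void resampler_shutdown_one(void)
{
	bool last = false;

	pthread_mutex_lock(&resampler_lock);
	if (--resampler_users == 0)
		last = true;          /* decision made under the lock */
	pthread_mutex_unlock(&resampler_lock);

	/* The long wait runs with the lock dropped, so other workers
	 * can pass through the critical section in the meantime. */
	fake_synchronize_srcu();

	if (last)
		resampler_freed = 1;  /* kfree(resampler) in the kernel */
}
```

Because list removal (here, the decrement) completes under the mutex
before the wait, dropping the lock early does not change which caller
observes "last"; it only changes where the blocking happens.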
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Kunwu Chan 2 days, 11 hours ago
On 3/23/26 14:42, Sonam Sanju wrote:
> irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
> synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
> This can deadlock when multiple irqfd workers run concurrently on the
> kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
> created and destroyed:
>
>   CPU A (mutex holder)               CPU B/C/D (mutex waiters)
>   irqfd_shutdown()                   irqfd_shutdown() / kvm_irqfd_assign()
>    irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
>     mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock) //BLOCKED
>     list_del_rcu(...)                     ...blocked...
>     synchronize_srcu_expedited()      // Waiters block workqueue,
>       // waits for SRCU grace            preventing SRCU grace
>       // period which requires            period from completing
>       // workqueue progress          --- DEADLOCK ---
>
> In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
> the else branch is called directly within the mutex.  In the if-last
> branch, kvm_unregister_irq_ack_notifier() also calls
> synchronize_srcu_expedited() internally.  In kvm_irqfd_assign(),
> synchronize_srcu_expedited() is called after list_add_rcu() but
> before mutex_unlock().  All paths can block indefinitely because:
>
>   1. synchronize_srcu_expedited() waits for an SRCU grace period
>   2. SRCU grace period completion needs workqueue workers to run
>   3. The blocked mutex waiters occupy workqueue slots preventing progress
>   4. The mutex holder never releases the lock -> deadlock
>
> Fix both paths by releasing the mutex before calling
> synchronize_srcu_expedited().
>
> In irqfd_resampler_shutdown(), use a bool last flag to track whether
> this is the final irqfd for the resampler, then release the mutex
> before the SRCU synchronization.  This is safe because list_del_rcu()
> already removed the entries under the mutex, and
> kvm_unregister_irq_ack_notifier() uses its own locking (kvm->irq_lock).
>
> In kvm_irqfd_assign(), simply move synchronize_srcu_expedited() after
> mutex_unlock().  The SRCU grace period still completes before the irqfd
> goes live (the subsequent srcu_read_lock() ensures ordering).
>
> Signed-off-by: Sonam Sanju <sonam.sanju@intel.com>
> ---
> v2:
>  - Fix the same deadlock in kvm_irqfd_assign() (Vineeth Pillai)
>
>  virt/kvm/eventfd.c | 30 +++++++++++++++++++++++-------
>  1 file changed, 23 insertions(+), 7 deletions(-)
>
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 0e8b8a2c5b79..8ae9f81f8bb3 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -93,6 +93,7 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
>  {
>  	struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
>  	struct kvm *kvm = resampler->kvm;
> +	bool last = false;
>  
>  	mutex_lock(&kvm->irqfds.resampler_lock);
>  
> @@ -100,19 +101,27 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
>  
>  	if (list_empty(&resampler->list)) {
>  		list_del_rcu(&resampler->link);
> +		last = true;
> +	}
> +
> +	mutex_unlock(&kvm->irqfds.resampler_lock);
> +
> +	/*
> +	 * synchronize_srcu_expedited() (called explicitly below, or internally
> +	 * by kvm_unregister_irq_ack_notifier()) must not be invoked under
> +	 * resampler_lock.  Holding the mutex while waiting for an SRCU grace
> +	 * period creates a deadlock: the blocked mutex waiters occupy workqueue
> +	 * slots that the SRCU grace period machinery needs to make forward
> +	 * progress.
> +	 */
> +	if (last) {
>  		kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
> -		/*
> -		 * synchronize_srcu_expedited(&kvm->irq_srcu) already called
> -		 * in kvm_unregister_irq_ack_notifier().
> -		 */
>  		kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
>  			    resampler->notifier.gsi, 0, false);
>  		kfree(resampler);
>  	} else {
>  		synchronize_srcu_expedited(&kvm->irq_srcu);
>  	}
> -
> -	mutex_unlock(&kvm->irqfds.resampler_lock);
>  }
>  
>  /*
> @@ -450,9 +459,16 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
>  		}
>  
>  		list_add_rcu(&irqfd->resampler_link, &irqfd->resampler->list);
> -		synchronize_srcu_expedited(&kvm->irq_srcu);
>  
>  		mutex_unlock(&kvm->irqfds.resampler_lock);
> +
> +		/*
> +		 * Ensure the resampler_link is SRCU-visible before the irqfd
> +		 * itself goes live.  Moving synchronize_srcu_expedited() outside
> +		 * the resampler_lock avoids deadlock with shutdown workers waiting
> +		 * for the mutex while SRCU waits for workqueue progress.
> +		 */
> +		synchronize_srcu_expedited(&kvm->irq_srcu);
>  	}
>  
>  	/*

Building on the discussion so far, it would be helpful from the SRCU
side to gather a bit more evidence to classify the issue.

Calling synchronize_srcu_expedited() while holding a mutex is generally
valid, so the observed behavior may be workload-dependent.

The reported deadlock seems to rely on the assumption that SRCU grace
period progress is indirectly blocked by irqfd workqueue saturation.
It would be good to confirm whether that assumption actually holds.

In particular:

1) Are SRCU GP kthreads/workers still making forward progress when
the system is stuck?

2) How many irqfd workers are active in the reported scenario, and
can they saturate CPU or worker pools?

3) Do we have a concrete wait-for cycle showing that tasks blocked
on resampler_lock are in turn preventing SRCU GP completion?

4) Is the behavior reproducible in both irqfd_resampler_shutdown()
and kvm_irqfd_assign() paths?

If SRCU GP remains independent, it would help distinguish whether
this is a strict deadlock or a form of workqueue starvation / lock
contention.

A timestamp-correlated dump (blocked stacks + workqueue state +
SRCU GP activity) would likely be sufficient to classify this.

Happy to help look at traces if available.

Thanx, Kunwu
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Sonam Sanju 2 days, 6 hours ago
From: Sonam Sanju <sonam.sanju@intel.com>

On Wed, Apr 01, 2026 at 05:34:58PM +0800, Kunwu Chan wrote:
> Building on the discussion so far, it would be helpful from the SRCU
> side to gather a bit more evidence to classify the issue.
>
> Calling synchronize_srcu_expedited() while holding a mutex is generally
> valid, so the observed behavior may be workload-dependent.

> The reported deadlock seems to rely on the assumption that SRCU grace
> period progress is indirectly blocked by irqfd workqueue saturation.
> It would be good to confirm whether that assumption actually holds.

I went back through our logs from two independent crash instances and
can now provide data for each of your questions.

> 1) Are SRCU GP kthreads/workers still making forward progress when
> the system is stuck?

No.  In both crash instances, process_srcu work items remain permanently
"pending" (never "in-flight") throughout the entire hang.

Instance 1 —  kernel 6.18.8, pool 14 (cpus=3):

  [  62.712760] workqueue rcu_gp: flags=0x108
  [  62.717801]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
  [  62.717801]     pending: 2*process_srcu

  [  187.735092] workqueue rcu_gp: flags=0x108           (125 seconds later)
  [  187.735093]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=2 refcnt=3
  [  187.735093]     pending: 2*process_srcu              (still pending)

  9 consecutive dumps from t=62s to t=312s — process_srcu never runs.

Instance 2 —  kernel 6.18.2, pool 22 (cpus=5):

  [  93.280711] workqueue rcu_gp: flags=0x108
  [  93.280713]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
  [  93.280716]     pending: process_srcu

  [  309.040801] workqueue rcu_gp: flags=0x108           (216 seconds later)
  [  309.040806]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=1 refcnt=2
  [  309.040806]     pending: process_srcu               (still pending)

  8 consecutive dumps from t=93s to t=341s — process_srcu never runs.

In both cases, rcu_gp's process_srcu is bound to the SAME per-CPU pool
where the kvm-irqfd-cleanup workers are blocked.  Both pools have idle
workers but are marked as hung/stalled:

  Instance 1: pool 14: cpus=3 hung=174s workers=11 idle: 4046 4038 4045 4039 4043 156 77 (7 idle)
  Instance 2: pool 22: cpus=5 hung=297s workers=12 idle: 4242 51 4248 4247 4245 435 4244 4239 (8 idle)

> 2) How many irqfd workers are active in the reported scenario, and
> can they saturate CPU or worker pools?

4 kvm-irqfd-cleanup workers in both instances, consistently across all
dumps:

Instance 1 (pool 14 / cpus=3):

  [  62.831877] workqueue kvm-irqfd-cleanup: flags=0x0
  [  62.837838]   pwq 14: cpus=3 node=0 flags=0x0 nice=0 active=4 refcnt=5
  [  62.837838]     in-flight: 157:irqfd_shutdown ,4044:irqfd_shutdown ,
                               102:irqfd_shutdown ,39:irqfd_shutdown

Instance 2 (pool 22 / cpus=5):

  [  93.280894] workqueue kvm-irqfd-cleanup: flags=0x0
  [  93.280896]   pwq 22: cpus=5 node=0 flags=0x0 nice=0 active=4 refcnt=5
  [  93.280900]     in-flight: 151:irqfd_shutdown ,4246:irqfd_shutdown ,
                               4241:irqfd_shutdown ,4243:irqfd_shutdown

These are from crosvm instances with multiple virtio devices
(virtio-blk, virtio-net, virtio-input, etc.), each registering an irqfd
with a resampler.  During VM shutdown, all irqfds are detached
concurrently, queueing that many irqfd_shutdown work items.

The 4 workers are not saturating CPU — they're all in D state.  But they
ARE all bound to the same per-CPU pool as rcu_gp's process_srcu work.

> 3) Do we have a concrete wait-for cycle showing that tasks blocked
> on resampler_lock are in turn preventing SRCU GP completion?

Yes, in both instances the hung task dump identifies the mutex holder
stuck in synchronize_srcu, with the other workers waiting on the mutex.

Instance 1 (t=314s):

  Worker pid 4044 — MUTEX HOLDER, stuck in synchronize_srcu:

    [  315.963979] task:kworker/3:8     state:D  pid:4044
    [  315.977125] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
    [  316.012504]  __synchronize_srcu+0x100/0x130
    [  316.023157]  irqfd_resampler_shutdown+0xf0/0x150  <-- offset 0xf0 (synchronize_srcu)

  Workers pid 39, 102, 157 — MUTEX WAITERS:

    [  314.793025] task:kworker/3:4     state:D  pid:157
    [  314.837472]  __mutex_lock+0x409/0xd90
    [  314.843100]  irqfd_resampler_shutdown+0x23/0x150  <-- offset 0x23 (mutex_lock)

Instance 2 (t=343s):

  Worker pid 4241 — MUTEX HOLDER, stuck in synchronize_srcu:

    [  343.193294] task:kworker/5:4     state:D  pid:4241
    [  343.193299] Workqueue: kvm-irqfd-cleanup irqfd_shutdown
    [  343.193328]  __synchronize_srcu+0x100/0x130
    [  343.193335]  irqfd_resampler_shutdown+0xf0/0x150  <-- offset 0xf0 (synchronize_srcu)

  Workers pid 151, 4243, 4246 — MUTEX WAITERS:

    [  343.193369] task:kworker/5:6     state:D  pid:4243
    [  343.193397]  __mutex_lock+0x37d/0xbb0
    [  343.193397]  irqfd_resampler_shutdown+0x23/0x150  <-- offset 0x23 (mutex_lock)

Both instances show the identical wait-for cycle:

  1. One worker holds resampler_lock, blocks in __synchronize_srcu
     (waiting for SRCU grace period)
  2. SRCU GP needs process_srcu to run — but it stays "pending"
     on the same pool
  3. Other irqfd workers block on __mutex_lock in the same pool
  4. The pool is marked "hung" and no pending work makes progress
     for 250-300 seconds until kernel panic
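
The cycle can also be illustrated with a toy single-threaded scheduler
model (plain C; the pool size, the worker states, and the scheduling
loop are illustrative assumptions, not the workqueue implementation):

```c
#include <assert.h>
#include <stdbool.h>

enum wstate { WAIT_LOCK, WAIT_GP, DONE };

#define NWORKERS 4   /* four irqfd_shutdown items, as in the dumps */

static bool lock_held;
static bool gp_done;

/* One scheduling attempt for a shutdown worker. */
static bool step(enum wstate *s)
{
	switch (*s) {
	case WAIT_LOCK:
		if (lock_held)
			return false;       /* blocked on resampler_lock */
		lock_held = true;
		*s = WAIT_GP;
		return true;
	case WAIT_GP:
		if (!gp_done)
			return false;       /* blocked in synchronize_srcu */
		lock_held = false;
		*s = DONE;
		return true;
	case DONE:
		return false;
	}
	return false;
}

/* process_srcu (which would set gp_done) is queued behind the
 * shutdown workers and can only run once a pool slot frees up.
 * Returns -1 if no slot ever frees: the lockup from the dumps. */
static int simulate(void)
{
	enum wstate w[NWORKERS];
	int i, round;

	for (i = 0; i < NWORKERS; i++)
		w[i] = WAIT_LOCK;

	for (round = 0; round < 100; round++) {
		bool progress = false;
		int busy = 0;

		for (i = 0; i < NWORKERS; i++) {
			if (w[i] != DONE)
				busy++;
			if (step(&w[i]))
				progress = true;
		}
		if (busy == 0)
			return 0;           /* all work completed */
		if (busy < NWORKERS)
			gp_done = true;     /* free slot: process_srcu runs */
		else if (!progress)
			return -1;          /* all slots busy, none runnable */
	}
	return -1;
}
```

In this model the lockup is immediate: one worker holds the lock and
waits on the grace period, the rest occupy the remaining slots waiting
on the lock, and the grace-period work never gets a slot.  If the grace
period can complete independently of the saturated pool, the same
workers drain normally.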

> 4) Is the behavior reproducible in both irqfd_resampler_shutdown()
> and kvm_irqfd_assign() paths?

In our 4 crash instances the stuck mutex holder is always in 
irqfd_resampler_shutdown() at offset 0xf0 (synchronize_srcu).  This 
is consistent — these are all VM shutdown scenarios where only 
irqfd_shutdown workqueue items run.

The kvm_irqfd_assign() path was identified by Vineeth Pillai (Google)
during a VM create/destroy stress test where assign and shutdown race.
His traces showed kvm_irqfd (the assign path) stuck in
synchronize_srcu_expedited with irqfd_resampler_shutdown blocked on
the mutex, and workqueue pwq 46 at active=1024 refcnt=2062.

> If SRCU GP remains independent, it would help distinguish whether
> this is a strict deadlock or a form of workqueue starvation / lock
> contention.

Based on the data from both instances, SRCU GP is NOT remaining
independent.  process_srcu stays permanently pending on the affected
per-CPU pool for 250-300 seconds.  But it's not just process_srcu —
ALL pending work on the pool is stuck, including items from events,
cgroup, mm, slub, and other workqueues.


> A timestamp-correlated dump (blocked stacks + workqueue state +
> SRCU GP activity) would likely be sufficient to classify this.

I hope the correlated dumps above from both instances are helpful.
To summarize the timeline (consistent across both):

  t=0:   VM shutdown begins, crosvm detaches irqfds
  t=~14: 4 irqfd_shutdown work items queued on WQ_PERCPU pool
         One worker acquires resampler_lock, enters synchronize_srcu
         Other 3 workers block on __mutex_lock
  t=~43: First "BUG: workqueue lockup" — pool detected stuck
         rcu_gp: process_srcu shown as "pending" on same pool
  t=~93 through t=~312: repeated dumps every ~30s
         process_srcu remains permanently "pending"
         Pool has idle workers but no pending work executes
  t=~314: Hung task dump confirms mutex holder in __synchronize_srcu
  t=~316: init triggers sysrq crash → kernel panic

> Happy to help look at traces if available.

I can share the full console-ramoops-0 and dmesg-ramoops-0 from both
instances.  Shall I post them or send them off-list?

Thanks,
Sonam
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Sean Christopherson 3 days, 2 hours ago
+srcu folks

Please don't post subsequent versions In-Reply-To previous versions, it tends to
muck up tooling.

On Mon, Mar 23, 2026, Sonam Sanju wrote:
> irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
> synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
> This can deadlock when multiple irqfd workers run concurrently on the
> kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
> created and destroyed:
> 
>   CPU A (mutex holder)               CPU B/C/D (mutex waiters)
>   irqfd_shutdown()                   irqfd_shutdown() / kvm_irqfd_assign()
>    irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
>     mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock) //BLOCKED
>     list_del_rcu(...)                     ...blocked...
>     synchronize_srcu_expedited()      // Waiters block workqueue,
>       // waits for SRCU grace            preventing SRCU grace
>       // period which requires            period from completing
>       // workqueue progress          --- DEADLOCK ---
> 
> In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
> the else branch is called directly within the mutex.  In the if-last
> branch, kvm_unregister_irq_ack_notifier() also calls
> synchronize_srcu_expedited() internally.  In kvm_irqfd_assign(),
> synchronize_srcu_expedited() is called after list_add_rcu() but
> before mutex_unlock().  All paths can block indefinitely because:
> 
>   1. synchronize_srcu_expedited() waits for an SRCU grace period
>   2. SRCU grace period completion needs workqueue workers to run
>   3. The blocked mutex waiters occupy workqueue slots preventing progress

Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
underlying flaw.  Essentially, this would be establishing a rule that
synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
not viable.

>   4. The mutex holder never releases the lock -> deadlock
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Paul E. McKenney 3 days ago
On Tue, Mar 31, 2026 at 11:17:19AM -0700, Sean Christopherson wrote:
> +srcu folks
> 
> Please don't post subsequent versions In-Reply-To previous versions, it tends to
> muck up tooling.
> 
> On Mon, Mar 23, 2026, Sonam Sanju wrote:
> > irqfd_resampler_shutdown() and kvm_irqfd_assign() both call
> > synchronize_srcu_expedited() while holding kvm->irqfds.resampler_lock.
> > This can deadlock when multiple irqfd workers run concurrently on the
> > kvm-irqfd-cleanup workqueue during VM teardown or when VMs are rapidly
> > created and destroyed:
> > 
> >   CPU A (mutex holder)               CPU B/C/D (mutex waiters)
> >   irqfd_shutdown()                   irqfd_shutdown() / kvm_irqfd_assign()
> >    irqfd_resampler_shutdown()         irqfd_resampler_shutdown()
> >     mutex_lock(resampler_lock)  <---- mutex_lock(resampler_lock) //BLOCKED
> >     list_del_rcu(...)                     ...blocked...
> >     synchronize_srcu_expedited()      // Waiters block workqueue,
> >       // waits for SRCU grace            preventing SRCU grace
> >       // period which requires            period from completing
> >       // workqueue progress          --- DEADLOCK ---
> > 
> > In irqfd_resampler_shutdown(), the synchronize_srcu_expedited() in
> > the else branch is called directly within the mutex.  In the if-last
> > branch, kvm_unregister_irq_ack_notifier() also calls
> > synchronize_srcu_expedited() internally.  In kvm_irqfd_assign(),
> > synchronize_srcu_expedited() is called after list_add_rcu() but
> > before mutex_unlock().  All paths can block indefinitely because:
> > 
> >   1. synchronize_srcu_expedited() waits for an SRCU grace period
> >   2. SRCU grace period completion needs workqueue workers to run
> >   3. The blocked mutex waiters occupy workqueue slots preventing progress
> 
> Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
> underlying flaw.  Essentially, this would be establishing a rule that
> synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
> not viable.

First, it is OK to invoke synchronize_srcu_expedited() while holding
a mutex.  Second, the synchronize_srcu_expedited() function's use of
workqueues is the same as that of synchronize_srcu(), so in an alternate
universe where it was not OK to invoke synchronize_srcu_expedited() while
holding a mutex, it would also not be OK to invoke synchronize_srcu()
while holding that same mutex.  Third, it is also OK to acquire that
same mutex within a workqueue handler.  Fourth, SRCU and RCU use their
own workqueue, which no one else should be using (and that prohibition
most definitely includes the irqfd workers).

As a result, I do have to ask...  When you say "multiple irqfd workers",
exactly how many such workers are you running?

							Thanx, Paul

> >   4. The mutex holder never releases the lock -> deadlock
Re: [PATCH v2] KVM: irqfd: fix deadlock by moving synchronize_srcu out of resampler_lock
Posted by Sonam Sanju 2 days, 11 hours ago
From: Sonam Sanju <sonam.sanju@intel.com>

On Tue, Mar 31, 2026 at 01:51:00PM -0700, Paul E. McKenney wrote:
> On Tue, Mar 31, 2026 at 11:17:19AM -0700, Sean Christopherson wrote:
> > Please don't post subsequent versions In-Reply-To previous versions, it tends to
> > muck up tooling.

Noted, will send future versions as new top-level threads. Sorry about
that.

> > Unless I'm misunderstanding the bug, "fixing" this in KVM is papering over an
> > underlying flaw.  Essentially, this would be establishing a rule that
> > synchronize_srcu_expedited() can *never* be called while holding a mutex.  That's
> > not viable.
>
> First, it is OK to invoke synchronize_srcu_expedited() while holding
> a mutex.  Second, the synchronize_srcu_expedited() function's use of
> workqueues is the same as that of synchronize_srcu(), so in an alternate
> universe where it was not OK to invoke synchronize_srcu_expedited() while
> holding a mutex, it would also not be OK to invoke synchronize_srcu()
> while holding that same mutex.  Third, it is also OK to acquire that
> same mutex within a workqueue handler.  Fourth, SRCU and RCU use their
> own workqueue, which no one else should be using (and that prohibition
> most definitely includes the irqfd workers).

Thank you for clarifying this. 

> As a result, I do have to ask...  When you say "multiple irqfd workers",
> exactly how many such workers are you running?

While running cold/warm reboot cycling on our Android platforms with a
6.18 kernel, the hung_task traces consistently show 8-15
kvm-irqfd-cleanup workers in D state.  These are crosvm instances with
roughly 10-16 irqfd lines per VM (virtio-blk, virtio-net, virtio-input,
virtio-snd, etc., each with a resampler).

Vineeth Pillai (Google) reproduced a related scenario under a VM
create/destroy stress test where the workqueue reached active=1024
refcnt=2062, though that is a much more extreme case than what we see
during normal shutdown.

The first half of the deadlock cycle is clearly present: one worker
holds resampler_lock and blocks in synchronize_srcu_expedited() while
the remaining 8-15 workers block on __mutex_lock in
irqfd_resampler_shutdown().

Thanks,
Sonam