[PATCH v14 0/2] PCI/IOV: Fix SR-IOV locking races and AB-BA deadlock

Ionut Nechita (Wind River) posted 2 patches 1 month, 3 weeks ago
drivers/pci/iov.c       |  9 +++++----
drivers/pci/pci-sysfs.c | 30 +++++++++++++++++++++++++++++-
drivers/pci/probe.c     | 18 ++++++++++++++++--
3 files changed, 50 insertions(+), 7 deletions(-)
[PATCH v14 0/2] PCI/IOV: Fix SR-IOV locking races and AB-BA deadlock
Posted by Ionut Nechita (Wind River) 1 month, 3 weeks ago
From: Ionut Nechita <ionut.nechita@windriver.com>

From: Ionut Nechita <ionut.nechita@windriver.com>

Hi Bjorn,

This is v14 of the fix for the SR-IOV race between driver .remove()
and concurrent hotplug events.

Changes since v13 (Apr 21):
  - Patch 1/2: declared pci_rescan_remove_owner as
    'const struct task_struct *' to make it clear the pointer is
    only used for identity comparisons and never to modify the
    task_struct.  Suggested by Benjamin Block.
  - Patch 1/2: picked up Reviewed-by and Tested-by from
    Benjamin Block (given on v13).
  - Patch 2/2: unchanged from v13 (keeps R-b from Niklas Schnelle
    and R-b/T-b from Benjamin Block).

Changes since v12 (Apr 21):
  - Patch 1/2: replace mutex_get_owner() with explicit owner tracking
    (struct task_struct * + depth counter).  mutex_get_owner() is not
    exported outside the scheduler core and caused link failures in
    some build configurations:

      ld: vmlinux.o: in function `pci_lock_rescan_remove':
      drivers/pci/probe.c: undefined reference to `mutex_get_owner'

    The new implementation records the owner on first acquisition and
    clears it on final release, with a WARN_ON() catching mismatched
    unlock calls.  Semantics are unchanged.
  - Patch 1/2: dropped Reviewed-by/Tested-by since the locking
    primitive changed; please re-review.
  - Patch 2/2: unchanged from v12 (keeps R-b from Niklas Schnelle and
    R-b/T-b from Benjamin Block).

Changes since v11 (Mar 26):
  - Patch 2/2 picked up R-b Niklas Schnelle and R-b/T-b Benjamin Block
  - Rebased on linux-next (next-20260420)

Changes since v10 (Mar 18):
  - Patch 2/2: added kill_device() before device_release_driver() to
    prevent a new driver from binding between unbind and removal,
    closing the TOCTOU race window identified by Benjamin Block
  - Patch 1/2 unchanged from v10

Changes since v9 (Mar 10):
  - NEW patch 2/2: fix AB-BA deadlock in remove_store() by calling
    device_release_driver() before pci_stop_and_remove_bus_device_locked(),
    as suggested by Benjamin Block (addresses Guenter Roeck's report)
  - Patch 1/2 unchanged from v9

Changes since v8 (Mar 9):
  - Added Reviewed-by from Niklas Schnelle (IBM) and Tested-by (s390)
  - Added Fixes tags for the three related commits
  - Removed rescan/remove locking from sriov_numvfs_store() since
    locking is now handled in sriov_add_vfs() and sriov_del_vfs()
  - Rebased on linux-next (20260309)

The AB-BA deadlock:

  CPU0 (remove_store)               CPU1 (unbind_store)
  --------------------              --------------------
  pci_lock_rescan_remove()
                                    device_lock()
                                    driver .remove()
                                      sriov_del_vfs()
                                        pci_lock_rescan_remove()  <-- WAITS
  pci_stop_bus_device()
    device_release_driver()
      device_lock()                                               <-- WAITS

Patch 2/2 fixes this by:
  1. Marking the device as dead via kill_device() so no new driver
     can bind (prevents TOCTOU race between unbind and removal)
  2. Calling device_release_driver() before
     pci_stop_and_remove_bus_device_locked(), so both paths take
     locks in the same order: device_lock first, then
     pci_rescan_remove_lock

Note: the concurrent unbind_store + hotplug-event case (where the
hotplug handler takes pci_rescan_remove_lock before device_lock)
remains a known limitation.  This is a pre-existing issue that
Benjamin Block is addressing separately in:
  https://lore.kernel.org/linux-pci/354b9e4a54ced67f3c89df198041df19434fe4c8.1773235561.git.bblock@linux.ibm.com/

This race has been independently observed by multiple organizations:
  - IBM (s390 platform-generated hot-unplug events racing with
    sriov_del_vfs during PF driver unload)
  - NVIDIA (tested by Dragos Tatulea in earlier versions)
  - Intel (xe driver hitting lockdep warnings and deadlocks when
    calling pci_disable_sriov from .remove)
  - Wind River (original reporter and patch author)

Test environment:
  - Tested on s390 by Benjamin Block and Niklas Schnelle (IBM)
  - Tested on x86_64 with Intel and NVIDIA SR-IOV devices (earlier
    versions)

Based on linux-next (next-20260420).

Link: https://lore.kernel.org/linux-pci/20260214193235.262219-3-ionut.nechita@windriver.com/ [v1]
Link: https://lore.kernel.org/linux-pci/20260219212648.82606-1-ionut.nechita@windriver.com/ [v2]
Link: https://lore.kernel.org/lkml/20260225202434.18737-1-ionut.nechita@windriver.com/ [v3]
Link: https://lore.kernel.org/linux-pci/20260228120138.51197-2-ionut.nechita@windriver.com/ [v4]
Link: https://lore.kernel.org/linux-pci/20260303080903.28693-1-ionut.nechita@windriver.com/ [v5]
Link: https://lore.kernel.org/linux-pci/20260306082108.17322-1-ionut.nechita@windriver.com/ [v6]
Link: https://lore.kernel.org/linux-pci/20260308135352.80346-1-ionut.nechita@windriver.com/ [v7]
Link: https://lore.kernel.org/linux-pci/20260309194920.16459-1-ionut.nechita@windriver.com/ [v8]
Link: https://lore.kernel.org/linux-pci/20260310074303.17480-1-ionut.nechita@windriver.com/ [v9]
Link: https://lore.kernel.org/linux-pci/20260318210316.61975-1-ionut.nechita@windriver.com/ [v10]
Link: https://lore.kernel.org/linux-pci/20260326083534.23602-1-ionut.nechita@windriver.com/ [v11]
Link: https://lore.kernel.org/linux-pci/cover.1776755661.git.ionut.nechita@windriver.com/ [v12]
Link: https://lore.kernel.org/linux-pci/cover.1776756380.git.ionut.nechita@windriver.com/ [v13]

Ionut Nechita (Wind River) (2):
  PCI/IOV: Make pci_lock_rescan_remove() reentrant and protect
    sriov_add_vfs/sriov_del_vfs
  PCI: Fix AB-BA deadlock between device_lock and pci_rescan_remove_lock
    in remove_store

 drivers/pci/iov.c       |  9 +++++----
 drivers/pci/pci-sysfs.c | 30 +++++++++++++++++++++++++++++-
 drivers/pci/probe.c     | 18 ++++++++++++++++--
 3 files changed, 50 insertions(+), 7 deletions(-)

-- 
2.53.0
Re: [PATCH v14 0/2] PCI/IOV: Fix SR-IOV locking races and AB-BA deadlock
Posted by Bjorn Helgaas 3 weeks, 1 day ago
On Wed, Apr 22, 2026 at 09:32:40AM +0300, Ionut Nechita (Wind River) wrote:
> Hi Bjorn,
> 
> This is v14 of the fix for the SR-IOV race between driver .remove()
> and concurrent hotplug events.

Thanks for your persistence on this!

Can you take a look at the sashiko feedback and see whether there's
any merit to it?

https://sashiko.dev/#/patchset/cover.1776839248.git.ionut.nechita%40windriver.com
Re: [PATCH v14 0/2] PCI/IOV: Fix SR-IOV locking races and AB-BA deadlock
Posted by Benjamin Block 3 weeks ago
On Thu, May 21, 2026 at 03:13:12PM -0500, Bjorn Helgaas wrote:
> On Wed, Apr 22, 2026 at 09:32:40AM +0300, Ionut Nechita (Wind River) wrote:
> > This is v14 of the fix for the SR-IOV race between driver .remove()
> > and concurrent hotplug events.
> 
> Can you take a look at the sashiko feedback and see whether there's
> any merit to it?
> 
> https://sashiko.dev/#/patchset/cover.1776839248.git.ionut.nechita%40windriver.com

IIRC Ionut already responded to several of the findings a few patch versions
back:
https://lore.kernel.org/linux-pci/20260319202755.16081-1-ionut.nechita%40windriver.com/

Patch 01:

> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> [ ... ]
> > @@ -633,15 +629,18 @@ static int sriov_add_vfs(struct pci_dev *dev, u16 num_vfs)
> >  	if (dev->no_vf_scan)
> >  		return 0;
> >  
> > +	pci_lock_rescan_remove();
> Does moving this lock acquisition here create an ABBA deadlock? 
> Looking at sriov_numvfs_store(), the kernel acquires device_lock(&pdev->dev)
> before invoking the driver's sriov_configure() callback, which eventually
> reaches this point and calls pci_lock_rescan_remove(). This establishes a
> dependency of device_lock followed by pci_rescan_remove_lock.
> During a PCI hotplug event or sysfs remove_store, the subsystem first acquires
> pci_rescan_remove_lock, then calls pci_stop_and_remove_bus_device() which
> invokes device_release_driver() and takes device_lock(&pdev->dev). This
> creates the reverse dependency.
> Will an SR-IOV sysfs configuration racing with a hot-unplug event result in a
> deadlock? I understand this regression is fixed later in the patch series by
> commit 'PCI: Fix AB-BA deadlock between device_lock and pci_rescan_remove_lock
> in remove_store', but wanted to point out the interaction here.

This is already addressed in Patch 02.

> > diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> > index b63cd0c310bc0..91f1dae6943b6 100644
> > --- a/drivers/pci/probe.c
> > +++ b/drivers/pci/probe.c
> [ ... ]
> > @@ -3513,16 +3513,30 @@ EXPORT_SYMBOL_GPL(pci_rescan_bus);
> >   * routines should always be executed under this mutex.
> >   */
> >  DEFINE_MUTEX(pci_rescan_remove_lock);
> > +static const struct task_struct *pci_rescan_remove_owner;
> > +static unsigned int pci_rescan_remove_depth;
> >  
> >  void pci_lock_rescan_remove(void)
> >  {
> > -	mutex_lock(&pci_rescan_remove_lock);
> > +	if (pci_rescan_remove_owner == current) {
>
> Can this read cause a data race?
> This reads pci_rescan_remove_owner outside of the mutex critical section.
> Concurrently, another thread executing this same function can acquire the
> mutex and write to this shared variable.
>
> > +		pci_rescan_remove_depth++;
> > +	} else {
> > +		mutex_lock(&pci_rescan_remove_lock);
> > +		pci_rescan_remove_owner = current;
>
> Could the compiler optimize or tear these unprotected accesses? Should these
> reads and writes use READ_ONCE() and WRITE_ONCE() to safely perform the
> lockless accesses and avoid triggering warnings in tools like KCSAN?
>
> > +		pci_rescan_remove_depth = 1;
> > +	}
> >  }

Those two are wrong. `pci_rescan_remove_depth` is only modified when we
already own the mutex `pci_rescan_remove_owner == current`, so the writes are
safe. The reads are safe because, again, the relevant case is when the current
thread already owns the mutex, and then we already have the lock; when we
don't already have the lock we gonna try to take it, but this is also safe
because nobody gonna change the owner to us other than we ourself, and only
once we have taken the lock.

Patch 02:

> > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> > --- a/drivers/pci/pci-sysfs.c
> > +++ b/drivers/pci/pci-sysfs.c
> > @@ -521,8 +521,36 @@ static ssize_t remove_store(struct device *dev, struct device_attribute *attr,
> [ ... ]
> > +		 * By unbinding first, the driver's .remove() callback
> > +		 * (including any SR-IOV VF cleanup) completes before
> > +		 * pci_rescan_remove_lock is acquired, ensuring both
> > +		 * paths take locks in the same order.
> > +		 */
> > +		device_release_driver(dev);
>
> Does calling device_release_driver() here violate the bottom-up teardown
> guarantee if this device is a bridge?
>
> If the device is a PCIe bridge, unbinding the pcieport driver will call
> pci_disable_device(). Because child devices do not increment the parent
> bridge's enable_cnt during probe, the bridge's enable_cnt will drop to zero
> and clear the PCI_COMMAND_MASTER bit.
>
> This disables bus-mastering for the bridge while its child devices are still
> fully active. When they are later unbound during removal, their drivers might
> fail to communicate with the hardware, leading to master aborts, PCIe AER
> errors, and timeouts on DMA flushes.

Ionut previously addressed this in the post I linked above; identifying it as
false-positive.

> > 		pci_stop_and_remove_bus_device_locked(to_pci_dev(dev));
>
> Does this leave child devices vulnerable to the exact same AB-BA deadlock?
>
> When pci_stop_and_remove_bus_device_locked() executes, it acquires the
> pci_rescan_remove_lock and unbinds child devices recursively:
>
> pci_stop_and_remove_bus_device_locked()
>   pci_stop_bus_device()
>     pci_stop_dev(child)
>       device_release_driver(&child->dev)
>
> This attempts to acquire device_lock(&child->dev) while already holding
> pci_rescan_remove_lock.
>
> If a concurrent unbind_store() happens on the child device, it will acquire
> device_lock(&child->dev) first and then wait for pci_rescan_remove_lock,
> reproducing the original circular dependency.ecause then we can't concurrently change the state to `current`.

This is a false positive; we marked the device as "killed" and unbound the
device driver. So, when the concurrent unbind_store() is called `dev->driver`
is NULL and so it never gonna call device_driver_detach() and take the lock.
Or rather, it wouldn't even be called, because unbind_store() is supposedly
only called when some driver binds the device, which would not be the case
here.

At least thats my take on those latest findings.

-- 
Best Regards, Benjamin Block        /        Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH    /   https://www.ibm.com/privacy
Vors. Aufs.-R.: Wolfgang Wendt         /        Geschäftsführung: David Faller
Sitz der Ges.: Ehningen     /     Registergericht: AmtsG Stuttgart, HRB 243294
Re: [PATCH v14 0/2] PCI/IOV: Fix SR-IOV locking races and AB-BA deadlock
Posted by Benjamin Block 3 weeks, 1 day ago
Hey Bjorn,

On Wed, Apr 22, 2026 at 09:32:40AM +0300, Ionut Nechita (Wind River) wrote:
> From: Ionut Nechita <ionut.nechita@windriver.com>
> 
> This is v14 of the fix for the SR-IOV race between driver .remove()
> and concurrent hotplug events.
> 
--8<--
> 
> This race has been independently observed by multiple organizations:
>   - IBM (s390 platform-generated hot-unplug events racing with
>     sriov_del_vfs during PF driver unload)
>   - NVIDIA (tested by Dragos Tatulea in earlier versions)
>   - Intel (xe driver hitting lockdep warnings and deadlocks when
>     calling pci_disable_sriov from .remove)
>   - Wind River (original reporter and patch author)
> 
> Test environment:
>   - Tested on s390 by Benjamin Block and Niklas Schnelle (IBM)
>   - Tested on x86_64 with Intel and NVIDIA SR-IOV devices (earlier
>     versions)
> 
> Based on linux-next (next-20260420).
> 
--8<--
> 
> Ionut Nechita (Wind River) (2):
>   PCI/IOV: Make pci_lock_rescan_remove() reentrant and protect
>     sriov_add_vfs/sriov_del_vfs
>   PCI: Fix AB-BA deadlock between device_lock and pci_rescan_remove_lock
>     in remove_store
> 
>  drivers/pci/iov.c       |  9 +++++----
>  drivers/pci/pci-sysfs.c | 30 +++++++++++++++++++++++++++++-
>  drivers/pci/probe.c     | 18 ++++++++++++++++--
>  3 files changed, 50 insertions(+), 7 deletions(-)

do you think this patchset is ready integration? Or what do you think we'd
still need to do to get there?

This race, and especially those I fix in the follow-on patchset
https://lore.kernel.org/linux-pci/cover.1776868550.git.bblock%40linux.ibm.com/
are pretty painful on s390 with PCI hotplug. It forces us to reboot the
system, since there is no other way to get out of the deadlock.

-- 
Best Regards, Benjamin Block        /        Linux on IBM Z Kernel Development
IBM Deutschland Research & Development GmbH    /   https://www.ibm.com/privacy
Vors. Aufs.-R.: Wolfgang Wendt         /        Geschäftsführung: David Faller
Sitz der Ges.: Ehningen     /     Registergericht: AmtsG Stuttgart, HRB 243294