[PATCH rc] PCI: Fix nested pci_dev_reset_iommu_prepare/done()

Nicolin Chen posted 1 patch 2 weeks, 4 days ago
drivers/pci/pci.c    |  7 -------
drivers/pci/quirks.c | 29 +++++++++++------------------
2 files changed, 11 insertions(+), 25 deletions(-)
[PATCH rc] PCI: Fix nested pci_dev_reset_iommu_prepare/done()
Posted by Nicolin Chen 2 weeks, 4 days ago
Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
internally while both are calling pci_dev_reset_iommu_prepare/done().

This results in a nested pci_dev_reset_iommu_prepare/done() call:
    cxl_reset_bus_function():
        pci_dev_reset_iommu_prepare(); // outer
        pci_reset_bus_function():
            pci_dev_reset_iommu_prepare(); // inner (re-entry)
            ...
            pci_dev_reset_iommu_done(); // inner
        pci_dev_reset_iommu_done(); // outer

However, pci_dev_reset_iommu_prepare() doesn't allow a re-entry for now.

Digging further, I found that most functions in pci_dev_specific_reset()
except reset_ivb_igd() are calling pcie_flr() that has nested those two
functions as well.

Drop the outer callbacks in:
 - cxl_reset_bus_function()
 - pci_dev_specific_reset()

Replace with adding the callbacks in:
 + reset_ivb_igd()

Fixes: f5b16b802174 ("PCI: Suspend iommu function prior to resetting a device")
Cc: stable@vger.kernel.org
Reported-by: Shuai Xue <xueshuai@linux.alibaba.com>
Closes: https://lore.kernel.org/all/absKsk7qQOwzhpzv@Asurada-Nvidia/
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/pci/pci.c    |  7 -------
 drivers/pci/quirks.c | 29 +++++++++++------------------
 2 files changed, 11 insertions(+), 25 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8479c2e1f74f..8f1fd083a84c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4958,12 +4958,6 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
 	if (rc)
 		return -ENOTTY;
 
-	rc = pci_dev_reset_iommu_prepare(dev);
-	if (rc) {
-		pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", rc);
-		return rc;
-	}
-
 	if (reg & PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR) {
 		val = reg;
 	} else {
@@ -4978,7 +4972,6 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
 		pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
 				      reg);
 
-	pci_dev_reset_iommu_done(dev);
 	return rc;
 }
 
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 48946cca4be7..d1d210bc2a15 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3973,6 +3973,7 @@ static int reset_ivb_igd(struct pci_dev *dev, bool probe)
 	void __iomem *mmio_base;
 	unsigned long timeout;
 	u32 val;
+	int ret;
 
 	if (probe)
 		return 0;
@@ -3981,6 +3982,12 @@ static int reset_ivb_igd(struct pci_dev *dev, bool probe)
 	if (!mmio_base)
 		return -ENOMEM;
 
+	ret = pci_dev_reset_iommu_prepare(dev);
+	if (ret) {
+		pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
+		goto out_iounmap;
+	}
+
 	iowrite32(0x00000002, mmio_base + MSG_CTL);
 
 	/*
@@ -4006,8 +4013,10 @@ static int reset_ivb_igd(struct pci_dev *dev, bool probe)
 reset_complete:
 	iowrite32(0x00000002, mmio_base + NSDE_PWR_STATE);
 
+	pci_dev_reset_iommu_done(dev);
+out_iounmap:
 	pci_iounmap(dev, mmio_base);
-	return 0;
+	return ret;
 }
 
 /* Device-specific reset method for Chelsio T4-based adapters */
@@ -4257,22 +4266,6 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
 	{ 0 }
 };
 
-static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
-				    const struct pci_dev_reset_methods *i)
-{
-	int ret;
-
-	ret = pci_dev_reset_iommu_prepare(dev);
-	if (ret) {
-		pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
-		return ret;
-	}
-
-	ret = i->reset(dev, probe);
-	pci_dev_reset_iommu_done(dev);
-	return ret;
-}
-
 /*
  * These device-specific reset methods are here rather than in a driver
  * because when a host assigns a device to a guest VM, the host may need
@@ -4287,7 +4280,7 @@ int pci_dev_specific_reset(struct pci_dev *dev, bool probe)
 		     i->vendor == (u16)PCI_ANY_ID) &&
 		    (i->device == dev->device ||
 		     i->device == (u16)PCI_ANY_ID))
-			return __pci_dev_specific_reset(dev, probe, i);
+			return i->reset(dev, probe);
 	}
 
 	return -ENOTTY;
-- 
2.34.1
RE: [PATCH rc] PCI: Fix nested pci_dev_reset_iommu_prepare/done()
Posted by Tian, Kevin 2 weeks, 4 days ago
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, March 19, 2026 6:00 AM
> 
> Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
> internally while both are calling pci_dev_reset_iommu_prepare/done().
> 
> This results in a nested pci_dev_reset_iommu_prepare/done() call:
>     cxl_reset_bus_function():
>         pci_dev_reset_iommu_prepare(); // outer
>         pci_reset_bus_function():
>             pci_dev_reset_iommu_prepare(); // inner (re-entry)
>             ...
>             pci_dev_reset_iommu_done(); // inner
>         pci_dev_reset_iommu_done(); // outer
> 
> However, pci_dev_reset_iommu_prepare() doesn't allow a re-entry for now.
> 
> Digging further, I found that most functions in pci_dev_specific_reset()
> except reset_ivb_igd() are calling pcie_flr() that has nested those two
> functions as well.
> 
> Drop the outer callbacks in:
>  - cxl_reset_bus_function()
>  - pci_dev_specific_reset()
> 
> Replace with adding the callbacks in:
>  + reset_ivb_igd()
> 

I wonder whether it's cleaner to just make pci_dev_reset_iommu_prepare()
reentrant here.

what this patch does requires additional attention on specific reset functions.

e.g. reset_hinic_vf_dev() has a special logic waiting for firmware reset
done after calling pcie_flr(). With this patch the _done() notification will
be triggered earlier than expectation.

the original way is more maintainable with prepare/done equipped in
the highest level reset helpers in pci_reset_fn_methods[].

btw the AI review tool gave a comment:

https://sashiko.dev/#/patchset/20260318220028.1146905-1-nicolinc%40nvidia.com

it won't hold if my proposal is agreed. But it still applies to your quarantine
series which relies on accurate report of success reset.
Re: [PATCH rc] PCI: Fix nested pci_dev_reset_iommu_prepare/done()
Posted by Nicolin Chen 2 weeks, 4 days ago
On Thu, Mar 19, 2026 at 02:22:39AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Thursday, March 19, 2026 6:00 AM
> > 
> > Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
> > internally while both are calling pci_dev_reset_iommu_prepare/done().
> > 
> > This results in a nested pci_dev_reset_iommu_prepare/done() call:
> >     cxl_reset_bus_function():
> >         pci_dev_reset_iommu_prepare(); // outer
> >         pci_reset_bus_function():
> >             pci_dev_reset_iommu_prepare(); // inner (re-entry)
> >             ...
> >             pci_dev_reset_iommu_done(); // inner
> >         pci_dev_reset_iommu_done(); // outer
> > 
> > However, pci_dev_reset_iommu_prepare() doesn't allow a re-entry for now.
> > 
> > Digging further, I found that most functions in pci_dev_specific_reset()
> > except reset_ivb_igd() are calling pcie_flr() that has nested those two
> > functions as well.
> > 
> > Drop the outer callbacks in:
> >  - cxl_reset_bus_function()
> >  - pci_dev_specific_reset()
> > 
> > Replace with adding the callbacks in:
> >  + reset_ivb_igd()
> > 
> 
> I wonder whether it's cleaner to just make pci_dev_reset_iommu_prepare()
> reentrant here.
> 
> what this patch does requires additional attention on specific reset functions.
> 
> e.g. reset_hinic_vf_dev() has a special logic waiting for firmware reset
> done after calling pcie_flr(). With this patch the _done() notification will
> be triggered earlier than expectation.

!

> the original way is more maintainable with prepare/done equipped in
> the highest level reset helpers in pci_reset_fn_methods[].

Yes. I agree. I will add a reset_cnt and allow re-entry.

> btw the AI review tool gave a comment:
> 
> https://sashiko.dev/#/patchset/20260318220028.1146905-1-nicolinc%40nvidia.com

My browsers on both Linux and Windows blocked this site :-/

> it won't hold if my proposal is agreed. But it still applies to your quarantine
> series which relies on accurate report of success reset.

Would you mind posting it here, if it's not too long?

Thanks
Nicolin
RE: [PATCH rc] PCI: Fix nested pci_dev_reset_iommu_prepare/done()
Posted by Tian, Kevin 2 weeks, 4 days ago
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, March 19, 2026 10:43 AM
> 
> On Thu, Mar 19, 2026 at 02:22:39AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Thursday, March 19, 2026 6:00 AM
> > >
> > > Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
> > > internally while both are calling pci_dev_reset_iommu_prepare/done().
> > >
> > > This results in a nested pci_dev_reset_iommu_prepare/done() call:
> > >     cxl_reset_bus_function():
> > >         pci_dev_reset_iommu_prepare(); // outer
> > >         pci_reset_bus_function():
> > >             pci_dev_reset_iommu_prepare(); // inner (re-entry)
> > >             ...
> > >             pci_dev_reset_iommu_done(); // inner
> > >         pci_dev_reset_iommu_done(); // outer
> > >
> > > However, pci_dev_reset_iommu_prepare() doesn't allow a re-entry for
> now.
> > >
> > > Digging further, I found that most functions in pci_dev_specific_reset()
> > > except reset_ivb_igd() are calling pcie_flr() that has nested those two
> > > functions as well.
> > >
> > > Drop the outer callbacks in:
> > >  - cxl_reset_bus_function()
> > >  - pci_dev_specific_reset()
> > >
> > > Replace with adding the callbacks in:
> > >  + reset_ivb_igd()
> > >
> >
> > I wonder whether it's cleaner to just make pci_dev_reset_iommu_prepare()
> > reentrant here.
> >
> > what this patch does requires additional attention on specific reset
> functions.
> >
> > e.g. reset_hinic_vf_dev() has a special logic waiting for firmware reset
> > done after calling pcie_flr(). With this patch the _done() notification will
> > be triggered earlier than expectation.
> 
> !
> 
> > the original way is more maintainable with prepare/done equipped in
> > the highest level reset helpers in pci_reset_fn_methods[].
> 
> Yes. I agree. I will add a reset_cnt and allow re-entry.
> 
> > btw the AI review tool gave a comment:
> >
> > https://sashiko.dev/#/patchset/20260318220028.1146905-1-
> nicolinc%40nvidia.com
> 
> My browsers on both Linux and Windows blocked this site :-/
> 
> > it won't hold if my proposal is agreed. But it still applies to your quarantine
> > series which relies on accurate report of success reset.
> 
> Would you mind posting it here, if it's not too long?
> 

Here it is:
--
By removing __pci_dev_specific_reset(), the IOMMU preparation error checking
is now delegated to the individual device-specific reset handlers. Will this
cause IOMMU preparation failures to be silently ignored?
Looking at handlers like reset_intel_82599_sfp_virtfn(), the return value of
pcie_flr() is not checked:
static int reset_intel_82599_sfp_virtfn(struct pci_dev *dev, bool probe)
{
	if (!probe)
		pcie_flr(dev);
	return 0;
}
Similarly, delay_250ms_after_flr() ignores the return value when actually
performing the reset:
static int delay_250ms_after_flr(struct pci_dev *dev, bool probe)
{
	if (probe)
		return pcie_reset_flr(dev, PCI_RESET_PROBE);
	pcie_reset_flr(dev, PCI_RESET_DO_RESET);
	msleep(250);
	return 0;
}
If pci_dev_reset_iommu_prepare() fails inside pcie_flr(), the function aborts
the FLR and returns an error. However, since the quirk handlers swallow this
error, the PCI subsystem is falsely told that the device reset succeeded. 
Could this falsely reported success leave sensitive device state uncleared
between host and guest environments? Can we ensure the return value from the
inner reset functions is properly propagated to prevent this?
Re: [PATCH rc] PCI: Fix nested pci_dev_reset_iommu_prepare/done()
Posted by Nicolin Chen 2 weeks, 4 days ago
On Thu, Mar 19, 2026 at 03:00:58AM +0000, Tian, Kevin wrote:
> Looking at handlers like reset_intel_82599_sfp_virtfn(), the return value of
> pcie_flr() is not checked:
> static int reset_intel_82599_sfp_virtfn(struct pci_dev *dev, bool probe)
> {
> 	if (!probe)
> 		pcie_flr(dev);
> 	return 0;
> }
> Similarly, delay_250ms_after_flr() ignores the return value when actually
> performing the reset:
> static int delay_250ms_after_flr(struct pci_dev *dev, bool probe)
> {
> 	if (probe)
> 		return pcie_reset_flr(dev, PCI_RESET_PROBE);
> 	pcie_reset_flr(dev, PCI_RESET_DO_RESET);
> 	msleep(250);
> 	return 0;
> }
> If pci_dev_reset_iommu_prepare() fails inside pcie_flr(), the function aborts
> the FLR and returns an error. However, since the quirk handlers swallow this
> error, the PCI subsystem is falsely told that the device reset succeeded. 
> Could this falsely reported success leave sensitive device state uncleared
> between host and guest environments? Can we ensure the return value from the
> inner reset functions is properly propagated to prevent this?

Yea, looks like we need a preparatory patch for the other series.

Thanks
Nicolin
Re: [PATCH rc] PCI: Fix nested pci_dev_reset_iommu_prepare/done()
Posted by Bjorn Helgaas 2 weeks, 4 days ago
On Wed, Mar 18, 2026 at 03:00:28PM -0700, Nicolin Chen wrote:
> Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
> internally while both are calling pci_dev_reset_iommu_prepare/done().
> 
> This results in a nested pci_dev_reset_iommu_prepare/done() call:
>     cxl_reset_bus_function():
>         pci_dev_reset_iommu_prepare(); // outer
>         pci_reset_bus_function():
>             pci_dev_reset_iommu_prepare(); // inner (re-entry)
>             ...
>             pci_dev_reset_iommu_done(); // inner
>         pci_dev_reset_iommu_done(); // outer
> 
> However, pci_dev_reset_iommu_prepare() doesn't allow a re-entry for now.
> 
> Digging further, I found that most functions in pci_dev_specific_reset()
> except reset_ivb_igd() are calling pcie_flr() that has nested those two
> functions as well.
> 
> Drop the outer callbacks in:
>  - cxl_reset_bus_function()
>  - pci_dev_specific_reset()
> 
> Replace with adding the callbacks in:
>  + reset_ivb_igd()

This needs to say clearly what the issue is.  Deadlock?  Crash?
Corruption?

I guess it's a deadlock when the second call to
pci_dev_reset_iommu_prepare() acquires the group->mutex?

Since f5b16b802174 appeared in v7.0-rc1 (merged by Joerg), I guess
this fix also needs to be in v7.0 -- please confirm.

> Fixes: f5b16b802174 ("PCI: Suspend iommu function prior to resetting a device")
> Cc: stable@vger.kernel.org
> Reported-by: Shuai Xue <xueshuai@linux.alibaba.com>
> Closes: https://lore.kernel.org/all/absKsk7qQOwzhpzv@Asurada-Nvidia/
> Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
> ---
>  drivers/pci/pci.c    |  7 -------
>  drivers/pci/quirks.c | 29 +++++++++++------------------
>  2 files changed, 11 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 8479c2e1f74f..8f1fd083a84c 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4958,12 +4958,6 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
>  	if (rc)
>  		return -ENOTTY;
>  
> -	rc = pci_dev_reset_iommu_prepare(dev);
> -	if (rc) {
> -		pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", rc);
> -		return rc;
> -	}
> -
>  	if (reg & PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR) {
>  		val = reg;
>  	} else {
> @@ -4978,7 +4972,6 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
>  		pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
>  				      reg);
>  
> -	pci_dev_reset_iommu_done(dev);
>  	return rc;
>  }
>  
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 48946cca4be7..d1d210bc2a15 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3973,6 +3973,7 @@ static int reset_ivb_igd(struct pci_dev *dev, bool probe)
>  	void __iomem *mmio_base;
>  	unsigned long timeout;
>  	u32 val;
> +	int ret;
>  
>  	if (probe)
>  		return 0;
> @@ -3981,6 +3982,12 @@ static int reset_ivb_igd(struct pci_dev *dev, bool probe)
>  	if (!mmio_base)
>  		return -ENOMEM;
>  
> +	ret = pci_dev_reset_iommu_prepare(dev);
> +	if (ret) {
> +		pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
> +		goto out_iounmap;
> +	}
> +
>  	iowrite32(0x00000002, mmio_base + MSG_CTL);
>  
>  	/*
> @@ -4006,8 +4013,10 @@ static int reset_ivb_igd(struct pci_dev *dev, bool probe)
>  reset_complete:
>  	iowrite32(0x00000002, mmio_base + NSDE_PWR_STATE);
>  
> +	pci_dev_reset_iommu_done(dev);
> +out_iounmap:
>  	pci_iounmap(dev, mmio_base);
> -	return 0;
> +	return ret;
>  }
>  
>  /* Device-specific reset method for Chelsio T4-based adapters */
> @@ -4257,22 +4266,6 @@ static const struct pci_dev_reset_methods pci_dev_reset_methods[] = {
>  	{ 0 }
>  };
>  
> -static int __pci_dev_specific_reset(struct pci_dev *dev, bool probe,
> -				    const struct pci_dev_reset_methods *i)
> -{
> -	int ret;
> -
> -	ret = pci_dev_reset_iommu_prepare(dev);
> -	if (ret) {
> -		pci_err(dev, "failed to stop IOMMU for a PCI reset: %d\n", ret);
> -		return ret;
> -	}
> -
> -	ret = i->reset(dev, probe);
> -	pci_dev_reset_iommu_done(dev);
> -	return ret;
> -}
> -
>  /*
>   * These device-specific reset methods are here rather than in a driver
>   * because when a host assigns a device to a guest VM, the host may need
> @@ -4287,7 +4280,7 @@ int pci_dev_specific_reset(struct pci_dev *dev, bool probe)
>  		     i->vendor == (u16)PCI_ANY_ID) &&
>  		    (i->device == dev->device ||
>  		     i->device == (u16)PCI_ANY_ID))
> -			return __pci_dev_specific_reset(dev, probe, i);
> +			return i->reset(dev, probe);
>  	}
>  
>  	return -ENOTTY;
> -- 
> 2.34.1
>
Re: [PATCH rc] PCI: Fix nested pci_dev_reset_iommu_prepare/done()
Posted by Nicolin Chen 2 weeks, 4 days ago
Hi Bjorn,

On Wed, Mar 18, 2026 at 05:57:05PM -0500, Bjorn Helgaas wrote:
> On Wed, Mar 18, 2026 at 03:00:28PM -0700, Nicolin Chen wrote:
> > Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
> > internally while both are calling pci_dev_reset_iommu_prepare/done().
> > 
> > This results in a nested pci_dev_reset_iommu_prepare/done() call:
> >     cxl_reset_bus_function():
> >         pci_dev_reset_iommu_prepare(); // outer
> >         pci_reset_bus_function():
> >             pci_dev_reset_iommu_prepare(); // inner (re-entry)
> >             ...
> >             pci_dev_reset_iommu_done(); // inner
> >         pci_dev_reset_iommu_done(); // outer
> > 
> > However, pci_dev_reset_iommu_prepare() doesn't allow a re-entry for now.
> > 
> > Digging further, I found that most functions in pci_dev_specific_reset()
> > except reset_ivb_igd() are calling pcie_flr() that has nested those two
> > functions as well.
> > 
> > Drop the outer callbacks in:
> >  - cxl_reset_bus_function()
> >  - pci_dev_specific_reset()
> > 
> > Replace with adding the callbacks in:
> >  + reset_ivb_igd()
> 
> This needs to say clearly what the issue is.  Deadlock?  Crash?
> Corruption?
> 
> I guess it's a deadlock when the second call to
> pci_dev_reset_iommu_prepare() acquires the group->mutex?

It will trigger a WARN_ON and fail the inner reset function:
	/* Re-entry is not allowed */
	if (WARN_ON(group->resetting_domain))
		return -EBUSY;

> Since f5b16b802174 appeared in v7.0-rc1 (merged by Joerg), I guess
> this fix also needs to be in v7.0 -- please confirm.

Yes. I marked it "PATCH rc", hoping to cover that.

I'll send a v2 explicitly mentioning the WARN_ON and -EBUSY failure.

Nicolin