drivers/iommu/iommu.c | 125 ++++++++++++++++++++++++++++++------------ 1 file changed, 91 insertions(+), 34 deletions(-)
Shuai found that cxl_reset_bus_function() calls pci_reset_bus_function()
internally while both are calling pci_dev_reset_iommu_prepare/done().
As pci_dev_reset_iommu_prepare() doesn't support re-entry, the inner call
will trigger a WARN_ON and return -EBUSY, resulting in failing the entire
device reset.
On the other hand, removing the outer calls in the PCI callers is unsafe.
As pointed out by Kevin, device-specific quirks like reset_hinic_vf_dev()
execute custom firmware waits after their inner pcie_flr() completes. If
the IOMMU protection relies solely on the inner reset, the IOMMU will be
unblocked prematurely while the device is still resetting.
Instead, fix this by making pci_dev_reset_iommu_prepare/done() reentrant.
Given the IOMMU core tracks the resetting state per iommu_group while the
reset is per device, this has to track at the group_device level as well.
Introduce a 'reset_depth' to struct group_device to handle the re-entries
on the same device. This allows multi-device groups to isolate concurrent
device resets independently.
Note that iommu_deferred_attach() and iommu_driver_get_domain_for_dev()
both now check gdev->reset_depth (per-device) instead of a per-group flag
like "group->resetting_domain". This is actually more precise.
As the reset routine is per gdev, it cannot clear group->resetting_domain
without iterating over the device list to ensure no other device is being
reset. Simplify it by replacing the resetting_domain with a 'recovery_cnt'
in the struct iommu_group.
Since both helpers are now per gdev, call the per-device set_dev_pasid op
to recover PASID domains.
While fixing the bug, also fix the kdoc for pci_dev_reset_iommu_done().
Fixes: c279e83953d9 ("iommu: Introduce pci_dev_reset_iommu_prepare/done()")
Cc: stable@vger.kernel.org
Reported-by: Shuai Xue <xueshuai@linux.alibaba.com>
Closes: https://lore.kernel.org/all/absKsk7qQOwzhpzv@Asurada-Nvidia/
Suggested-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
Changelog
v4:
* Rename 'reset_cnt' to 'recovery_cnt'
v3:
https://lore.kernel.org/all/20260321223930.10836-1-nicolinc@nvidia.com/
* Turn prepare()/done() to be per-gdev
* Use reset_depth to track nested re-entries
* Replace group->resetting_domain with a reset_cnt
v2:
https://lore.kernel.org/all/20260319043135.1153534-1-nicolinc@nvidia.com/
* Fix in the helpers by allowing re-entry
v1:
https://lore.kernel.org/all/20260318220028.1146905-1-nicolinc@nvidia.com/
drivers/iommu/iommu.c | 125 ++++++++++++++++++++++++++++++------------
1 file changed, 91 insertions(+), 34 deletions(-)
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 35db517809540..fbaf4fdd570a7 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -61,14 +61,14 @@ struct iommu_group {
int id;
struct iommu_domain *default_domain;
struct iommu_domain *blocking_domain;
- /*
- * During a group device reset, @resetting_domain points to the physical
- * domain, while @domain points to the attached domain before the reset.
- */
- struct iommu_domain *resetting_domain;
struct iommu_domain *domain;
struct list_head entry;
unsigned int owner_cnt;
+ /*
+ * Number of devices in the group undergoing or awaiting recovery.
+ * If non-zero, concurrent domain attachments are rejected.
+ */
+ unsigned int recovery_cnt;
void *owner;
};
@@ -76,12 +76,27 @@ struct group_device {
struct list_head list;
struct device *dev;
char *name;
+ unsigned int reset_depth;
};
/* Iterate over each struct group_device in a struct iommu_group */
#define for_each_group_device(group, pos) \
list_for_each_entry(pos, &(group)->devices, list)
+static struct group_device *__dev_to_gdev(struct device *dev)
+{
+ struct iommu_group *group = dev->iommu_group;
+ struct group_device *gdev;
+
+ lockdep_assert_held(&group->mutex);
+
+ for_each_group_device(group, gdev) {
+ if (gdev->dev == dev)
+ return gdev;
+ }
+ return NULL;
+}
+
struct iommu_group_attribute {
struct attribute attr;
ssize_t (*show)(struct iommu_group *group, char *buf);
@@ -2191,6 +2206,8 @@ EXPORT_SYMBOL_GPL(iommu_attach_device);
int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
{
+ struct group_device *gdev;
+
/*
* This is called on the dma mapping fast path so avoid locking. This is
* racy, but we have an expectation that the driver will setup its DMAs
@@ -2201,6 +2218,9 @@ int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
guard(mutex)(&dev->iommu_group->mutex);
+ gdev = __dev_to_gdev(dev);
+ if (WARN_ON(!gdev))
+ return -ENODEV;
/*
* This is a concurrent attach during a device reset. Reject it until
* pci_dev_reset_iommu_done() attaches the device to group->domain.
@@ -2208,7 +2228,7 @@ int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
* Note that this might fail the iommu_dma_map(). But there's nothing
* more we can do here.
*/
- if (dev->iommu_group->resetting_domain)
+ if (gdev->reset_depth)
return -EBUSY;
return __iommu_attach_device(domain, dev, NULL);
}
@@ -2265,19 +2285,23 @@ EXPORT_SYMBOL_GPL(iommu_get_domain_for_dev);
struct iommu_domain *iommu_driver_get_domain_for_dev(struct device *dev)
{
struct iommu_group *group = dev->iommu_group;
+ struct group_device *gdev;
lockdep_assert_held(&group->mutex);
+ gdev = __dev_to_gdev(dev);
+ if (WARN_ON(!gdev))
+ return NULL;
/*
* Driver handles the low-level __iommu_attach_device(), including the
* one invoked by pci_dev_reset_iommu_done() re-attaching the device to
* the cached group->domain. In this case, the driver must get the old
- * domain from group->resetting_domain rather than group->domain. This
+ * domain from group->blocking_domain rather than group->domain. This
* prevents it from re-attaching the device from group->domain (old) to
* group->domain (new).
*/
- if (group->resetting_domain)
- return group->resetting_domain;
+ if (gdev->reset_depth)
+ return group->blocking_domain;
return group->domain;
}
@@ -2436,10 +2460,10 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
return -EINVAL;
/*
- * This is a concurrent attach during a device reset. Reject it until
+ * This is a concurrent attach during device recovery. Reject it until
* pci_dev_reset_iommu_done() attaches the device to group->domain.
*/
- if (group->resetting_domain)
+ if (group->recovery_cnt)
return -EBUSY;
/*
@@ -3567,10 +3591,10 @@ int iommu_attach_device_pasid(struct iommu_domain *domain,
mutex_lock(&group->mutex);
/*
- * This is a concurrent attach during a device reset. Reject it until
+ * This is a concurrent attach during device recovery. Reject it until
* pci_dev_reset_iommu_done() attaches the device to group->domain.
*/
- if (group->resetting_domain) {
+ if (group->recovery_cnt) {
ret = -EBUSY;
goto out_unlock;
}
@@ -3660,10 +3684,10 @@ int iommu_replace_device_pasid(struct iommu_domain *domain,
mutex_lock(&group->mutex);
/*
- * This is a concurrent attach during a device reset. Reject it until
+ * This is a concurrent attach during device recovery. Reject it until
* pci_dev_reset_iommu_done() attaches the device to group->domain.
*/
- if (group->resetting_domain) {
+ if (group->recovery_cnt) {
ret = -EBUSY;
goto out_unlock;
}
@@ -3934,12 +3958,12 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
* routine wants to block any IOMMU activity: translation and ATS invalidation.
*
* This function attaches the device's RID/PASID(s) the group->blocking_domain,
- * setting the group->resetting_domain. This allows the IOMMU driver pausing any
+ * incrementing the group->recovery_cnt, to allow the IOMMU driver pausing any
* IOMMU activity while leaving the group->domain pointer intact. Later when the
* reset is finished, pci_dev_reset_iommu_done() can restore everything.
*
* Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
- * before/after the core-level reset routine, to unset the resetting_domain.
+ * before/after the core-level reset routine, to decrement the recovery_cnt.
*
* Return: 0 on success or negative error code if the preparation failed.
*
@@ -3952,6 +3976,7 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
{
struct iommu_group *group = pdev->dev.iommu_group;
+ struct group_device *gdev;
unsigned long pasid;
void *entry;
int ret;
@@ -3961,20 +3986,23 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
guard(mutex)(&group->mutex);
- /* Re-entry is not allowed */
- if (WARN_ON(group->resetting_domain))
- return -EBUSY;
+ gdev = __dev_to_gdev(&pdev->dev);
+ if (WARN_ON(!gdev))
+ return -ENODEV;
+
+ if (gdev->reset_depth++)
+ return 0;
ret = __iommu_group_alloc_blocking_domain(group);
if (ret)
- return ret;
+ goto err_depth;
/* Stage RID domain at blocking_domain while retaining group->domain */
if (group->domain != group->blocking_domain) {
ret = __iommu_attach_device(group->blocking_domain, &pdev->dev,
group->domain);
if (ret)
- return ret;
+ goto err_depth;
}
/*
@@ -3987,7 +4015,11 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
iommu_remove_dev_pasid(&pdev->dev, pasid,
pasid_array_entry_to_domain(entry));
- group->resetting_domain = group->blocking_domain;
+ group->recovery_cnt++;
+ return ret;
+
+err_depth:
+ gdev->reset_depth--;
return ret;
}
EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
@@ -3997,9 +4029,9 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
* @pdev: PCI device that has finished a reset routine
*
* After a PCIe device finishes a reset routine, it wants to restore its IOMMU
- * IOMMU activity, including new translation as well as cache invalidation, by
- * re-attaching all RID/PASID of the device's back to the domains retained in
- * the core-level structure.
+ * activity, including new translation and cache invalidation, by re-attaching
+ * all RID/PASID of the device back to the domains retained in the core-level
+ * structure.
*
* Caller must pair it with a successful pci_dev_reset_iommu_prepare().
*
@@ -4009,6 +4041,7 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
void pci_dev_reset_iommu_done(struct pci_dev *pdev)
{
struct iommu_group *group = pdev->dev.iommu_group;
+ struct group_device *gdev;
unsigned long pasid;
void *entry;
@@ -4017,11 +4050,27 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
guard(mutex)(&group->mutex);
- /* pci_dev_reset_iommu_prepare() was bypassed for the device */
- if (!group->resetting_domain)
+ gdev = __dev_to_gdev(&pdev->dev);
+ if (WARN_ON(!gdev))
return;
- /* pci_dev_reset_iommu_prepare() was not successfully called */
+ /* Unbalanced done() calls would underflow the counter */
+ if (WARN_ON(gdev->reset_depth == 0))
+ return;
+ /*
+ * Do not decrement reset_depth=1 because other functions can still rely
+ * on it, e.g. iommu_driver_get_domain_for_dev().
+ */
+ if (gdev->reset_depth > 1) {
+ gdev->reset_depth--;
+ return;
+ }
+
+ /*
+ * Leave the counters intact to keep the core state aligned with the HW
+ * state. iommu_driver_get_domain_for_dev() still picks blocking_domain.
+ * This is an unrecoverable state - the device remains blocked.
+ */
if (WARN_ON(!group->blocking_domain))
return;
@@ -4037,12 +4086,20 @@ void pci_dev_reset_iommu_done(struct pci_dev *pdev)
* The pasid_array is mostly fenced by group->mutex, except one reader
* in iommu_attach_handle_get(), so it's safe to read without xa_lock.
*/
- xa_for_each_start(&group->pasid_array, pasid, entry, 1)
- WARN_ON(__iommu_set_group_pasid(
- pasid_array_entry_to_domain(entry), group, pasid,
- group->blocking_domain));
+ if (pdev->dev.iommu->max_pasids > 0) {
+ xa_for_each_start(&group->pasid_array, pasid, entry, 1) {
+ struct iommu_domain *pasid_dom =
+ pasid_array_entry_to_domain(entry);
+
+ WARN_ON(pasid_dom->ops->set_dev_pasid(
+ pasid_dom, &pdev->dev, pasid,
+ group->blocking_domain));
+ }
+ }
- group->resetting_domain = NULL;
+ if (!WARN_ON(group->recovery_cnt == 0))
+ group->recovery_cnt--;
+ gdev->reset_depth = 0;
}
EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_done);
--
2.43.0
Hi, Nicolin,
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, March 24, 2026 9:41 AM
>
> @@ -3961,20 +3986,23 @@ int pci_dev_reset_iommu_prepare(struct
> pci_dev *pdev)
>
> guard(mutex)(&group->mutex);
>
> - /* Re-entry is not allowed */
> - if (WARN_ON(group->resetting_domain))
> - return -EBUSY;
> + gdev = __dev_to_gdev(&pdev->dev);
> + if (WARN_ON(!gdev))
> + return -ENODEV;
> +
> + if (gdev->reset_depth++)
> + return 0;
>
> ret = __iommu_group_alloc_blocking_domain(group);
> if (ret)
> - return ret;
> + goto err_depth;
>
> /* Stage RID domain at blocking_domain while retaining group-
> >domain */
> if (group->domain != group->blocking_domain) {
> ret = __iommu_attach_device(group->blocking_domain,
> &pdev->dev,
> group->domain);
> if (ret)
> - return ret;
> + goto err_depth;
> }
>
> /*
> @@ -3987,7 +4015,11 @@ int pci_dev_reset_iommu_prepare(struct pci_dev
> *pdev)
> iommu_remove_dev_pasid(&pdev->dev, pasid,
> pasid_array_entry_to_domain(entry));
>
> - group->resetting_domain = group->blocking_domain;
> + group->recovery_cnt++;
> + return ret;
> +
> +err_depth:
> + gdev->reset_depth--;
> return ret;
> }
I tried to understand how this change interacts with the use of
iommu_driver_get_domain_for_dev() in arm smmu driver, but in the
end got confusing. Better having Jason take a look.
so far there are only two uses:
1) in arm_smmu_set_pasid():
- if cd is not enabled and sid_domain is not identity or blocked, disallow
set_pasid() probably due to the fact that enabling cd in ste could
break in-fly DMAs if they are remapped
- otherwise, enable cd in ste dynamically and set S1DSS according to
identity or blocked
2) in arm_smmu_blocking_set_dev_pasid():
- if the last user of cd table goes away and sid_domain is identity or
blocked, ste is downgraded to a non-cd format by re-attaching to
the sid_domain
Now a device is being reset. No parallel attach_device or set_dev_pasid
requests are allowed outside of iommu core itself. so the only invocation
exists in pci_dev_reset_iommu_prepare() and pci_dev_reset_iommu_done().
let's assume the device being reset has pasid usage, i.e. ste is in cd format.
In pci_dev_reset_iommu_prepare():
if (gdev->reset_depth++)
return 0;
// now iommu_driver_get_domain_for_dev() returns blocking domain
xa_for_each_start(&group->pasid_array, pasid, entry, 1)
iommu_remove_dev_pasid(&pdev->dev, pasid,
pasid_array_entry_to_domain(entry));
arm_smmu_blocking_set_dev_pasid() sees that sid_domain is blocking then
the last iommu_remove_dev_pasid() leads to cd-format ste downgraded to
a non-cd one.
In pci_dev_reset_iommu_done():
if (gdev->reset_depth > 1) {
gdev->reset_depth--;
return;
}
// restore valid pasid domains
gdev->reset_depth = 0;
Upon the first pasid, arm_smmu_set_pasid() sees that cd is not enabled and
sid_domain is blocking then upgrade the non-cd ste to a cd ste.
this is fine.
but then I wonder what is the real value of iommu_driver_get_domain_for_dev().
w/o it, above flow just skips cd downgrade/upgrade in case group->domain is
blocked or identity. No correctness issue.
But with it there might be corner cases to pay attention in case iommu
driver calls it in other places. e.g. Sashiko [1] pointed out one potential issue:
"
Incrementing gdev->reset_depth at the beginning of the function prematurely
exposes the blocking_domain to the driver. When __iommu_attach_device() is
called above, if the driver's attach_dev callback uses
iommu_driver_get_domain_for_dev() to query its current domain, it will observe
reset_depth == 1 and receive the blocking_domain even though it is not yet
attached to it.
Could reset_depth be set to 1 only after a successful attachment, while nested
calls still check for gdev->reset_depth > 0 at the beginning?
"
and the code comment in iommu_driver_get_domain_for_dev() says:
/*
* Driver handles the low-level __iommu_attach_device(), including the
* one invoked by pci_dev_reset_iommu_done() re-attaching the device to
* the cached group->domain. In this case, the driver must get the old
* domain from group->resetting_domain rather than group->domain. This
* prevents it from re-attaching the device from group->domain (old) to
* group->domain (new).
*/
It is not related to set_dev_pasid() at all. Likely I missed the real reason
of adding it?
[1] https://sashiko.dev/#/patchset/20260324014056.36103-1-nicolinc%40nvidia.com
On Thu, Apr 02, 2026 at 07:34:24AM +0000, Tian, Kevin wrote:
> But with it there might be corner cases to pay attention in case iommu
> driver calls it in other places. e.g. Sashiko [1] pointed out one potential issue:
>
> "
> Incrementing gdev->reset_depth at the beginning of the function prematurely
> exposes the blocking_domain to the driver. When __iommu_attach_device() is
> called above, if the driver's attach_dev callback uses
> iommu_driver_get_domain_for_dev() to query its current domain, it will observe
> reset_depth == 1 and receive the blocking_domain even though it is not yet
> attached to it.
> Could reset_depth be set to 1 only after a successful attachment, while nested
> calls still check for gdev->reset_depth > 0 at the beginning?
> "
Ah, that seems a good catch.
I noticed the same problem in done() but overlooked prepare().
Yea, it should do something likewise:
if (gdev->reset_depth) {
gdev->reset_depth++);
return 0;
}
...
attach(blocking_domain);
set_pasid(blocking_domain);
gdev->reset_depth = 1;
Nicolin
© 2016 - 2026 Red Hat, Inc.