[v5] Disable ATS via iommu during PCI resets

[PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()

Posted by Nicolin Chen 2 months, 4 weeks ago

PCIe permits a device to ignore ATS invalidation TLPs, while processing a
reset. This creates a problem visible to the OS where an ATS invalidation
command will time out. E.g. an SVA domain will have no coordination with a
reset event and can racily issue ATS invalidations to a resetting device.

The OS should do something to mitigate this as we do not want production
systems to be reporting critical ATS failures, especially in a hypervisor
environment. Broadly, OS could arrange to ignore the timeouts, block page
table mutations to prevent invalidations, or disable and block ATS.

The PCIe spec in sec 10.3.1 IMPLEMENTATION NOTE recommends to disable and
block ATS before initiating a Function Level Reset. It also mentions that
other reset methods could have the same vulnerability as well.

Provide a callback from the PCI subsystem that will enclose the reset and
have the iommu core temporarily change all the attached domain to BLOCKED.
After attaching a BLOCKED domain, IOMMU hardware would fence any incoming
ATS queries. And IOMMU drivers should also synchronously stop issuing new
ATS invalidations and wait for all ATS invalidations to complete. This can
avoid any ATS invaliation timeouts.

However, if there is a domain attachment/replacement happening during an
ongoing reset, ATS routines may be re-activated between the two function
calls. So, introduce a new resetting_domain in the iommu_group structure
to reject any concurrent attach_dev/set_dev_pasid call during a reset for
a concern of compatibility failure. Since this changes the behavior of an
attach operation, update the uAPI accordingly.

Note that there are two corner cases:
 1. Devices in the same iommu_group
    Since an attachment is always per iommu_group, disallowing one device
    to switch domains (or HWPTs in iommufd) would have to disallow others
    in the same iommu_group to switch domains as well. So, play safe by
    preventing a shared iommu_group from going through the iommu reset.
 2. SRIOV devices that its PF is resetting while its VF isn't
    In such case, the VF itself is already broken. So, there is no point
    in preventing PF from going through the iommu reset.

Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 include/linux/iommu.h     |  12 +++
 include/uapi/linux/vfio.h |   3 +
 drivers/iommu/iommu.c     | 183 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 198 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index a42a2d1d7a0b7..25a2c2b00c9f7 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -1169,6 +1169,9 @@ void dev_iommu_priv_set(struct device *dev, void *priv);
 extern struct mutex iommu_probe_device_lock;
 int iommu_probe_device(struct device *dev);
 
+int iommu_dev_reset_prepare(struct device *dev);
+void iommu_dev_reset_done(struct device *dev);
+
 int iommu_device_use_default_domain(struct device *dev);
 void iommu_device_unuse_default_domain(struct device *dev);
 
@@ -1453,6 +1456,15 @@ static inline int iommu_fwspec_add_ids(struct device *dev, u32 *ids,
 	return -ENODEV;
 }
 
+static inline int iommu_dev_reset_prepare(struct device *dev)
+{
+	return 0;
+}
+
+static inline void iommu_dev_reset_done(struct device *dev)
+{
+}
+
 static inline struct iommu_fwspec *dev_iommu_fwspec_get(struct device *dev)
 {
 	return NULL;
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 75100bf009baf..6cc9d2709d13a 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -963,6 +963,9 @@ struct vfio_device_bind_iommufd {
  * hwpt corresponding to the given pt_id.
  *
  * Return: 0 on success, -errno on failure.
+ *
+ * When a device gets reset, any attach will be rejected with -EBUSY until that
+ * reset routine finishes.
  */
 struct vfio_device_attach_iommufd_pt {
 	__u32	argsz;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 1f4d6ca0937bc..74b9f2bfc0458 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -61,6 +61,11 @@ struct iommu_group {
 	int id;
 	struct iommu_domain *default_domain;
 	struct iommu_domain *blocking_domain;
+	/*
+	 * During a group device reset, @resetting_domain points to the physical
+	 * domain, while @domain points to the attached domain before the reset.
+	 */
+	struct iommu_domain *resetting_domain;
 	struct iommu_domain *domain;
 	struct list_head entry;
 	unsigned int owner_cnt;
@@ -2195,6 +2200,12 @@ int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain)
 
 	guard(mutex)(&dev->iommu_group->mutex);
 
+	/*
+	 * This is a concurrent attach while a group device is resetting. Reject
+	 * it until iommu_dev_reset_done() attaches the device to group->domain.
+	 */
+	if (dev->iommu_group->resetting_domain)
+		return -EBUSY;
 	return __iommu_attach_device(domain, dev, NULL);
 }
 
@@ -2253,6 +2264,16 @@ struct iommu_domain *iommu_driver_get_domain_for_dev(struct device *dev)
 
 	lockdep_assert_held(&group->mutex);
 
+	/*
+	 * Driver handles the low-level __iommu_attach_device(), including the
+	 * one invoked by iommu_dev_reset_done(), in which case the driver must
+	 * get the resetting_domain over group->domain caching the one prior to
+	 * iommu_dev_reset_prepare(), so that it wouldn't end up with attaching
+	 * the device from group->domain (old) to group->domain (new).
+	 */
+	if (group->resetting_domain)
+		return group->resetting_domain;
+
 	return group->domain;
 }
 EXPORT_SYMBOL_GPL(iommu_driver_get_domain_for_dev);
@@ -2409,6 +2430,13 @@ static int __iommu_group_set_domain_internal(struct iommu_group *group,
 	if (WARN_ON(!new_domain))
 		return -EINVAL;
 
+	/*
+	 * This is a concurrent attach while a group device is resetting. Reject
+	 * it until iommu_dev_reset_done() attaches the device to group->domain.
+	 */
+	if (group->resetting_domain)
+		return -EBUSY;
+
 	/*
 	 * Changing the domain is done by calling attach_dev() on the new
 	 * domain. This switch does not have to be atomic and DMA can be
@@ -3527,6 +3555,16 @@ int iommu_attach_device_pasid(struct iommu_domain *domain,
 		return -EINVAL;
 
 	mutex_lock(&group->mutex);
+
+	/*
+	 * This is a concurrent attach while a group device is resetting. Reject
+	 * it until iommu_dev_reset_done() attaches the device to group->domain.
+	 */
+	if (group->resetting_domain) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	for_each_group_device(group, device) {
 		/*
 		 * Skip PASID validation for devices without PASID support
@@ -3610,6 +3648,16 @@ int iommu_replace_device_pasid(struct iommu_domain *domain,
 		return -EINVAL;
 
 	mutex_lock(&group->mutex);
+
+	/*
+	 * This is a concurrent attach while a group device is resetting. Reject
+	 * it until iommu_dev_reset_done() attaches the device to group->domain.
+	 */
+	if (group->resetting_domain) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
 	entry = iommu_make_pasid_array_entry(domain, handle);
 	curr = xa_cmpxchg(&group->pasid_array, pasid, NULL,
 			  XA_ZERO_ENTRY, GFP_KERNEL);
@@ -3867,6 +3915,141 @@ int iommu_replace_group_handle(struct iommu_group *group,
 }
 EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
 
+/**
+ * iommu_dev_reset_prepare() - Block IOMMU to prepare for a device reset
+ * @dev: device that is going to enter a reset routine
+ *
+ * When certain device is entering a reset routine, it wants to block any IOMMU
+ * activity during the reset routine. This includes blocking any translation as
+ * well as cache invalidation (especially the device cache).
+ *
+ * This function attaches all RID/PASID of the device's to IOMMU_DOMAIN_BLOCKED
+ * allowing any blocked-domain-supporting IOMMU driver to pause translation and
+ * cahce invalidation, but leaves the software domain pointers intact so later
+ * the iommu_dev_reset_done() can restore everything.
+ *
+ * Return: 0 on success or negative error code if the preparation failed.
+ *
+ * Caller must use iommu_dev_reset_prepare() and iommu_dev_reset_done() together
+ * before/after the core-level reset routine, to unset the resetting_domain.
+ *
+ * These two functions are designed to be used by PCI reset functions that would
+ * not invoke any racy iommu_release_device(), since PCI sysfs node gets removed
+ * before it notifies with a BUS_NOTIFY_REMOVED_DEVICE. When using them in other
+ * case, callers must ensure there will be no racy iommu_release_device() call,
+ * which otherwise would UAF the dev->iommu_group pointer.
+ */
+int iommu_dev_reset_prepare(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	unsigned long pasid;
+	void *entry;
+	int ret = 0;
+
+	if (!dev_has_iommu(dev))
+		return 0;
+
+	guard(mutex)(&group->mutex);
+
+	/*
+	 * Once the resetting_domain is set, any concurrent attachment to this
+	 * iommu_group will be rejected, which would break the attach routines
+	 * of the sibling devices in the same iommu_group. So, skip this case.
+	 */
+	if (dev_is_pci(dev)) {
+		struct group_device *gdev;
+
+		for_each_group_device(group, gdev) {
+			if (gdev->dev != dev)
+				return 0;
+		}
+	}
+
+	/* Re-entry is not allowed */
+	if (WARN_ON(group->resetting_domain))
+		return -EBUSY;
+
+	ret = __iommu_group_alloc_blocking_domain(group);
+	if (ret)
+		return ret;
+
+	/* Stage RID domain at blocking_domain while retaining group->domain */
+	if (group->domain != group->blocking_domain) {
+		ret = __iommu_attach_device(group->blocking_domain, dev,
+					    group->domain);
+		if (ret)
+			return ret;
+	}
+
+	/*
+	 * Stage PASID domains at blocking_domain while retaining pasid_array.
+	 *
+	 * The pasid_array is mostly fenced by group->mutex, except one reader
+	 * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
+	 */
+	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+		iommu_remove_dev_pasid(dev, pasid,
+				       pasid_array_entry_to_domain(entry));
+
+	group->resetting_domain = group->blocking_domain;
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_dev_reset_prepare);
+
+/**
+ * iommu_dev_reset_done() - Restore IOMMU after a device reset is finished
+ * @dev: device that has finished a reset routine
+ *
+ * When certain device has finished a reset routine, it wants to restore its
+ * IOMMU activity, including new translation as well as cache invalidation, by
+ * re-attaching all RID/PASID of the device's back to the domains retained in
+ * the core-level structure.
+ *
+ * Caller must pair it with a successfully returned iommu_dev_reset_prepare().
+ *
+ * Note that, although unlikely, there is a risk that re-attaching domains might
+ * fail due to some unexpected happening like OOM.
+ */
+void iommu_dev_reset_done(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	unsigned long pasid;
+	void *entry;
+
+	if (!dev_has_iommu(dev))
+		return;
+
+	guard(mutex)(&group->mutex);
+
+	/* iommu_dev_reset_prepare() was bypassed for the device */
+	if (!group->resetting_domain)
+		return;
+
+	/* iommu_dev_reset_prepare() was not successfully called */
+	if (WARN_ON(!group->blocking_domain))
+		return;
+
+	/* Re-attach RID domain back to group->domain */
+	if (group->domain != group->blocking_domain) {
+		WARN_ON(__iommu_attach_device(group->domain, dev,
+					      group->blocking_domain));
+	}
+
+	/*
+	 * Re-attach PASID domains back to the domains retained in pasid_array.
+	 *
+	 * The pasid_array is mostly fenced by group->mutex, except one reader
+	 * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
+	 */
+	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
+		WARN_ON(__iommu_set_group_pasid(
+			pasid_array_entry_to_domain(entry), group, pasid,
+			group->blocking_domain));
+
+	group->resetting_domain = NULL;
+}
+EXPORT_SYMBOL_GPL(iommu_dev_reset_done);
+
 #if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
 /**
  * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain
-- 
2.43.0

Re: [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()

Posted by Bjorn Helgaas 2 months, 3 weeks ago

On Mon, Nov 10, 2025 at 09:12:54PM -0800, Nicolin Chen wrote:
> PCIe permits a device to ignore ATS invalidation TLPs, while processing a
> reset. This creates a problem visible to the OS where an ATS invalidation
> command will time out. E.g. an SVA domain will have no coordination with a
> reset event and can racily issue ATS invalidations to a resetting device.

s/TLPs, while/TLPs while/

> The OS should do something to mitigate this as we do not want production
> systems to be reporting critical ATS failures, especially in a hypervisor
> environment. Broadly, OS could arrange to ignore the timeouts, block page
> table mutations to prevent invalidations, or disable and block ATS.
> 
> The PCIe spec in sec 10.3.1 IMPLEMENTATION NOTE recommends to disable and
> block ATS before initiating a Function Level Reset. It also mentions that
> other reset methods could have the same vulnerability as well.
> 
> Provide a callback from the PCI subsystem that will enclose the reset and
> have the iommu core temporarily change all the attached domain to BLOCKED.
> After attaching a BLOCKED domain, IOMMU hardware would fence any incoming
> ATS queries. And IOMMU drivers should also synchronously stop issuing new
> ATS invalidations and wait for all ATS invalidations to complete. This can
> avoid any ATS invaliation timeouts.
> 
> However, if there is a domain attachment/replacement happening during an
> ongoing reset, ATS routines may be re-activated between the two function
> calls. So, introduce a new resetting_domain in the iommu_group structure
> to reject any concurrent attach_dev/set_dev_pasid call during a reset for
> a concern of compatibility failure. Since this changes the behavior of an
> attach operation, update the uAPI accordingly.
> 
> Note that there are two corner cases:
>  1. Devices in the same iommu_group
>     Since an attachment is always per iommu_group, disallowing one device
>     to switch domains (or HWPTs in iommufd) would have to disallow others
>     in the same iommu_group to switch domains as well. So, play safe by
>     preventing a shared iommu_group from going through the iommu reset.
>  2. SRIOV devices that its PF is resetting while its VF isn't

Slightly awkward.  Maybe:

  2. An SR-IOV PF that is being reset while its VF is not

(Obviously resetting a PF destroys all the VFs, which I guess is what
you're hinting at below.)

>     In such case, the VF itself is already broken. So, there is no point
>     in preventing PF from going through the iommu reset.

> + * iommu_dev_reset_prepare() - Block IOMMU to prepare for a device reset
> + * @dev: device that is going to enter a reset routine
> + *
> + * When certain device is entering a reset routine, it wants to block any IOMMU
> + * activity during the reset routine. This includes blocking any translation as
> + * well as cache invalidation (especially the device cache).
> + *
> + * This function attaches all RID/PASID of the device's to IOMMU_DOMAIN_BLOCKED
> + * allowing any blocked-domain-supporting IOMMU driver to pause translation and
> + * cahce invalidation, but leaves the software domain pointers intact so later

s/cahce/cache/

RE: [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()

Posted by Tian, Kevin 2 months, 3 weeks ago

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, November 11, 2025 1:13 PM
> 
> + *
> + * This function attaches all RID/PASID of the device's to
> IOMMU_DOMAIN_BLOCKED
> + * allowing any blocked-domain-supporting IOMMU driver to pause
> translation and

__iommu_group_alloc_blocking_domain() will allocate a paging
domain if a driver doesn't support blocked_domain itself. So in the
end this applies to all IOMMU drivers.

I saw several other places mention IOMMU_DOMAIN_BLOCKED in this
series. Not very accurate.

Re: [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()

Posted by Nicolin Chen 2 months, 3 weeks ago

On Mon, Nov 17, 2025 at 04:59:33AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, November 11, 2025 1:13 PM
> > 
> > + *
> > + * This function attaches all RID/PASID of the device's to
> > IOMMU_DOMAIN_BLOCKED
> > + * allowing any blocked-domain-supporting IOMMU driver to pause
> > translation and
> 
> __iommu_group_alloc_blocking_domain() will allocate a paging
> domain if a driver doesn't support blocked_domain itself. So in the
> end this applies to all IOMMU drivers.
> 
> I saw several other places mention IOMMU_DOMAIN_BLOCKED in this
> series. Not very accurate.

OK. I will replace that with "group->blocked_domain"

Nicolin

RE: [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()

Posted by Tian, Kevin 2 months, 3 weeks ago

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, November 11, 2025 1:13 PM
> 
> Note that there are two corner cases:
>  1. Devices in the same iommu_group
>     Since an attachment is always per iommu_group, disallowing one device
>     to switch domains (or HWPTs in iommufd) would have to disallow others
>     in the same iommu_group to switch domains as well. So, play safe by
>     preventing a shared iommu_group from going through the iommu reset.

It'd be good to make clear that 'preventing' means that the racing problem
is not addressed.

> +	/*
> +	 * During a group device reset, @resetting_domain points to the
> physical
> +	 * domain, while @domain points to the attached domain before the
> reset.
> +	 */
> +	struct iommu_domain *resetting_domain;

'a group device' is a bit confusing. Just remove 'group'?

> @@ -2195,6 +2200,12 @@ int iommu_deferred_attach(struct device *dev,
> struct iommu_domain *domain)
> 
>  	guard(mutex)(&dev->iommu_group->mutex);
> 
> +	/*
> +	 * This is a concurrent attach while a group device is resetting. Reject
> +	 * it until iommu_dev_reset_done() attaches the device to group-
> >domain.
> +	 */
> +	if (dev->iommu_group->resetting_domain)
> +		return -EBUSY;

It might be worth noting that failing a deferred attach leads to failing
the dma map operation. It's different from other explicit attaching paths,
but there is nothing more we can do here.


> @@ -2253,6 +2264,16 @@ struct iommu_domain
> *iommu_driver_get_domain_for_dev(struct device *dev)
> 
>  	lockdep_assert_held(&group->mutex);
> 
> +	/*
> +	 * Driver handles the low-level __iommu_attach_device(), including
> the
> +	 * one invoked by iommu_dev_reset_done(), in which case the driver
> must
> +	 * get the resetting_domain over group->domain caching the one
> prior to
> +	 * iommu_dev_reset_prepare(), so that it wouldn't end up with
> attaching
> +	 * the device from group->domain (old) to group->domain (new).
> +	 */
> +	if (group->resetting_domain)
> +		return group->resetting_domain;

It's a pretty long sentence. Let's break it.

> +int iommu_dev_reset_prepare(struct device *dev)

If this is intended to be used by pci for now, it's clearer to have a 'pci'
word in the name. Later when there is a demand calling it from other
buses, discussion will catch eyes to ensure no racy of UAF etc.

> +	/*
> +	 * Once the resetting_domain is set, any concurrent attachment to
> this
> +	 * iommu_group will be rejected, which would break the attach
> routines
> +	 * of the sibling devices in the same iommu_group. So, skip this case.
> +	 */
> +	if (dev_is_pci(dev)) {
> +		struct group_device *gdev;
> +
> +		for_each_group_device(group, gdev) {
> +			if (gdev->dev != dev)
> +				return 0;
> +		}
> +	}

btw what'd be a real impact to reject concurrent attachment for sibling
devices? This series already documents the impact in uAPI for the device
under attachment, and the userspace already knows the restriction 
of devices in the group which must be attached to a same hwpt.

Combining those knowledge I don't think there is a problem for 
userspace to be aware of that resetting a device in a multi-dev
group affects concurrent attachment of sibling devices...

> +	/* Re-attach RID domain back to group->domain */
> +	if (group->domain != group->blocking_domain) {
> +		WARN_ON(__iommu_attach_device(group->domain, dev,
> +					      group->blocking_domain));
> +	}

Even if we disallow resetting on a multi-dev group, there is still a
corner case not taken care here.

It's possible that there is only one device in the group at prepare,
coming with a device hotplug added to the group in the middle,
then doing reset_done.

In this case the newly-added device will inherit the blocking domain.

Then reset_done should loop all devices in the group and re-attach
all of them to the cached domain.

Re: [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()

Posted by Nicolin Chen 2 months, 3 weeks ago

On Fri, Nov 14, 2025 at 09:37:27AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > @@ -2195,6 +2200,12 @@ int iommu_deferred_attach(struct device *dev,
> > struct iommu_domain *domain)
> > 
> >  	guard(mutex)(&dev->iommu_group->mutex);
> > 
> > +	/*
> > +	 * This is a concurrent attach while a group device is resetting. Reject
> > +	 * it until iommu_dev_reset_done() attaches the device to group-
> > >domain.
> > +	 */
> > +	if (dev->iommu_group->resetting_domain)
> > +		return -EBUSY;
> 
> It might be worth noting that failing a deferred attach leads to failing
> the dma map operation. It's different from other explicit attaching paths,
> but there is nothing more we can do here.

OK.
	/*
	 * This is a concurrent attach while a group device is resetting. Reject
	 * it until iommu_dev_reset_done() attaches the device to group->domain.
	 *
	 * Worth noting that this may fail the dma map operation. But there is
	 * nothing more we can do here.
	 */


> > @@ -2253,6 +2264,16 @@ struct iommu_domain
> > *iommu_driver_get_domain_for_dev(struct device *dev)
> > 
> >  	lockdep_assert_held(&group->mutex);
> > 
> > +	/*
> > +	 * Driver handles the low-level __iommu_attach_device(), including
> > the
> > +	 * one invoked by iommu_dev_reset_done(), in which case the driver
> > must
> > +	 * get the resetting_domain over group->domain caching the one
> > prior to
> > +	 * iommu_dev_reset_prepare(), so that it wouldn't end up with
> > attaching
> > +	 * the device from group->domain (old) to group->domain (new).
> > +	 */
> > +	if (group->resetting_domain)
> > +		return group->resetting_domain;
> 
> It's a pretty long sentence. Let's break it.

OK.
	/*
	 * Driver handles the low-level __iommu_attach_device(), including the
	 * one invoked by iommu_dev_reset_done() that reattaches the device to
	 * the cached group->domain. In this case, the driver must get the old
	 * domain from group->resetting_domain rather than group->domain. This
	 * prevents it from reattaching the device from group->domain (old) to
	 * group->domain (new).
	 */

>> +int iommu_dev_reset_prepare(struct device *dev)
>
> If this is intended to be used by pci for now, it's clearer to have a 'pci'
> word in the name. Later when there is a demand calling it from other
> buses, discussion will catch eyes to ensure no racy of UAF etc.

Well, if we make it exclusive for PCI. Perhaps just move these two
from pci.c to iommu.c:

int pci_reset_iommu_prepare(struct pci_dev *dev);
void pci_reset_iommu_done(struct pci_dev *dev);

> > +	/*
> > +	 * Once the resetting_domain is set, any concurrent attachment to
> > this
> > +	 * iommu_group will be rejected, which would break the attach
> > routines
> > +	 * of the sibling devices in the same iommu_group. So, skip this case.
> > +	 */
> > +	if (dev_is_pci(dev)) {
> > +		struct group_device *gdev;
> > +
> > +		for_each_group_device(group, gdev) {
> > +			if (gdev->dev != dev)
> > +				return 0;
> > +		}
> > +	}
> 
> btw what'd be a real impact to reject concurrent attachment for sibling
> devices? This series already documents the impact in uAPI for the device
> under attachment, and the userspace already knows the restriction 
> of devices in the group which must be attached to a same hwpt.
> 
> Combining those knowledge I don't think there is a problem for 
> userspace to be aware of that resetting a device in a multi-dev
> group affects concurrent attachment of sibling devices...

It's following Jason's remarks:
https://lore.kernel.org/linux-iommu/20250915125357.GH1024672@nvidia.com/

Perhaps we should add that to the uAPI, given the race condition
that you mentioned below.

> > +	/* Re-attach RID domain back to group->domain */
> > +	if (group->domain != group->blocking_domain) {
> > +		WARN_ON(__iommu_attach_device(group->domain, dev,
> > +					      group->blocking_domain));
> > +	}
> 
> Even if we disallow resetting on a multi-dev group, there is still a
> corner case not taken care here.
> 
> It's possible that there is only one device in the group at prepare,
> coming with a device hotplug added to the group in the middle,
> then doing reset_done.
> 
> In this case the newly-added device will inherit the blocking domain.
> 
> Then reset_done should loop all devices in the group and re-attach
> all of them to the cached domain.

Oh, that's a good catch!

I will address all of your notes.

Thank you
Nicolin

Re: [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()

Posted by Baolu Lu 2 months, 3 weeks ago

On 11/11/25 13:12, Nicolin Chen wrote:
> +/**
> + * iommu_dev_reset_prepare() - Block IOMMU to prepare for a device reset
> + * @dev: device that is going to enter a reset routine
> + *
> + * When certain device is entering a reset routine, it wants to block any IOMMU
> + * activity during the reset routine. This includes blocking any translation as
> + * well as cache invalidation (especially the device cache).
> + *
> + * This function attaches all RID/PASID of the device's to IOMMU_DOMAIN_BLOCKED
> + * allowing any blocked-domain-supporting IOMMU driver to pause translation and
> + * cahce invalidation, but leaves the software domain pointers intact so later
> + * the iommu_dev_reset_done() can restore everything.
> + *
> + * Return: 0 on success or negative error code if the preparation failed.
> + *
> + * Caller must use iommu_dev_reset_prepare() and iommu_dev_reset_done() together
> + * before/after the core-level reset routine, to unset the resetting_domain.
> + *
> + * These two functions are designed to be used by PCI reset functions that would
> + * not invoke any racy iommu_release_device(), since PCI sysfs node gets removed
> + * before it notifies with a BUS_NOTIFY_REMOVED_DEVICE. When using them in other
> + * case, callers must ensure there will be no racy iommu_release_device() call,
> + * which otherwise would UAF the dev->iommu_group pointer.
> + */
> +int iommu_dev_reset_prepare(struct device *dev)
> +{
> +	struct iommu_group *group = dev->iommu_group;
> +	unsigned long pasid;
> +	void *entry;
> +	int ret = 0;
> +
> +	if (!dev_has_iommu(dev))
> +		return 0;

Nit: This interface is only for PCI layer, so why not just

	if (WARN_ON(!dev_is_pci(dev)))
		return -EINVAL;
?
> +
> +	guard(mutex)(&group->mutex);
> +
> +	/*
> +	 * Once the resetting_domain is set, any concurrent attachment to this
> +	 * iommu_group will be rejected, which would break the attach routines
> +	 * of the sibling devices in the same iommu_group. So, skip this case.
> +	 */
> +	if (dev_is_pci(dev)) {
> +		struct group_device *gdev;
> +
> +		for_each_group_device(group, gdev) {
> +			if (gdev->dev != dev)
> +				return 0;
> +		}
> +	}

With above dev_is_pci() check, here it can simply be,

	if (list_count_nodes(&group->devices) != 1)
		return 0;		

> +
> +	/* Re-entry is not allowed */
> +	if (WARN_ON(group->resetting_domain))
> +		return -EBUSY;
> +
> +	ret = __iommu_group_alloc_blocking_domain(group);
> +	if (ret)
> +		return ret;
> +
> +	/* Stage RID domain at blocking_domain while retaining group->domain */
> +	if (group->domain != group->blocking_domain) {
> +		ret = __iommu_attach_device(group->blocking_domain, dev,
> +					    group->domain);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	/*
> +	 * Stage PASID domains at blocking_domain while retaining pasid_array.
> +	 *
> +	 * The pasid_array is mostly fenced by group->mutex, except one reader
> +	 * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
> +	 */
> +	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
> +		iommu_remove_dev_pasid(dev, pasid,
> +				       pasid_array_entry_to_domain(entry));
> +
> +	group->resetting_domain = group->blocking_domain;
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_dev_reset_prepare);
> +
> +/**
> + * iommu_dev_reset_done() - Restore IOMMU after a device reset is finished
> + * @dev: device that has finished a reset routine
> + *
> + * When certain device has finished a reset routine, it wants to restore its
> + * IOMMU activity, including new translation as well as cache invalidation, by
> + * re-attaching all RID/PASID of the device's back to the domains retained in
> + * the core-level structure.
> + *
> + * Caller must pair it with a successfully returned iommu_dev_reset_prepare().
> + *
> + * Note that, although unlikely, there is a risk that re-attaching domains might
> + * fail due to some unexpected happening like OOM.
> + */
> +void iommu_dev_reset_done(struct device *dev)
> +{
> +	struct iommu_group *group = dev->iommu_group;
> +	unsigned long pasid;
> +	void *entry;
> +
> +	if (!dev_has_iommu(dev))
> +		return;
> +
> +	guard(mutex)(&group->mutex);
> +
> +	/* iommu_dev_reset_prepare() was bypassed for the device */
> +	if (!group->resetting_domain)
> +		return;
> +
> +	/* iommu_dev_reset_prepare() was not successfully called */
> +	if (WARN_ON(!group->blocking_domain))
> +		return;
> +
> +	/* Re-attach RID domain back to group->domain */
> +	if (group->domain != group->blocking_domain) {
> +		WARN_ON(__iommu_attach_device(group->domain, dev,
> +					      group->blocking_domain));
> +	}
> +
> +	/*
> +	 * Re-attach PASID domains back to the domains retained in pasid_array.
> +	 *
> +	 * The pasid_array is mostly fenced by group->mutex, except one reader
> +	 * in iommu_attach_handle_get(), so it's safe to read without xa_lock.
> +	 */
> +	xa_for_each_start(&group->pasid_array, pasid, entry, 1)
> +		WARN_ON(__iommu_set_group_pasid(
> +			pasid_array_entry_to_domain(entry), group, pasid,
> +			group->blocking_domain));
> +
> +	group->resetting_domain = NULL;
> +}
> +EXPORT_SYMBOL_GPL(iommu_dev_reset_done);
> +
>   #if IS_ENABLED(CONFIG_IRQ_MSI_IOMMU)
>   /**
>    * iommu_dma_prepare_msi() - Map the MSI page in the IOMMU domain

Thanks,
baolu

Re: [PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()

Posted by Nicolin Chen 2 months, 3 weeks ago

On Wed, Nov 12, 2025 at 02:18:09PM +0800, Baolu Lu wrote:
> On 11/11/25 13:12, Nicolin Chen wrote:
> > +int iommu_dev_reset_prepare(struct device *dev)
> > +{
> > +	struct iommu_group *group = dev->iommu_group;
> > +	unsigned long pasid;
> > +	void *entry;
> > +	int ret = 0;
> > +
> > +	if (!dev_has_iommu(dev))
> > +		return 0;
> 
> Nit: This interface is only for PCI layer, so why not just
> 
> 	if (WARN_ON(!dev_is_pci(dev)))
> 		return -EINVAL;
> ?

The function naming was a bit generic, but we do have a specific
use case here. So, yea, let's add one.

> > +
> > +	guard(mutex)(&group->mutex);
> > +
> > +	/*
> > +	 * Once the resetting_domain is set, any concurrent attachment to this
> > +	 * iommu_group will be rejected, which would break the attach routines
> > +	 * of the sibling devices in the same iommu_group. So, skip this case.
> > +	 */
> > +	if (dev_is_pci(dev)) {
> > +		struct group_device *gdev;
> > +
> > +		for_each_group_device(group, gdev) {
> > +			if (gdev->dev != dev)
> > +				return 0;
> > +		}
> > +	}
> 
> With above dev_is_pci() check, here it can simply be,
> 
> 	if (list_count_nodes(&group->devices) != 1)
> 		return 0;		

Will replace that.

Thanks!
Nicolin

[PATCH v5 1/5] iommu: Lock group->mutex in iommu_deferred_attach()
[PATCH v5 2/5] iommu: Tiny domain for iommu_setup_dma_ops()
[PATCH v5 3/5] iommu: Add iommu_driver_get_domain_for_dev() helper
[PATCH v5 4/5] iommu: Introduce iommu_dev_reset_prepare() and iommu_dev_reset_done()
[PATCH v5 5/5] pci: Suspend iommu function prior to resetting a device