[PATCH] iommufd: Enforce IOMMU_RESV_SW_MSI upon hwpt_paging allocation
Posted by Nicolin Chen 1 year, 6 months ago
IOMMU_RESV_SW_MSI is a unique region defined by an IOMMU driver. Though it
is eventually used by a device for address translation to an MSI location
(including nested cases), practically it is a universal region across all
domains allocated for the IOMMU that defines it.

Currently IOMMUFD core fetches and reserves the region during an attach to
an hwpt_paging. It works with a hwpt_paging-only case, but might not work
with a nested case where a device could directly attach to a hwpt_nested,
bypassing the hwpt_paging attachment.

Move the enforcement forward, to the hwpt_paging allocation function. Then
clean up all the SW_MSI related things in the attach/replace routine.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/device.c          | 67 ++-----------------------
 drivers/iommu/iommufd/hw_pagetable.c    | 45 +++++++++++++++++
 drivers/iommu/iommufd/io_pagetable.c    | 13 +++--
 drivers/iommu/iommufd/iommufd_private.h |  5 +-
 4 files changed, 55 insertions(+), 75 deletions(-)

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 9a7ec5997c61c..bc8baee32a9da 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -90,7 +90,6 @@ static struct iommufd_group *iommufd_get_group(struct iommufd_ctx *ictx,
 	kref_init(&new_igroup->ref);
 	mutex_init(&new_igroup->lock);
 	INIT_LIST_HEAD(&new_igroup->device_list);
-	new_igroup->sw_msi_start = PHYS_ADDR_MAX;
 	/* group reference moves into new_igroup */
 	new_igroup->group = group;
 
@@ -293,64 +292,6 @@ u32 iommufd_device_to_id(struct iommufd_device *idev)
 }
 EXPORT_SYMBOL_NS_GPL(iommufd_device_to_id, IOMMUFD);
 
-static int iommufd_group_setup_msi(struct iommufd_group *igroup,
-				   struct iommufd_hwpt_paging *hwpt_paging)
-{
-	phys_addr_t sw_msi_start = igroup->sw_msi_start;
-	int rc;
-
-	/*
-	 * If the IOMMU driver gives a IOMMU_RESV_SW_MSI then it is asking us to
-	 * call iommu_get_msi_cookie() on its behalf. This is necessary to setup
-	 * the MSI window so iommu_dma_prepare_msi() can install pages into our
-	 * domain after request_irq(). If it is not done interrupts will not
-	 * work on this domain.
-	 *
-	 * FIXME: This is conceptually broken for iommufd since we want to allow
-	 * userspace to change the domains, eg switch from an identity IOAS to a
-	 * DMA IOAS. There is currently no way to create a MSI window that
-	 * matches what the IRQ layer actually expects in a newly created
-	 * domain.
-	 */
-	if (sw_msi_start != PHYS_ADDR_MAX && !hwpt_paging->msi_cookie) {
-		rc = iommu_get_msi_cookie(hwpt_paging->common.domain,
-					  sw_msi_start);
-		if (rc)
-			return rc;
-
-		/*
-		 * iommu_get_msi_cookie() can only be called once per domain,
-		 * it returns -EBUSY on later calls.
-		 */
-		hwpt_paging->msi_cookie = true;
-	}
-	return 0;
-}
-
-static int iommufd_hwpt_paging_attach(struct iommufd_hwpt_paging *hwpt_paging,
-				      struct iommufd_device *idev)
-{
-	int rc;
-
-	lockdep_assert_held(&idev->igroup->lock);
-
-	rc = iopt_table_enforce_dev_resv_regions(&hwpt_paging->ioas->iopt,
-						 idev->dev,
-						 &idev->igroup->sw_msi_start);
-	if (rc)
-		return rc;
-
-	if (list_empty(&idev->igroup->device_list)) {
-		rc = iommufd_group_setup_msi(idev->igroup, hwpt_paging);
-		if (rc) {
-			iopt_remove_reserved_iova(&hwpt_paging->ioas->iopt,
-						  idev->dev);
-			return rc;
-		}
-	}
-	return 0;
-}
-
 int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
 				struct iommufd_device *idev)
 {
@@ -364,7 +305,8 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
 	}
 
 	if (hwpt_is_paging(hwpt)) {
-		rc = iommufd_hwpt_paging_attach(to_hwpt_paging(hwpt), idev);
+		rc = iopt_table_enforce_dev_resv_regions(
+				&to_hwpt_paging(hwpt)->ioas->iopt, idev->dev);
 		if (rc)
 			goto err_unlock;
 	}
@@ -453,15 +395,12 @@ iommufd_group_do_replace_paging(struct iommufd_group *igroup,
 	    hwpt_paging->ioas != to_hwpt_paging(old_hwpt)->ioas) {
 		list_for_each_entry(cur, &igroup->device_list, group_item) {
 			rc = iopt_table_enforce_dev_resv_regions(
-				&hwpt_paging->ioas->iopt, cur->dev, NULL);
+				&hwpt_paging->ioas->iopt, cur->dev);
 			if (rc)
 				goto err_unresv;
 		}
 	}
 
-	rc = iommufd_group_setup_msi(igroup, hwpt_paging);
-	if (rc)
-		goto err_unresv;
 	return 0;
 
 err_unresv:
diff --git a/drivers/iommu/iommufd/hw_pagetable.c b/drivers/iommu/iommufd/hw_pagetable.c
index aefde4443671e..dfb132e4dfbd2 100644
--- a/drivers/iommu/iommufd/hw_pagetable.c
+++ b/drivers/iommu/iommufd/hw_pagetable.c
@@ -82,6 +82,42 @@ iommufd_hwpt_paging_enforce_cc(struct iommufd_hwpt_paging *hwpt_paging)
 	return 0;
 }
 
+static int
+iommufd_hwpt_paging_enforce_sw_msi(struct iommufd_hwpt_paging *hwpt_paging,
+				   struct iommufd_device *idev)
+{
+	struct iommu_domain *domain = hwpt_paging->common.domain;
+	struct io_pagetable *iopt = &hwpt_paging->ioas->iopt;
+	phys_addr_t sw_msi_start = PHYS_ADDR_MAX;
+	struct iommu_resv_region *resv;
+	LIST_HEAD(resv_regions);
+	int rc = 0;
+
+	if (iommufd_should_fail())
+		return -EINVAL;
+
+	/* FIXME: drivers allocate memory but there is no failure propagated */
+	iommu_get_resv_regions(idev->dev, &resv_regions);
+	list_for_each_entry(resv, &resv_regions, list) {
+		if (resv->type != IOMMU_RESV_SW_MSI)
+			continue;
+		down_write(&iopt->iova_rwsem);
+		/* owner=domain so that abort/destroy() can clean it up */
+		rc = iopt_reserve_iova(iopt, resv->start,
+				       resv->length - 1 + resv->start, domain);
+		up_write(&iopt->iova_rwsem);
+		if (!rc)
+			sw_msi_start = resv->start;
+		break;
+	}
+	iommu_put_resv_regions(idev->dev, &resv_regions);
+
+	if (sw_msi_start == PHYS_ADDR_MAX)
+		return rc;
+
+	return iommu_get_msi_cookie(domain, sw_msi_start);
+}
+
 /**
  * iommufd_hwpt_paging_alloc() - Get a PAGING iommu_domain for a device
  * @ictx: iommufd context
@@ -173,6 +209,15 @@ iommufd_hwpt_paging_alloc(struct iommufd_ctx *ictx, struct iommufd_ioas *ioas,
 			goto out_abort;
 	}
 
+	/*
+	 * IOMMU_RESV_SW_MSI is a universal per-IOMMU IOVA region arbitrarily
+	 * defined by a driver. Any hw_pagetable that is allocated for such an
+	 * IOMMU must enforce the region in its reserved space.
+	 */
+	rc = iommufd_hwpt_paging_enforce_sw_msi(hwpt_paging, idev);
+	if (rc)
+		goto out_abort;
+
 	/*
 	 * immediate_attach exists only to accommodate iommu drivers that cannot
 	 * directly allocate a domain. These drivers do not finish creating the
diff --git a/drivers/iommu/iommufd/io_pagetable.c b/drivers/iommu/iommufd/io_pagetable.c
index 05fd9d3abf1b8..c9b7c7f6e046b 100644
--- a/drivers/iommu/iommufd/io_pagetable.c
+++ b/drivers/iommu/iommufd/io_pagetable.c
@@ -1368,8 +1368,7 @@ void iopt_remove_access(struct io_pagetable *iopt,
 
 /* Narrow the valid_iova_itree to include reserved ranges from a device. */
 int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt,
-					struct device *dev,
-					phys_addr_t *sw_msi_start)
+					struct device *dev)
 {
 	struct iommu_resv_region *resv;
 	LIST_HEAD(resv_regions);
@@ -1387,14 +1386,14 @@ int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt,
 	list_for_each_entry(resv, &resv_regions, list) {
 		if (resv->type == IOMMU_RESV_DIRECT_RELAXABLE)
 			continue;
-
-		if (sw_msi_start && resv->type == IOMMU_RESV_MSI)
-			num_hw_msi++;
-		if (sw_msi_start && resv->type == IOMMU_RESV_SW_MSI) {
-			*sw_msi_start = resv->start;
+		if (resv->type == IOMMU_RESV_SW_MSI) {
 			num_sw_msi++;
+			continue;
 		}
 
+		if (resv->type == IOMMU_RESV_MSI)
+			num_hw_msi++;
+
 		rc = iopt_reserve_iova(iopt, resv->start,
 				       resv->length - 1 + resv->start, dev);
 		if (rc)
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 92efe30a8f0d0..d61ea73776261 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -89,8 +89,7 @@ int iopt_table_add_domain(struct io_pagetable *iopt,
 void iopt_table_remove_domain(struct io_pagetable *iopt,
 			      struct iommu_domain *domain);
 int iopt_table_enforce_dev_resv_regions(struct io_pagetable *iopt,
-					struct device *dev,
-					phys_addr_t *sw_msi_start);
+					struct device *dev);
 int iopt_set_allow_iova(struct io_pagetable *iopt,
 			struct rb_root_cached *allowed_iova);
 int iopt_reserve_iova(struct io_pagetable *iopt, unsigned long start,
@@ -302,7 +301,6 @@ struct iommufd_hwpt_paging {
 	struct iommufd_ioas *ioas;
 	bool auto_domain : 1;
 	bool enforce_cache_coherency : 1;
-	bool msi_cookie : 1;
 	bool nest_parent : 1;
 	/* Head at iommufd_ioas::hwpt_list */
 	struct list_head hwpt_item;
@@ -382,7 +380,6 @@ struct iommufd_group {
 	struct iommu_group *group;
 	struct iommufd_hw_pagetable *hwpt;
 	struct list_head device_list;
-	phys_addr_t sw_msi_start;
 };
 
 /*
-- 
2.43.0
RE: [PATCH] iommufd: Enforce IOMMU_RESV_SW_MSI upon hwpt_paging allocation
Posted by Tian, Kevin 1 year, 6 months ago
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Monday, July 29, 2024 7:51 AM
> 
> IOMMU_RESV_SW_MSI is a unique region defined by an IOMMU driver. Though it
> is eventually used by a device for address translation to an MSI location
> (including nested cases), practically it is a universal region across all
> domains allocated for the IOMMU that defines it.
> 
> Currently IOMMUFD core fetches and reserves the region during an attach to
> an hwpt_paging. It works with a hwpt_paging-only case, but might not work
> with a nested case where a device could directly attach to a hwpt_nested,
> bypassing the hwpt_paging attachment.

This probably needs a bit more context. IIUC it's the ARM-side choice
that, instead of letting the VMM emulate a vITS for S1 and then map it
to the physical ITS range in S2, it relies on the kernel to continue
the msi cookie reservation in S2 and then expects the guest to
identity-map it in S1.

With that context, if a device is directly attached to a hwpt_nested,
the hwpt_paging attachment is bypassed, including the msi doorbell
setup on the parent S2, and then it's broken.

> @@ -364,7 +305,8 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
>  	}
> 
>  	if (hwpt_is_paging(hwpt)) {
> -		rc = iommufd_hwpt_paging_attach(to_hwpt_paging(hwpt), idev);
> +		rc = iopt_table_enforce_dev_resv_regions(
> +				&to_hwpt_paging(hwpt)->ioas->iopt, idev->dev);

Is it simpler to extend the original operation to the parent S2 when
it's hwpt_nested?
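
E.g. a rough sketch of what I mean (untested; assumes any non-paging
hwpt reaching attach here is a hwpt_nested):

	struct iommufd_hwpt_paging *hwpt_paging;

	if (hwpt_is_paging(hwpt))
		hwpt_paging = to_hwpt_paging(hwpt);
	else	/* hwpt_nested: walk up to its parent S2 */
		hwpt_paging = container_of(hwpt, struct iommufd_hwpt_nested,
					   common)->parent;

	rc = iommufd_hwpt_paging_attach(hwpt_paging, idev);
	if (rc)
		goto err_unlock;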

The name iommufd_hwpt_paging_attach() is a bit misleading. The
actual work there is all about reservations. It doesn't change any
tracking structure about attachment between device and hwpt.

The only downside is that the reserved regions of dev1 (attached to
a hwpt_nested) are unnecessarily added to an S2 which might be directly
attached only by dev2, so the available ranges for dev2 are
unnecessarily shrunk.

But I'm not sure that would be a real problem in practice, given:
1) there is no usage that comes close to using up the entire IOVA
   space yet,
2) the guest may change the viommu mode to switch between nested
   and paging, so the VMM has to take all devices' reserved regions
   into consideration anyway when composing the GPA space.

With that, I think continuing this per-device reservation scheme is
easier than adding a specific reservation for SW_MSI at hwpt creation
time and then further requiring a check at attach time to verify that
the attached device uses the same address as the one used during
allocation.
Re: [PATCH] iommufd: Enforce IOMMU_RESV_SW_MSI upon hwpt_paging allocation
Posted by Nicolin Chen 1 year, 6 months ago
On Wed, Jul 31, 2024 at 07:45:46AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Monday, July 29, 2024 7:51 AM
> >
> > IOMMU_RESV_SW_MSI is a unique region defined by an IOMMU driver. Though it
> > is eventually used by a device for address translation to an MSI location
> > (including nested cases), practically it is a universal region across all
> > domains allocated for the IOMMU that defines it.
> >
> > Currently IOMMUFD core fetches and reserves the region during an attach to
> > an hwpt_paging. It works with a hwpt_paging-only case, but might not work
> > with a nested case where a device could directly attach to a hwpt_nested,
> > bypassing the hwpt_paging attachment.
> 
> This probably needs a bit more context. IIUC it's the ARM-side choice
> that, instead of letting the VMM emulate a vITS for S1 and then map it
> to the physical ITS range in S2, it relies on the kernel to continue
> the msi cookie reservation in S2 and then expects the guest to
> identity-map it in S1.
> 
> With that context, if a device is directly attached to a hwpt_nested,
> the hwpt_paging attachment is bypassed, including the msi doorbell
> setup on the parent S2, and then it's broken.

Yes. That's exactly the issue. My bad for oversimplifying it.

> > @@ -364,7 +305,8 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
> >       }
> >
> >       if (hwpt_is_paging(hwpt)) {
> > -             rc = iommufd_hwpt_paging_attach(to_hwpt_paging(hwpt), idev);
> > +             rc = iopt_table_enforce_dev_resv_regions(
> > +                             &to_hwpt_paging(hwpt)->ioas->iopt, idev->dev);
> 
> Is it simpler to extend the original operation to the parent S2 when
> it's hwpt_nested?

Likely. I recall that was what one of our WIP versions did.

> The name iommufd_hwpt_paging_attach() is a bit misleading. The
> actual work there is all about reservations. It doesn't change any
> tracking structure about attachment between device and hwpt.

How about iommufd_hwpt_enforce/remove_rr(), taking an hwpt vs. an
hwpt_paging?

> The only downside is that the reserved regions of dev1 (attached to
> a hwpt_nested) are unnecessarily added to an S2 which might be directly
> attached only by dev2, so the available ranges for dev2 are
> unnecessarily shrunk.
> 
> But I'm not sure that would be a real problem in practice, given:
> 1) there is no usage that comes close to using up the entire IOVA
>    space yet,
> 2) the guest may change the viommu mode to switch between nested
>    and paging, so the VMM has to take all devices' reserved regions
>    into consideration anyway when composing the GPA space.

That sounds reasonable to me.

> With that, I think continuing this per-device reservation scheme is
> easier than adding a specific reservation for SW_MSI at hwpt creation
> time and then further requiring a check at attach time to verify that
> the attached device uses the same address as the one used during
> allocation.

Jason, do you agree?

Thanks
Nic
Re: [PATCH] iommufd: Enforce IOMMU_RESV_SW_MSI upon hwpt_paging allocation
Posted by Nicolin Chen 1 year, 6 months ago
On Wed, Jul 31, 2024 at 11:13:11AM -0700, Nicolin Chen wrote:
> On Wed, Jul 31, 2024 at 07:45:46AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Monday, July 29, 2024 7:51 AM
> > > @@ -364,7 +305,8 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
> > >       }
> > >
> > >       if (hwpt_is_paging(hwpt)) {
> > > -             rc = iommufd_hwpt_paging_attach(to_hwpt_paging(hwpt), idev);
> > > +             rc = iopt_table_enforce_dev_resv_regions(
> > > +                             &to_hwpt_paging(hwpt)->ioas->iopt, idev->dev);
> > 
> > Is it simpler to extend the original operation to the parent S2 when
> > it's hwpt_nested?
> 
> Likely. I recall that was what one of our WIP versions did.
> 
> > The name iommufd_hwpt_paging_attach() is a bit misleading. The
> > actual work there is all about reservations. It doesn't change any
> > tracking structure about attachment between device and hwpt.
> 
> How about iommufd_hwpt_enforce/remove_rr(), taking an hwpt vs. an
> hwpt_paging?
 
> > With that, I think continuing this per-device reservation scheme is
> > easier than adding a specific reservation for SW_MSI at hwpt creation
> > time and then further requiring a check at attach time to verify that
> > the attached device uses the same address as the one used during
> > allocation.
> 
> Jason, do you agree?

I came up with something plus a bit of naming alignment:
    iommufd_device_attach_reserved_iova()
    iommufd_group_remove_reserved_iova()
    iommufd_group_do_replace_reserved_iova()

If it looks good to both of you, I will send a formal patch.

Thanks
Nic

diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c
index 76b7297d22b0f..f03218dc4861e 100644
--- a/drivers/iommu/iommufd/device.c
+++ b/drivers/iommu/iommufd/device.c
@@ -338,8 +338,9 @@ static int iommufd_group_setup_msi(struct iommufd_group *igroup,
 	return 0;
 }
 
-static int iommufd_hwpt_paging_attach(struct iommufd_hwpt_paging *hwpt_paging,
-				      struct iommufd_device *idev)
+static int
+iommufd_device_attach_reserved_iova(struct iommufd_device *idev,
+				    struct iommufd_hwpt_paging *hwpt_paging)
 {
 	int rc;
 
@@ -374,11 +375,9 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
 		goto err_unlock;
 	}
 
-	if (hwpt_is_paging(hwpt)) {
-		rc = iommufd_hwpt_paging_attach(to_hwpt_paging(hwpt), idev);
-		if (rc)
-			goto err_unlock;
-	}
+	rc = iommufd_device_attach_reserved_iova(idev, to_hwpt_paging(hwpt));
+	if (rc)
+		goto err_unlock;
 
 	/*
 	 * Only attach to the group once for the first device that is in the
@@ -398,9 +397,7 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
 	mutex_unlock(&idev->igroup->lock);
 	return 0;
 err_unresv:
-	if (hwpt_is_paging(hwpt))
-		iopt_remove_reserved_iova(&to_hwpt_paging(hwpt)->ioas->iopt,
-					  idev->dev);
+	iopt_remove_reserved_iova(&to_hwpt_paging(hwpt)->ioas->iopt, idev->dev);
 err_unlock:
 	mutex_unlock(&idev->igroup->lock);
 	return rc;
@@ -417,9 +414,7 @@ iommufd_hw_pagetable_detach(struct iommufd_device *idev)
 		iommufd_hwpt_detach_device(hwpt, idev);
 		idev->igroup->hwpt = NULL;
 	}
-	if (hwpt_is_paging(hwpt))
-		iopt_remove_reserved_iova(&to_hwpt_paging(hwpt)->ioas->iopt,
-					  idev->dev);
+	iopt_remove_reserved_iova(&to_hwpt_paging(hwpt)->ioas->iopt, idev->dev);
 	mutex_unlock(&idev->igroup->lock);
 
 	/* Caller must destroy hwpt */
@@ -451,8 +446,8 @@ iommufd_group_remove_reserved_iova(struct iommufd_group *igroup,
 }
 
 static int
-iommufd_group_do_replace_paging(struct iommufd_group *igroup,
-				struct iommufd_hwpt_paging *hwpt_paging)
+iommufd_group_do_replace_reserved_iova(struct iommufd_group *igroup,
+				       struct iommufd_hwpt_paging *hwpt_paging)
 {
 	struct iommufd_hw_pagetable *old_hwpt = igroup->hwpt;
 	struct iommufd_device *cur;
@@ -460,8 +455,7 @@ iommufd_group_do_replace_paging(struct iommufd_group *igroup,
 
 	lockdep_assert_held(&igroup->lock);
 
-	if (!hwpt_is_paging(old_hwpt) ||
-	    hwpt_paging->ioas != to_hwpt_paging(old_hwpt)->ioas) {
+	if (hwpt_paging->ioas != to_hwpt_paging(old_hwpt)->ioas) {
 		list_for_each_entry(cur, &igroup->device_list, group_item) {
 			rc = iopt_table_enforce_dev_resv_regions(
 				&hwpt_paging->ioas->iopt, cur->dev, NULL);
@@ -502,20 +496,15 @@ iommufd_device_do_replace(struct iommufd_device *idev,
 	}
 
 	old_hwpt = igroup->hwpt;
-	if (hwpt_is_paging(hwpt)) {
-		rc = iommufd_group_do_replace_paging(igroup,
-						     to_hwpt_paging(hwpt));
-		if (rc)
-			goto err_unlock;
-	}
+	rc = iommufd_group_do_replace_reserved_iova(igroup, to_hwpt_paging(hwpt));
+	if (rc)
+		goto err_unlock;
 
 	rc = iommufd_hwpt_replace_device(idev, hwpt, old_hwpt);
 	if (rc)
 		goto err_unresv;
 
-	if (hwpt_is_paging(old_hwpt) &&
-	    (!hwpt_is_paging(hwpt) ||
-	     to_hwpt_paging(hwpt)->ioas != to_hwpt_paging(old_hwpt)->ioas))
+	if (to_hwpt_paging(hwpt)->ioas != to_hwpt_paging(old_hwpt)->ioas)
 		iommufd_group_remove_reserved_iova(igroup,
 						   to_hwpt_paging(old_hwpt));
 
@@ -535,9 +524,7 @@ iommufd_device_do_replace(struct iommufd_device *idev,
 	/* Caller must destroy old_hwpt */
 	return old_hwpt;
 err_unresv:
-	if (hwpt_is_paging(hwpt))
-		iommufd_group_remove_reserved_iova(igroup,
-						   to_hwpt_paging(hwpt));
+	iommufd_group_remove_reserved_iova(igroup, to_hwpt_paging(hwpt));
 err_unlock:
 	mutex_unlock(&idev->igroup->lock);
 	return ERR_PTR(rc);
diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h
index 04109572a53ca..618524e1ce9c5 100644
--- a/drivers/iommu/iommufd/iommufd_private.h
+++ b/drivers/iommu/iommufd/iommufd_private.h
@@ -316,7 +316,9 @@ static inline bool hwpt_is_paging(struct iommufd_hw_pagetable *hwpt)
 static inline struct iommufd_hwpt_paging *
 to_hwpt_paging(struct iommufd_hw_pagetable *hwpt)
 {
-	return container_of(hwpt, struct iommufd_hwpt_paging, common);
+	if (hwpt_is_paging(hwpt))
+		return container_of(hwpt, struct iommufd_hwpt_paging, common);
+	return container_of(hwpt, struct iommufd_hwpt_nested, common)->parent;
 }
 
 static inline struct iommufd_hwpt_paging *
Re: [PATCH] iommufd: Enforce IOMMU_RESV_SW_MSI upon hwpt_paging allocation
Posted by Jason Gunthorpe 1 year, 6 months ago
On Wed, Jul 31, 2024 at 02:21:40PM -0700, Nicolin Chen wrote:
> On Wed, Jul 31, 2024 at 11:13:11AM -0700, Nicolin Chen wrote:
> > On Wed, Jul 31, 2024 at 07:45:46AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Monday, July 29, 2024 7:51 AM
> > > > @@ -364,7 +305,8 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
> > > >       }
> > > >
> > > >       if (hwpt_is_paging(hwpt)) {
> > > > -             rc = iommufd_hwpt_paging_attach(to_hwpt_paging(hwpt), idev);
> > > > +             rc = iopt_table_enforce_dev_resv_regions(
> > > > +                             &to_hwpt_paging(hwpt)->ioas->iopt, idev->dev);
> > > 
> > > Is it simpler to extend the original operation to the parent S2 when
> > > it's hwpt_nested?
> > 
> > Likely. I recall that was what one of our WIP versions did.
> > 
> > > The name iommufd_hwpt_paging_attach() is a bit misleading. The
> > > actual work there is all about reservations. It doesn't change any
> > > tracking structure about attachment between device and hwpt.
> > 
> > How about iommufd_hwpt_enforce/remove_rr(), taking an hwpt vs. an
> > hwpt_paging?
>  
> > > With that, I think continuing this per-device reservation scheme is
> > > easier than adding a specific reservation for SW_MSI at hwpt creation
> > > time and then further requiring a check at attach time to verify that
> > > the attached device uses the same address as the one used during
> > > allocation.
> > 
> > Jason, do you agree?
> 
> I came up with something plus a bit of naming alignment:
>     iommufd_device_attach_reserved_iova()
>     iommufd_group_remove_reserved_iova()
>     iommufd_group_do_replace_reserved_iova()
> 
> If it looks good to both of you, I will send a formal patch.

This seems like a more consistent direction, let's try to make it work.

Jason
RE: [PATCH] iommufd: Enforce IOMMU_RESV_SW_MSI upon hwpt_paging allocation
Posted by Tian, Kevin 1 year, 6 months ago
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, August 1, 2024 5:22 AM
> 
> @@ -316,7 +316,9 @@ static inline bool hwpt_is_paging(struct iommufd_hw_pagetable *hwpt)
>  static inline struct iommufd_hwpt_paging *
>  to_hwpt_paging(struct iommufd_hw_pagetable *hwpt)
>  {
> -	return container_of(hwpt, struct iommufd_hwpt_paging, common);
> +	if (hwpt_is_paging(hwpt))
> +		return container_of(hwpt, struct iommufd_hwpt_paging, common);
> +	return container_of(hwpt, struct iommufd_hwpt_nested, common)->parent;
>  }
> 

Hmm, this doesn't work with future new hwpt types.

It's clearer to pass the hwpt into the earlier helpers, then do the type
check inside and make it a nop for types other than paging/nesting.
Re: [PATCH] iommufd: Enforce IOMMU_RESV_SW_MSI upon hwpt_paging allocation
Posted by Nicolin Chen 1 year, 6 months ago
On Thu, Aug 01, 2024 at 08:10:57AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Thursday, August 1, 2024 5:22 AM
> >
> > @@ -316,7 +316,9 @@ static inline bool hwpt_is_paging(struct iommufd_hw_pagetable *hwpt)
> >  static inline struct iommufd_hwpt_paging *
> >  to_hwpt_paging(struct iommufd_hw_pagetable *hwpt)
> >  {
> > -     return container_of(hwpt, struct iommufd_hwpt_paging, common);
> > +     if (hwpt_is_paging(hwpt))
> > +             return container_of(hwpt, struct iommufd_hwpt_paging, common);
> > +     return container_of(hwpt, struct iommufd_hwpt_nested, common)->parent;
> >  }
> >
> 
> Hmm, this doesn't work with future new hwpt types.

I was trying to make it a smaller change, but that was short-sighted.

> It's clearer to pass the hwpt into the earlier helpers, then do the type
> check inside and make it a nop for types other than paging/nesting.

Ack.

Thanks
Nicolin
Re: [PATCH] iommufd: Enforce IOMMU_RESV_SW_MSI upon hwpt_paging allocation
Posted by Jason Gunthorpe 1 year, 6 months ago
On Sun, Jul 28, 2024 at 04:51:06PM -0700, Nicolin Chen wrote:
> IOMMU_RESV_SW_MSI is a unique region defined by an IOMMU driver. Though it
> is eventually used by a device for address translation to an MSI location
> (including nested cases), practically it is a universal region across all
> domains allocated for the IOMMU that defines it.
> 
> Currently IOMMUFD core fetches and reserves the region during an attach to
> an hwpt_paging. It works with a hwpt_paging-only case, but might not work
> with a nested case where a device could directly attach to a hwpt_nested,
> bypassing the hwpt_paging attachment.

Well, it does this because the attach is the only place where we have
*all* the devices available.

Doing it during allocation means you get only one device.

So, I'd imagine something more like we allocate the MSI region during
allocation, for the specific device given at allocation time.

But continue to enforce that every attached device also has its MSI
region allocated, which probably just means checking that the
driver's reported MSI address is the same as the address used during
allocation?
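
A minimal sketch of such a check (untested; the sw_msi_start field
recorded at allocation time is hypothetical):

	static int iommufd_check_dev_sw_msi(struct iommufd_hwpt_paging *hwpt_paging,
					    struct device *dev)
	{
		struct iommu_resv_region *resv;
		LIST_HEAD(resv_regions);
		int rc = 0;

		iommu_get_resv_regions(dev, &resv_regions);
		list_for_each_entry(resv, &resv_regions, list) {
			if (resv->type != IOMMU_RESV_SW_MSI)
				continue;
			/* Must match the address recorded at allocation time */
			if (resv->start != hwpt_paging->sw_msi_start)
				rc = -EINVAL;
			break;
		}
		iommu_put_resv_regions(dev, &resv_regions);
		return rc;
	}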

> @@ -364,7 +305,8 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
>  	}
>  
>  	if (hwpt_is_paging(hwpt)) {
> -		rc = iommufd_hwpt_paging_attach(to_hwpt_paging(hwpt), idev);
> +		rc = iopt_table_enforce_dev_resv_regions(
> +				&to_hwpt_paging(hwpt)->ioas->iopt, idev->dev);
>  		if (rc)
>  			goto err_unlock;

And this seems kind of weird: shouldn't any change to the ioas regions
happen when the domain is joined to the ioas with the MSI mapping, not
during attach? Like, we don't want any change of the IOAS to blow away
the MSI region.

This should probably get selftest coverage as well; it seems easy
enough to add to the mock iommu driver?

Jason
Re: [PATCH] iommufd: Enforce IOMMU_RESV_SW_MSI upon hwpt_paging allocation
Posted by Nicolin Chen 1 year, 6 months ago
On Mon, Jul 29, 2024 at 03:24:46PM -0300, Jason Gunthorpe wrote:
> On Sun, Jul 28, 2024 at 04:51:06PM -0700, Nicolin Chen wrote:
> > IOMMU_RESV_SW_MSI is a unique region defined by an IOMMU driver. Though it
> > is eventually used by a device for address translation to an MSI location
> > (including nested cases), practically it is a universal region across all
> > domains allocated for the IOMMU that defines it.
> > 
> > Currently IOMMUFD core fetches and reserves the region during an attach to
> > an hwpt_paging. It works with a hwpt_paging-only case, but might not work
> > with a nested case where a device could directly attach to a hwpt_nested,
> > bypassing the hwpt_paging attachment.
> 
> Well, it does this because the attach is the only place where we have
> *all* the devices available.
> 
> Doing it during allocation means you get only one device.
>
> So, I'd imagine something more like we allocate the MSI region during
> allocation, for the specific device given at allocation time.
> 
> But continue to enforce that every attached device also has its MSI
> region allocated, which probably just means checking that the
> driver's reported MSI address is the same as the address used during
> allocation?

The idea here is to treat IOMMU_RESV_SW_MSI as a per-IOMMU thing
vs. a per-device thing, because it's defined by SW for an IOMMU
that doesn't have a HW MSI window. In other words, devices don't
really matter, so long as we know which IOMMU it is, and that can
be decided by any "one device".

This is based on an enforcement that there won't be a case mixing
IOMMU_RESV_SW_MSI and IOMMU_RESV_MSI on a single IOMMU instance.

> > @@ -364,7 +305,8 @@ int iommufd_hw_pagetable_attach(struct iommufd_hw_pagetable *hwpt,
> >  	}
> >  
> >  	if (hwpt_is_paging(hwpt)) {
> > -		rc = iommufd_hwpt_paging_attach(to_hwpt_paging(hwpt), idev);
> > +		rc = iopt_table_enforce_dev_resv_regions(
> > +				&to_hwpt_paging(hwpt)->ioas->iopt, idev->dev);
> >  		if (rc)
> >  			goto err_unlock;
> 
> And this seems kind of weird: shouldn't any change to the ioas regions
> happen when the domain is joined to the ioas with the MSI mapping, not
> during attach? Like, we don't want any change of the IOAS to blow away
> the MSI region.

Hmm, I'm not quite getting this part...

This doesn't change the current attach flow except moving that
IOMMU_RESV_SW_MSI from iopt_table_enforce_dev_resv_regions() to
the allocation routine, since other types of reserved regions
are still per-device. The unwrapping of iommufd_hwpt_paging_attach()
is a cleanup, since we moved iommu_get_msi_cookie() away too.

> This should probably get selftest coverage as well; it seems easy
> enough to add to the mock iommu driver?

OK. I will try adding one. I assume we only need a reserved MSI
region in the mock device and another testing ioctl verifying that it
is reserved in the ioas/iopt upon an allocation?
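
A rough sketch for the mock driver side (untested; MOCK_SW_MSI_START
and MOCK_SW_MSI_LENGTH are made-up values for illustration):

#define MOCK_SW_MSI_START	0x8000000UL
#define MOCK_SW_MSI_LENGTH	SZ_1M

static void mock_get_resv_regions(struct device *dev,
				  struct list_head *head)
{
	int prot = IOMMU_WRITE | IOMMU_NOEXEC | IOMMU_MMIO;
	struct iommu_resv_region *region;

	/* Report a single SW_MSI region, like real drivers do */
	region = iommu_alloc_resv_region(MOCK_SW_MSI_START,
					 MOCK_SW_MSI_LENGTH, prot,
					 IOMMU_RESV_SW_MSI, GFP_KERNEL);
	if (!region)
		return;
	list_add_tail(&region->list, head);
}

Then wire .get_resv_regions = mock_get_resv_regions into mock_ops, and
have the test allocate a hwpt and verify that the range is excluded
from the reported IOVA ranges.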

Thanks
Nicolin
RE: [PATCH] iommufd: Enforce IOMMU_RESV_SW_MSI upon hwpt_paging allocation
Posted by Tian, Kevin 1 year, 6 months ago
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, July 30, 2024 4:06 AM
> 
> On Mon, Jul 29, 2024 at 03:24:46PM -0300, Jason Gunthorpe wrote:
> > On Sun, Jul 28, 2024 at 04:51:06PM -0700, Nicolin Chen wrote:
> > > IOMMU_RESV_SW_MSI is a unique region defined by an IOMMU driver. Though it
> > > is eventually used by a device for address translation to an MSI location
> > > (including nested cases), practically it is a universal region across all
> > > domains allocated for the IOMMU that defines it.
> > >
> > > Currently IOMMUFD core fetches and reserves the region during an attach to
> > > an hwpt_paging. It works with a hwpt_paging-only case, but might not work
> > > with a nested case where a device could directly attach to a hwpt_nested,
> > > bypassing the hwpt_paging attachment.
> >
> > Well, it does this because the attach is the only place where we have
> > *all* the devices available.
> >
> > Doing it during allocation means you get only one device.
> >
> > So, I'd imagine something more like we allocate the MSI region during
> > allocation, for the specific device given at allocation time.
> >
> > But continue to enforce that every attached device also has its MSI
> > region allocated, which probably just means checking that the
> > driver's reported MSI address is the same as the address used during
> > allocation?
> 
> The idea here is to treat IOMMU_RESV_SW_MSI as a per-IOMMU thing
> vs. a per-device thing, because it's defined by SW for an IOMMU
> that doesn't have a HW MSI window. In other words, devices don't
> really matter, so long as we know which IOMMU it is, and that can
> be decided by any "one device".
> 

The problem is that the entire reserved region interface in the IOMMU
core is per-device, leaving room for the possibility that, though
IOMMU_RESV_SW_MSI is per-IOMMU on existing platforms, there might be
a new implementation in the future with per-device differences.

Then we'll have trouble in the callers that assume a specific type
is per-IOMMU.

If we really want to go this route I wonder whether a per-IOMMU
get_resv_regions() API should exist first, so there is a clear
definition of which types are per-IOMMU and which are not.
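
Purely illustrative (no such callback exists today; the name and shape
are invented):

	/*
	 * A per-IOMMU-instance variant of get_resv_regions(), keyed by
	 * the iommu_device rather than a struct device, so callers can
	 * tell which region types are instance-wide vs. per-device.
	 */
	void (*get_iommu_resv_regions)(struct iommu_device *iommu,
				       struct list_head *head);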
Re: [PATCH] iommufd: Enforce IOMMU_RESV_SW_MSI upon hwpt_paging allocation
Posted by Nicolin Chen 1 year, 6 months ago
On Wed, Jul 31, 2024 at 07:50:53AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Tuesday, July 30, 2024 4:06 AM
> >
> > On Mon, Jul 29, 2024 at 03:24:46PM -0300, Jason Gunthorpe wrote:
> > > On Sun, Jul 28, 2024 at 04:51:06PM -0700, Nicolin Chen wrote:
> > > > IOMMU_RESV_SW_MSI is a unique region defined by an IOMMU driver. Though it
> > > > is eventually used by a device for address translation to an MSI location
> > > > (including nested cases), practically it is a universal region across all
> > > > domains allocated for the IOMMU that defines it.
> > > >
> > > > Currently IOMMUFD core fetches and reserves the region during an attach to
> > > > an hwpt_paging. It works with a hwpt_paging-only case, but might not work
> > > > with a nested case where a device could directly attach to a hwpt_nested,
> > > > bypassing the hwpt_paging attachment.
> > >
> > > Well, it does this because the attach is the only place where we have
> > > *all* the devices available.
> > >
> > > Doing it during allocation means you get only one device.
> > >
> > > So, I'd imagine something more like we allocate the MSI region during
> > > allocation, for the specific device given at allocation time.
> > >
> > > But continue to enforce that every attached device also has its MSI
> > > region allocated, which probably just means checking that the
> > > driver's reported MSI address is the same as the address used during
> > > allocation?
> >
> > The idea here is to treat IOMMU_RESV_SW_MSI as a per-IOMMU thing
> > vs. a per-device thing, because it's defined by SW for an IOMMU
> > that doesn't have a HW MSI window. In other words, devices don't
> > really matter, so long as we know which IOMMU it is, and that can
> > be decided by any "one device".
> >
> 
> The problem is that the entire reserved region interface in the IOMMU
> core is per-device, leaving room for the possibility that, though
> IOMMU_RESV_SW_MSI is per-IOMMU on existing platforms, there might be
> a new implementation in the future with per-device differences.
>
> Then we'll have trouble in the callers that assume a specific type
> is per-IOMMU.
 
Looking at the existing cases, we are fine. But yes, I agree it
would potentially break if an IOMMU defines an IOMMU_RESV_SW_MSI
for one device but not for another, i.e. one device has its MSI
behind the IOMMU while another one doesn't.

> If we really want to go this route I wonder whether a per-IOMMU
> get_resv_regions() API should exist first, so there is a clear
> definition of which types are per-IOMMU and which are not.

Probably it'd be necessary.

Thanks
Nic
Re: [PATCH] iommufd: Enforce IOMMU_RESV_SW_MSI upon hwpt_paging allocation
Posted by Jason Gunthorpe 1 year, 6 months ago
On Wed, Jul 31, 2024 at 11:23:24AM -0700, Nicolin Chen wrote:
> > If we really want to go this route I wonder whether a per-IOMMU
> > get_resv_regions() API should exist first, so there is a clear
> > definition of which types are per-IOMMU and which are not.
> 
> Probably it'd be necessary.

But domains can cross IOMMUs as well; it is not per-IOMMU that you
are after, but a hardwired per-IOMMU-driver type.

What was the issue with trying to fix this on the nesting attach side?

Jason
Re: [PATCH] iommufd: Enforce IOMMU_RESV_SW_MSI upon hwpt_paging allocation
Posted by Nicolin Chen 1 year, 6 months ago
On Thu, Aug 01, 2024 at 10:28:48AM -0300, Jason Gunthorpe wrote:
> On Wed, Jul 31, 2024 at 11:23:24AM -0700, Nicolin Chen wrote:
> > > If we really want to go this route I wonder whether a per-IOMMU
> > > get_resv_regions() API should exist first, so there is a clear
> > > definition of which types are per-IOMMU and which are not.
> > 
> > Probably it'd be necessary.
> 
> But domains can cross IOMMUs as well; it is not per-IOMMU that you
> are after, but a hardwired per-IOMMU-driver type.
> 
> What was the issue with trying to fix this on the nesting attach side?

No, it seems that we all agreed on enforcing the reserved IOVAs on the
parent HWPT, so we will just go with the nesting attach side.

Thanks
Nicolin