[PATCH v5 12/32] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback

Shameer Kolothum posted 32 patches 1 week, 6 days ago
Posted by Shameer Kolothum 1 week, 6 days ago
From: Nicolin Chen <nicolinc@nvidia.com>

Implement the VFIO/PCI callbacks to attach and detach a HostIOMMUDevice
to a vSMMUv3 when accel=on:

 - set_iommu_device(): attach a HostIOMMUDevice to a vIOMMU
 - unset_iommu_device(): detach and release associated resources

In SMMUv3 accel=on mode, the guest SMMUv3 is backed by the host SMMUv3 via
IOMMUFD. A vIOMMU object (created via IOMMU_VIOMMU_ALLOC) provides a per-VM,
security-isolated handle to the physical SMMUv3. Without a vIOMMU, the
vSMMUv3 cannot relay guest operations to the host hardware nor maintain
isolation across VMs or devices. Therefore, set_iommu_device() allocates
a vIOMMU object if one does not already exist.

There are two main points to consider in this implementation:

1) VFIO core allocates and attaches an S2 HWPT that acts as the nesting
   parent for nested HWPTs (IOMMU_DOMAIN_NESTED). This parent HWPT will
   be shared across multiple vSMMU instances within a VM.

2) A device cannot attach directly to a vIOMMU. Instead, it attaches
   through a proxy nested HWPT (IOMMU_DOMAIN_NESTED). Based on the STE
   configuration, there are three types of nested HWPTs: bypass, abort,
   and translate.
    - The bypass and abort proxy HWPTs are pre-allocated. When the SMMUv3
      operates in global abort or bypass mode, as controlled by the GBPA
      register, or the guest issues a vSTE for bypass or abort, we attach
      these pre-allocated nested HWPTs.
    - The translate HWPT requires a vDEVICE to be allocated first, since
      invalidations and events depend on a valid vSID.
    - The vDEVICE allocation and the actual attach operations for these
      proxy HWPTs are implemented in subsequent patches.

In summary, a device placed behind a vSMMU instance must have a vSID for
a translate vSTE. The bypass and abort vSTEs are pre-allocated as proxy
nested HWPTs and are attached based on the GBPA register. The core-managed
nesting parent S2 HWPT serves as the parent for all the nested HWPTs and
is intended to be shared across vSMMU instances within the same VM.

set_iommu_device():
  - Reuse an existing vIOMMU for the same physical SMMU if available.
    If not, allocate a new one using the nesting parent S2 HWPT.
  - Pre-allocate two proxy nested HWPTs (bypass and abort) under the
    vIOMMU.
  - Add the device to the vIOMMU’s device list.

unset_iommu_device():
  - Re-attach device to the nesting parent S2 HWPT.
  - Remove the device from the vIOMMU’s device list.
  - If the list is empty, free the proxy HWPTs (bypass and abort)
    and release the vIOMMU object.

Introduce SMMUv3AccelState, which holds a reference to an SMMUViommu
structure representing a virtual SMMU instance backed by an iommufd
vIOMMU object.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c    | 150 +++++++++++++++++++++++++++++++++++++++
 hw/arm/smmuv3-accel.h    |  22 ++++++
 hw/arm/smmuv3-internal.h |   5 ++
 hw/arm/trace-events      |   4 ++
 include/hw/arm/smmuv3.h  |   1 +
 5 files changed, 182 insertions(+)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index a1d672208f..d4d65299a8 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -8,6 +8,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
+#include "trace.h"
 
 #include "hw/arm/smmuv3.h"
 #include "hw/iommu.h"
@@ -15,6 +16,7 @@
 #include "hw/pci-host/gpex.h"
 #include "hw/vfio/pci.h"
 
+#include "smmuv3-internal.h"
 #include "smmuv3-accel.h"
 
 /*
@@ -44,6 +46,151 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
     return accel_dev;
 }
 
+static bool
+smmuv3_accel_dev_alloc_viommu(SMMUv3AccelDevice *accel_dev,
+                              HostIOMMUDeviceIOMMUFD *idev, Error **errp)
+{
+    struct iommu_hwpt_arm_smmuv3 bypass_data = {
+        .ste = { SMMU_STE_CFG_BYPASS | SMMU_STE_VALID, 0x0ULL },
+    };
+    struct iommu_hwpt_arm_smmuv3 abort_data = {
+        .ste = { SMMU_STE_VALID, 0x0ULL },
+    };
+    SMMUDevice *sdev = &accel_dev->sdev;
+    SMMUState *bs = sdev->smmu;
+    SMMUv3State *s = ARM_SMMUV3(bs);
+    SMMUv3AccelState *s_accel = s->s_accel;
+    uint32_t s2_hwpt_id = idev->hwpt_id;
+    SMMUViommu *vsmmu;
+    uint32_t viommu_id;
+
+    if (s_accel->vsmmu) {
+        accel_dev->vsmmu = s_accel->vsmmu;
+        return true;
+    }
+
+    if (!iommufd_backend_alloc_viommu(idev->iommufd, idev->devid,
+                                      IOMMU_VIOMMU_TYPE_ARM_SMMUV3,
+                                      s2_hwpt_id, &viommu_id, errp)) {
+        return false;
+    }
+
+    vsmmu = g_new0(SMMUViommu, 1);
+    vsmmu->viommu.viommu_id = viommu_id;
+    vsmmu->viommu.s2_hwpt_id = s2_hwpt_id;
+    vsmmu->viommu.iommufd = idev->iommufd;
+
+    /*
+     * Pre-allocate HWPTs for S1 bypass and abort cases. These will be attached
+     * later for guest STEs or GBPAs that require bypass or abort configuration.
+     */
+    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid, viommu_id,
+                                    0, IOMMU_HWPT_DATA_ARM_SMMUV3,
+                                    sizeof(abort_data), &abort_data,
+                                    &vsmmu->abort_hwpt_id, errp)) {
+        goto free_viommu;
+    }
+
+    if (!iommufd_backend_alloc_hwpt(idev->iommufd, idev->devid, viommu_id,
+                                    0, IOMMU_HWPT_DATA_ARM_SMMUV3,
+                                    sizeof(bypass_data), &bypass_data,
+                                    &vsmmu->bypass_hwpt_id, errp)) {
+        goto free_abort_hwpt;
+    }
+
+    vsmmu->iommufd = idev->iommufd;
+    s_accel->vsmmu = vsmmu;
+    accel_dev->vsmmu = vsmmu;
+    return true;
+
+free_abort_hwpt:
+    iommufd_backend_free_id(idev->iommufd, vsmmu->abort_hwpt_id);
+free_viommu:
+    iommufd_backend_free_id(idev->iommufd, vsmmu->viommu.viommu_id);
+    g_free(vsmmu);
+    return false;
+}
+
+static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
+                                          HostIOMMUDevice *hiod, Error **errp)
+{
+    HostIOMMUDeviceIOMMUFD *idev = HOST_IOMMU_DEVICE_IOMMUFD(hiod);
+    SMMUState *bs = opaque;
+    SMMUv3State *s = ARM_SMMUV3(bs);
+    SMMUv3AccelState *s_accel = s->s_accel;
+    SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
+    SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
+    SMMUDevice *sdev = &accel_dev->sdev;
+    uint16_t sid = smmu_get_sid(sdev);
+
+    if (!idev) {
+        return true;
+    }
+
+    if (accel_dev->idev) {
+        if (accel_dev->idev != idev) {
+            error_setg(errp, "Device 0x%x already has an associated IOMMU dev",
+                       sid);
+            return false;
+        }
+        return true;
+    }
+
+    if (!smmuv3_accel_dev_alloc_viommu(accel_dev, idev, errp)) {
+        error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
+        return false;
+    }
+
+    accel_dev->idev = idev;
+    QLIST_INSERT_HEAD(&s_accel->vsmmu->device_list, accel_dev, next);
+    trace_smmuv3_accel_set_iommu_device(devfn, idev->devid);
+    return true;
+}
+
+static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
+                                            int devfn)
+{
+    SMMUState *bs = opaque;
+    SMMUv3State *s = ARM_SMMUV3(bs);
+    SMMUPciBus *sbus = g_hash_table_lookup(bs->smmu_pcibus_by_busptr, bus);
+    SMMUv3AccelDevice *accel_dev;
+    SMMUViommu *vsmmu;
+    SMMUDevice *sdev;
+    uint16_t sid;
+
+    if (!sbus) {
+        return;
+    }
+
+    sdev = sbus->pbdev[devfn];
+    if (!sdev) {
+        return;
+    }
+
+    sid = smmu_get_sid(sdev);
+    accel_dev = container_of(sdev, SMMUv3AccelDevice, sdev);
+    /* Re-attach the default s2 hwpt id */
+    if (!host_iommu_device_iommufd_attach_hwpt(accel_dev->idev,
+                                               accel_dev->idev->hwpt_id,
+                                               NULL)) {
+        error_report("Unable to attach dev 0x%x to the default HW pagetable",
+                     sid);
+    }
+
+    accel_dev->idev = NULL;
+    QLIST_REMOVE(accel_dev, next);
+    trace_smmuv3_accel_unset_iommu_device(devfn, sid);
+
+    vsmmu = s->s_accel->vsmmu;
+    if (QLIST_EMPTY(&vsmmu->device_list)) {
+        iommufd_backend_free_id(vsmmu->iommufd, vsmmu->bypass_hwpt_id);
+        iommufd_backend_free_id(vsmmu->iommufd, vsmmu->abort_hwpt_id);
+        iommufd_backend_free_id(vsmmu->iommufd, vsmmu->viommu.viommu_id);
+        g_free(vsmmu);
+        s->s_accel->vsmmu = NULL;
+    }
+}
+
 static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
 {
 
@@ -135,6 +282,8 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
     .supports_address_space = smmuv3_accel_supports_as,
     .get_address_space = smmuv3_accel_find_add_as,
     .get_viommu_flags = smmuv3_accel_get_viommu_flags,
+    .set_iommu_device = smmuv3_accel_set_iommu_device,
+    .unset_iommu_device = smmuv3_accel_unset_iommu_device,
 };
 
 static void smmuv3_accel_as_init(SMMUv3State *s)
@@ -160,4 +309,5 @@ void smmuv3_accel_init(SMMUv3State *s)
 
     bs->iommu_ops = &smmuv3_accel_ops;
     smmuv3_accel_as_init(s);
+    s->s_accel = g_new0(SMMUv3AccelState, 1);
 }
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index 0dc6b00d35..d81f90c32c 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -10,12 +10,34 @@
 #define HW_ARM_SMMUV3_ACCEL_H
 
 #include "hw/arm/smmu-common.h"
+#include "system/iommufd.h"
+#include <linux/iommufd.h>
 #include CONFIG_DEVICES
 
+/*
+ * Represents a virtual SMMU instance backed by an iommufd vIOMMU object.
+ * Holds references to the core iommufd vIOMMU object and to proxy HWPTs
+ * (bypass and abort) used for device attachment.
+ */
+typedef struct SMMUViommu {
+    IOMMUFDBackend *iommufd;
+    IOMMUFDViommu viommu;
+    uint32_t bypass_hwpt_id;
+    uint32_t abort_hwpt_id;
+    QLIST_HEAD(, SMMUv3AccelDevice) device_list;
+} SMMUViommu;
+
 typedef struct SMMUv3AccelDevice {
     SMMUDevice sdev;
+    HostIOMMUDeviceIOMMUFD *idev;
+    SMMUViommu *vsmmu;
+    QLIST_ENTRY(SMMUv3AccelDevice) next;
 } SMMUv3AccelDevice;
 
+typedef struct SMMUv3AccelState {
+    SMMUViommu *vsmmu;
+} SMMUv3AccelState;
+
 #ifdef CONFIG_ARM_SMMUV3_ACCEL
 void smmuv3_accel_init(SMMUv3State *s);
 #else
diff --git a/hw/arm/smmuv3-internal.h b/hw/arm/smmuv3-internal.h
index b6b7399347..03d86cfc5c 100644
--- a/hw/arm/smmuv3-internal.h
+++ b/hw/arm/smmuv3-internal.h
@@ -583,6 +583,11 @@ typedef struct CD {
     ((extract64((x)->word[7], 0, 16) << 32) |           \
      ((x)->word[6] & 0xfffffff0))
 
+#define SMMU_STE_VALID      (1ULL << 0)
+#define SMMU_STE_CFG_BYPASS (1ULL << 3)
+
+#define SMMU_GBPA_ABORT (1UL << 20)
+
 static inline int oas2bits(int oas_field)
 {
     switch (oas_field) {
diff --git a/hw/arm/trace-events b/hw/arm/trace-events
index f3386bd7ae..49c0460f30 100644
--- a/hw/arm/trace-events
+++ b/hw/arm/trace-events
@@ -66,6 +66,10 @@ smmuv3_notify_flag_del(const char *iommu) "DEL SMMUNotifier node for iommu mr=%s
 smmuv3_inv_notifiers_iova(const char *name, int asid, int vmid, uint64_t iova, uint8_t tg, uint64_t num_pages, int stage) "iommu mr=%s asid=%d vmid=%d iova=0x%"PRIx64" tg=%d num_pages=0x%"PRIx64" stage=%d"
 smmu_reset_exit(void) ""
 
+# smmuv3-accel.c
+smmuv3_accel_set_iommu_device(int devfn, uint32_t devid) "devfn=0x%x (idev devid=0x%x)"
+smmuv3_accel_unset_iommu_device(int devfn, uint32_t sid) "devfn=0x%x sid=0x%x"
+
 # strongarm.c
 strongarm_uart_update_parameters(const char *label, int speed, char parity, int data_bits, int stop_bits) "%s speed=%d parity=%c data=%d stop=%d"
 strongarm_ssp_read_underrun(void) "SSP rx underrun"
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index bb7076286b..e54ece2d38 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -66,6 +66,7 @@ struct SMMUv3State {
 
     /* SMMU has HW accelerator support for nested S1 + s2 */
     bool accel;
+    struct SMMUv3AccelState *s_accel;
 };
 
 typedef enum {
-- 
2.43.0


Re: [PATCH v5 12/32] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
Posted by Nicolin Chen 1 week, 6 days ago
On Fri, Oct 31, 2025 at 10:49:45AM +0000, Shameer Kolothum wrote:
> +static bool
> +smmuv3_accel_dev_alloc_viommu(SMMUv3AccelDevice *accel_dev,
> +                              HostIOMMUDeviceIOMMUFD *idev, Error **errp)

Let's make it simply do alloc() on s_accel:

static bool smmuv3_accel_alloc_viommu(SMMUv3AccelState *s_accel,
                                      HostIOMMUDeviceIOMMUFD *idev,
                                      Error **errp)

Then..

> +    SMMUDevice *sdev = &accel_dev->sdev;
> +    SMMUState *bs = sdev->smmu;
> +    SMMUv3State *s = ARM_SMMUV3(bs);
> +    SMMUv3AccelState *s_accel = s->s_accel;

Drop these.

> +    if (s_accel->vsmmu) {
> +        accel_dev->vsmmu = s_accel->vsmmu;
> +        return true;
> +    }

And this.

> +static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> +                                          HostIOMMUDevice *hiod, Error **errp)
[...]
> +    if (!smmuv3_accel_dev_alloc_viommu(accel_dev, idev, errp)) {
> +        error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
> +        return false;
> +    }

And here:

    if (!s_accel->vsmmu && !smmuv3_accel_alloc_viommu(s_accel, idev, errp)) {
        error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
        return false;
    }

    accel_dev->idev = idev;
    accel_dev->vsmmu = s_accel->vsmmu;

Feels slightly cleaner.

> +/*
> + * Represents a virtual SMMU instance backed by an iommufd vIOMMU object.
> + * Holds references to the core iommufd vIOMMU object and to proxy HWPTs

I read "reference" as a pointer, yet...

> + * (bypass and abort) used for device attachment.
> + */
> +typedef struct SMMUViommu {
> +    IOMMUFDBackend *iommufd;
> +    IOMMUFDViommu viommu;
> +    uint32_t bypass_hwpt_id;
> +    uint32_t abort_hwpt_id;

...viommu is a containment and two HWPTs are IDs.

So, it'd sound more accurate, being:

/*
 * Represents a virtual SMMU instance backed by an iommufd vIOMMU object.
 * Holds bypass and abort proxy HWPT ids used for device attachment.
 */

> +typedef struct SMMUv3AccelState {
> +    SMMUViommu *vsmmu;
> +} SMMUv3AccelState;

Hmm, maybe we don't need another layer of structure. Every access
or validation to s_accel is for s_accel->vsmmu.

e.g.
    if (!s_accel || !s_accel->vsmmu) {
could be
    if (!s_accel) {

So, let's try merging them into one. And feel free to leave one
of your favorite.

Nic
RE: [PATCH v5 12/32] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
Posted by Shameer Kolothum 1 week, 3 days ago

> On Fri, Oct 31, 2025 at 10:49:45AM +0000, Shameer Kolothum wrote:
> > +static bool
> > +smmuv3_accel_dev_alloc_viommu(SMMUv3AccelDevice *accel_dev,
> > +                              HostIOMMUDeviceIOMMUFD *idev, Error
> > +**errp)
> 
> Let's make it simply do alloc() on s_accel:
> 
> static bool smmuv3_accel_alloc_viommu(SMMUv3AccelState *s_accel,
>                                       HostIOMMUDeviceIOMMUFD *idev,
>                                       Error **errp)
> 
> Then..
> 
> > +    SMMUDevice *sdev = &accel_dev->sdev;
> > +    SMMUState *bs = sdev->smmu;
> > +    SMMUv3State *s = ARM_SMMUV3(bs);
> > +    SMMUv3AccelState *s_accel = s->s_accel;
> 
> Drop these.
> 
> > +    if (s_accel->vsmmu) {
> > +        accel_dev->vsmmu = s_accel->vsmmu;
> > +        return true;
> > +    }
> 
> And this.
> 
> > +static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque,
> int devfn,
> > +                                          HostIOMMUDevice *hiod,
> > +Error **errp)
> [...]
> > +    if (!smmuv3_accel_dev_alloc_viommu(accel_dev, idev, errp)) {
> > +        error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
> > +        return false;
> > +    }
> 
> And here:
> 
>     if (!s_accel->vsmmu && !smmuv3_accel_alloc_viommu(s_accel, idev, errp))
> {
>         error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
>         return false;
>     }
> 
>     accel_dev->idev = idev;
>     accel_dev->vsmmu = s_accel->vsmmu;
> 
> Feels slightly cleaner.
> 
> > +/*
> > + * Represents a virtual SMMU instance backed by an iommufd vIOMMU
> object.
> > + * Holds references to the core iommufd vIOMMU object and to proxy
> > +HWPTs
> 
> I read "reference" as a pointer, yet...
> 
> > + * (bypass and abort) used for device attachment.
> > + */
> > +typedef struct SMMUViommu {
> > +    IOMMUFDBackend *iommufd;
> > +    IOMMUFDViommu viommu;
> > +    uint32_t bypass_hwpt_id;
> > +    uint32_t abort_hwpt_id;
> 
> ...viommu is a containment and two HWPTs are IDs.
> 
> So, it'd sound more accurate, being:
> 
> /*
>  * Represents a virtual SMMU instance backed by an iommufd vIOMMU
> object.
>  * Holds bypass and abort proxy HWPT ids used for device attachment.
>  */
> 
> > +typedef struct SMMUv3AccelState {
> > +    SMMUViommu *vsmmu;
> > +} SMMUv3AccelState;
> 
> Hmm, maybe we don't need another layer of structure. Every access or
> validation to s_accel is for s_accel->vsmmu.
> 
> e.g.
>     if (!s_accel || !s_accel->vsmmu) {
> could be
>     if (!s_accel) {
> 
> So, let's try merging them into one. And feel free to leave one of your favorite.

Looks sensible to me. I will take a look during next revision.

Thanks,
Shameer
Re: [PATCH v5 12/32] hw/arm/smmuv3-accel: Add set/unset_iommu_device callback
Posted by Nicolin Chen 1 week, 6 days ago
On Fri, Oct 31, 2025 at 03:02:08PM -0700, Nicolin Chen wrote:
> On Fri, Oct 31, 2025 at 10:49:45AM +0000, Shameer Kolothum wrote:
> > +static bool smmuv3_accel_set_iommu_device(PCIBus *bus, void *opaque, int devfn,
> > +                                          HostIOMMUDevice *hiod, Error **errp)
> [...]
> > +    if (!smmuv3_accel_dev_alloc_viommu(accel_dev, idev, errp)) {
> > +        error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
> > +        return false;
> > +    }
> 
> And here:
> 
>     if (!s_accel->vsmmu && !smmuv3_accel_alloc_viommu(s_accel, idev, errp)) {
>         error_append_hint(errp, "Device 0x%x: Unable to alloc viommu", sid);
>         return false;
>     }
> 
>     accel_dev->idev = idev;
>     accel_dev->vsmmu = s_accel->vsmmu;
> 
> Feels slightly cleaner.

Also, because we set accel_dev->vsmmu under the hood, we missed
"accel_dev->vsmmu = NULL" in the revert path and unset().
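Something along these lines in unset(), alongside clearing idev
(untested, against the v5 code):

```
@@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
     accel_dev->idev = NULL;
+    accel_dev->vsmmu = NULL;
     QLIST_REMOVE(accel_dev, next);
```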