[RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd

Shameer Kolothum posted 15 patches 4 months ago
There is a newer version of this series
[RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 4 months ago
Accelerated SMMUv3 is only useful when the device can take advantage of
the host's SMMUv3 in nested mode. To keep things simple and correct, we
only allow this feature for vfio-pci endpoint devices that use the iommufd
backend. We also allow non-endpoint emulated devices like PCI bridges and
root ports, so that users can plug in these vfio-pci devices.

Another reason for this limit is to avoid problems with IOTLB
invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
SID, making it difficult to trace the originating device. If we allowed
emulated endpoint devices, QEMU would have to invalidate both its own
software IOTLB and the host's hardware IOTLB, which could slow things
down.

Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
translation (S1+S2), their get_address_space() callback must return the
system address space to enable correct S2 mappings of guest RAM.

So in short:
 - vfio-pci devices return the system address space
 - bridges and root ports return the IOMMU address space

Note: On ARM, MSI doorbell addresses are also translated via SMMUv3.
Hence, if a vfio-pci device is behind the SMMUv3 with translation enabled,
it must return the IOMMU address space for MSI. Support for this will be
added in a follow-up patch.

Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
---
 hw/arm/smmuv3-accel.c               | 50 ++++++++++++++++++++++++++++-
 hw/arm/smmuv3-accel.h               | 15 +++++++++
 hw/arm/smmuv3.c                     |  4 +++
 hw/pci-bridge/pci_expander_bridge.c |  1 -
 include/hw/arm/smmuv3.h             |  1 +
 include/hw/pci/pci_bridge.h         |  1 +
 6 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 2eac9c6ff4..0b0ddb03e2 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -7,13 +7,19 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/error-report.h"
 
 #include "hw/arm/smmuv3.h"
+#include "hw/pci/pci_bridge.h"
+#include "hw/pci-host/gpex.h"
+#include "hw/vfio/pci.h"
+
 #include "smmuv3-accel.h"
 
 static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
                                                 PCIBus *bus, int devfn)
 {
+    SMMUv3State *s = ARM_SMMUV3(bs);
     SMMUDevice *sdev = sbus->pbdev[devfn];
     SMMUv3AccelDevice *accel_dev;
 
@@ -25,30 +31,72 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
 
         sbus->pbdev[devfn] = sdev;
         smmu_init_sdev(bs, sdev, bus, devfn);
+        address_space_init(&accel_dev->as_sysmem, &s->s_accel->root,
+                           "smmuv3-accel-sysmem");
     }
 
     return accel_dev;
 }
 
+static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
+{
+
+    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
+        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
+        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
+        return true;
+    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
+        object_property_find(OBJECT(pdev), "iommufd"))) {
+        *vfio_pci = true;
+        return true;
+    }
+    return false;
+}
+
 static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
                                               int devfn)
 {
+    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
     SMMUState *bs = opaque;
+    bool vfio_pci = false;
     SMMUPciBus *sbus;
     SMMUv3AccelDevice *accel_dev;
     SMMUDevice *sdev;
 
+    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
+        error_report("Device(%s) not allowed. Only PCIe root complex devices "
+                     "or PCI bridge devices or vfio-pci endpoint devices with "
+                     "iommufd as backend is allowed with arm-smmuv3,accel=on",
+                     pdev->name);
+        exit(1);
+    }
     sbus = smmu_get_sbus(bs, bus);
     accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
     sdev = &accel_dev->sdev;
 
-    return &sdev->as;
+    if (vfio_pci) {
+        return &accel_dev->as_sysmem;
+    } else {
+        return &sdev->as;
+    }
 }
 
 static const PCIIOMMUOps smmuv3_accel_ops = {
     .get_address_space = smmuv3_accel_find_add_as,
 };
 
+void smmuv3_accel_init(SMMUv3State *s)
+{
+    SMMUv3AccelState *s_accel;
+
+    s->s_accel = s_accel = g_new0(SMMUv3AccelState, 1);
+    memory_region_init(&s_accel->root, OBJECT(s), "root", UINT64_MAX);
+    memory_region_init_alias(&s_accel->sysmem, OBJECT(s),
+                             "smmuv3-accel-sysmem", get_system_memory(), 0,
+                             memory_region_size(get_system_memory()));
+    memory_region_add_subregion(&s_accel->root, 0, &s_accel->sysmem);
+}
+
 static void smmuv3_accel_class_init(ObjectClass *oc, const void *data)
 {
     SMMUBaseClass *sbc = ARM_SMMU_CLASS(oc);
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index 4cf30b1291..2cd343103f 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -9,11 +9,26 @@
 #ifndef HW_ARM_SMMUV3_ACCEL_H
 #define HW_ARM_SMMUV3_ACCEL_H
 
+#include "hw/arm/smmuv3.h"
 #include "hw/arm/smmu-common.h"
 #include CONFIG_DEVICES
 
 typedef struct SMMUv3AccelDevice {
     SMMUDevice  sdev;
+    AddressSpace as_sysmem;
 } SMMUv3AccelDevice;
 
+typedef struct SMMUv3AccelState {
+    MemoryRegion root;
+    MemoryRegion sysmem;
+} SMMUv3AccelState;
+
+#if defined(CONFIG_ARM_SMMUV3) && defined(CONFIG_IOMMUFD)
+void smmuv3_accel_init(SMMUv3State *s);
+#else
+static inline void smmuv3_accel_init(SMMUv3State *d)
+{
+}
+#endif
+
 #endif /* HW_ARM_SMMUV3_ACCEL_H */
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index bcf8af8dc7..2f5a8157dd 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -32,6 +32,7 @@
 #include "qapi/error.h"
 
 #include "hw/arm/smmuv3.h"
+#include "smmuv3-accel.h"
 #include "smmuv3-internal.h"
 #include "smmu-internal.h"
 
@@ -1898,6 +1899,9 @@ static void smmu_realize(DeviceState *d, Error **errp)
     sysbus_init_mmio(dev, &sys->iomem);
 
     smmu_init_irq(s, dev);
+    if (sys->accel) {
+        smmuv3_accel_init(s);
+    }
 }
 
 static const VMStateDescription vmstate_smmuv3_queue = {
diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
index 1bcceddbc4..a8eb2d2426 100644
--- a/hw/pci-bridge/pci_expander_bridge.c
+++ b/hw/pci-bridge/pci_expander_bridge.c
@@ -48,7 +48,6 @@ struct PXBBus {
     char bus_path[8];
 };
 
-#define TYPE_PXB_PCIE_DEV "pxb-pcie"
 OBJECT_DECLARE_SIMPLE_TYPE(PXBPCIEDev, PXB_PCIE_DEV)
 
 static GList *pxb_dev_list;
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index d183a62766..3bdb92391a 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -63,6 +63,7 @@ struct SMMUv3State {
     qemu_irq     irq[4];
     QemuMutex mutex;
     char *stage;
+    struct SMMUv3AccelState  *s_accel;
 };
 
 typedef enum {
diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
index a055fd8d32..b61360b900 100644
--- a/include/hw/pci/pci_bridge.h
+++ b/include/hw/pci/pci_bridge.h
@@ -106,6 +106,7 @@ typedef struct PXBPCIEDev {
 
 #define TYPE_PXB_PCIE_BUS "pxb-pcie-bus"
 #define TYPE_PXB_CXL_BUS "pxb-cxl-bus"
+#define TYPE_PXB_PCIE_DEV "pxb-pcie"
 #define TYPE_PXB_DEV "pxb"
 OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV)
 
-- 
2.34.1
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Eric Auger 2 months, 1 week ago

On 7/14/25 5:59 PM, Shameer Kolothum wrote:
> Accelerated SMMUv3 is only useful when the device can take advantage of
> the host's SMMUv3 in nested mode. To keep things simple and correct, we
> only allow this feature for vfio-pci endpoint devices that use the iommufd
> backend. We also allow non-endpoint emulated devices like PCI bridges and
> root ports, so that users can plug in these vfio-pci devices.
>
> Another reason for this limit is to avoid problems with IOTLB
> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
> SID, making it difficult to trace the originating device. If we allowed
> emulated endpoint devices, QEMU would have to invalidate both its own
> software IOTLB and the host's hardware IOTLB, which could slow things
> down.
>
> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> translation (S1+S2), their get_address_space() callback must return the
> system address space to enable correct S2 mappings of guest RAM.
>
> So in short:
>  - vfio-pci devices return the system address space
>  - bridges and root ports return the IOMMU address space
>
> Note: On ARM, MSI doorbell addresses are also translated via SMMUv3.
> Hence, if a vfio-pci device is behind the SMMUv3 with translation enabled,
> it must return the IOMMU address space for MSI. Support for this will be
> added in a follow-up patch.
It sounds antithetical to what is said above:

"vfio-pci devices return the system address space"

Eric

RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 2 months, 1 week ago

> On 7/14/25 5:59 PM, Shameer Kolothum wrote:
> > [...]
> >
> > So in short:
> >  - vfio-pci devices return the system address space
> >  - bridges and root ports return the IOMMU address space
> >
> > Note: On ARM, MSI doorbell addresses are also translated via SMMUv3.
> > Hence, if a vfio-pci device is behind the SMMUv3 with translation
> > enabled, it must return the IOMMU address space for MSI. Support for
> > this will be added in a follow-up patch.
> It sounds antithetical to what is said above:
> 
> "vfio-pci devices return the system address space"

This is not related to this patch per se. I only added the note to
highlight that the address space required for MSI translation in the
nested case is addressed later in this series (patch #11). I can remove
this note if it is not helping and is causing confusion 😊.

Thanks,
Shameer


Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Nicolin Chen 3 months, 1 week ago
Hi Shameer,

On Mon, Jul 14, 2025 at 04:59:32PM +0100, Shameer Kolothum wrote:
> @@ -25,30 +31,72 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
>  
>          sbus->pbdev[devfn] = sdev;
>          smmu_init_sdev(bs, sdev, bus, devfn);
> +        address_space_init(&accel_dev->as_sysmem, &s->s_accel->root,
> +                           "smmuv3-accel-sysmem");
>      }
>  
>      return accel_dev;
>  }
[..]
>  static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
>                                                int devfn)
>  {
> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>      SMMUState *bs = opaque;
> +    bool vfio_pci = false;
>      SMMUPciBus *sbus;
>      SMMUv3AccelDevice *accel_dev;
>      SMMUDevice *sdev;
>  
> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> +        error_report("Device(%s) not allowed. Only PCIe root complex devices "
> +                     "or PCI bridge devices or vfio-pci endpoint devices with "
> +                     "iommufd as backend is allowed with arm-smmuv3,accel=on",
> +                     pdev->name);
> +        exit(1);
> +    }
>      sbus = smmu_get_sbus(bs, bus);
>      accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
>      sdev = &accel_dev->sdev;
>  
> -    return &sdev->as;
> +    if (vfio_pci) {
> +        return &accel_dev->as_sysmem;

I found a new problem here: we initialize a new as_sysmem per
VFIO device. So sdevs return their own individual AS pointers
from this get_address_space function, although they all point to
the same system address space.

Since the address space pointers are returned differently for VFIO
devices, this fails to reuse the ioas_id in iommufd_cdev_attach(),
and ends up allocating a new ioas for each device.
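
(A simplified sketch of why the exact pointer matters. This paraphrases the
container lookup on the VFIO side; it is not the verbatim upstream code:)

static VFIOAddressSpace *vfio_get_address_space(AddressSpace *as)
{
    VFIOAddressSpace *space;

    /*
     * VFIO keys its containers by the AddressSpace pointer returned from
     * the IOMMU's get_address_space() callback. Only an exact pointer
     * match lets iommufd_cdev_attach() reuse an existing ioas_id.
     */
    QLIST_FOREACH(space, &vfio_address_spaces, list) {
        if (space->as == as) {
            return space;                 /* same pointer: shared ioas */
        }
    }

    space = g_new0(VFIOAddressSpace, 1);  /* new pointer: new ioas */
    space->as = as;
    QLIST_INSERT_HEAD(&vfio_address_spaces, space, list);
    return space;
}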

I think we can try the following change to make sure all accel-linked
VFIO devices share the same ioas_id, though I am not sure if
SMMUBaseClass is the right place to go. Please take a look.

Then, once the kernel is patched to share the S2 hwpt across vSMMUs,
iommufd_cdev_attach() will go further and reuse the S2 HWPT in
the same container.

Thanks
Nicolin

---
 hw/arm/smmuv3-accel.c        | 14 ++++++++++----
 hw/arm/smmuv3-accel.h        |  2 +-
 include/hw/arm/smmu-common.h |  2 ++
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index c6dc50cf45..405b72095f 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -370,13 +370,19 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
     if (sdev) {
         accel_dev = container_of(sdev, SMMUv3AccelDevice, sdev);
     } else {
+        SMMUBaseClass *sbc = ARM_SMMU_GET_CLASS(s);
+
         accel_dev = g_new0(SMMUv3AccelDevice, 1);
         sdev = &accel_dev->sdev;
 
         sbus->pbdev[devfn] = sdev;
         smmu_init_sdev(bs, sdev, bus, devfn);
-        address_space_init(&accel_dev->as_sysmem, &s->s_accel->root,
-                           "smmuv3-accel-sysmem");
+        if (!sbc->as_sysmem) {
+            sbc->as_sysmem = g_new0(AddressSpace, 1);
+            address_space_init(sbc->as_sysmem, &s->s_accel->root,
+                               "smmuv3-accel-sysmem");
+        }
+        accel_dev->as_sysmem = sbc->as_sysmem;
     }
 
     return accel_dev;
@@ -558,7 +564,7 @@ static AddressSpace *smmuv3_accel_find_msi_as(PCIBus *bus, void *opaque,
     if (accel_dev->s1_hwpt) {
         return &sdev->as;
     } else {
-        return &accel_dev->as_sysmem;
+        return accel_dev->as_sysmem;
     }
 }
 
@@ -599,7 +605,7 @@ static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
     sdev = &accel_dev->sdev;
 
     if (vfio_pci) {
-        return &accel_dev->as_sysmem;
+        return accel_dev->as_sysmem;
     } else {
         return &sdev->as;
     }
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index e1e99598b4..9faa0818d7 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -37,7 +37,7 @@ typedef struct SMMUS1Hwpt {
 
 typedef struct SMMUv3AccelDevice {
     SMMUDevice  sdev;
-    AddressSpace as_sysmem;
+    AddressSpace *as_sysmem;
     HostIOMMUDeviceIOMMUFD *idev;
     SMMUS1Hwpt  *s1_hwpt;
     SMMUViommu *viommu;
diff --git a/include/hw/arm/smmu-common.h b/include/hw/arm/smmu-common.h
index c459d24427..9bb9554435 100644
--- a/include/hw/arm/smmu-common.h
+++ b/include/hw/arm/smmu-common.h
@@ -168,6 +168,8 @@ struct SMMUState {
 struct SMMUBaseClass {
     /* <private> */
     SysBusDeviceClass parent_class;
+    /* FIXME is there any better shared place for different vSMMU instances? */
+    AddressSpace *as_sysmem;
 
     /*< public >*/
 
--
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 2 months ago
Hi Nicolin,

On Wed, 6 Aug 2025 at 01:55, Nicolin Chen <nicolinc@nvidia.com> wrote:
>
> Hi Shameer,
>
> [...]
>
> I found a new problem here that we initialize new as_sysmem per
> VFIO device. So, sdevs would return own individual AS pointers
> here at this get_address_space function, although they point to
> the same system address space.
>
> Since address space pointers are returned differently for VFIO
> devices, this fails to reuse ioas_id in iommufd_cdev_attach(),
> and ends up with allocating a new ioas for each device.
>
> I think we can try the following change to make sure all accel
> linked VFIO devices would share the same ioas_id, though I am
> not sure if SMMUBaseClass is the right place to go. Please take
> a look.

Ok. I think it is better to move that to SMMUv3AccelState and call
address_space_init() in smmuv3_accel_init() instead. Something like
below. Please take a look and let me know.

Thanks,
Shameer

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 3b2f45bd88..e7feae931d 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -339,7 +339,6 @@ void smmuv3_accel_batch_cmd(SMMUState *bs, SMMUDevice *sdev,
 static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
                                                 PCIBus *bus, int devfn)
 {
-    SMMUv3State *s = ARM_SMMUV3(bs);
     SMMUDevice *sdev = sbus->pbdev[devfn];
     SMMUv3AccelDevice *accel_dev;

@@ -351,8 +350,6 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,

         sbus->pbdev[devfn] = sdev;
         smmu_init_sdev(bs, sdev, bus, devfn);
-        address_space_init(&accel_dev->as_sysmem, &s->s_accel->root,
-                           "smmuv3-accel-sysmem");
     }

     return accel_dev;
@@ -518,6 +515,7 @@ static AddressSpace *smmuv3_accel_find_msi_as(PCIBus *bus, void *opaque,
                                                   int devfn)
 {
     SMMUState *bs = opaque;
+    SMMUv3State *s = ARM_SMMUV3(bs);
     SMMUPciBus *sbus;
     SMMUv3AccelDevice *accel_dev;
     SMMUDevice *sdev;
@@ -534,7 +532,7 @@ static AddressSpace *smmuv3_accel_find_msi_as(PCIBus *bus, void *opaque,
     if (accel_dev->s1_hwpt) {
         return &sdev->as;
     } else {
-        return &accel_dev->as_sysmem;
+        return &s->s_accel->as_sysmem;
     }
 }

@@ -558,6 +556,7 @@ static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
 {
     PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
     SMMUState *bs = opaque;
+    SMMUv3State *s = ARM_SMMUV3(bs);
     bool vfio_pci = false;
     SMMUPciBus *sbus;
     SMMUv3AccelDevice *accel_dev;
@@ -575,7 +574,7 @@ static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
     sdev = &accel_dev->sdev;

     if (vfio_pci) {
-        return &accel_dev->as_sysmem;
+        return &s->s_accel->as_sysmem;
     } else {
         return &sdev->as;
     }
@@ -612,6 +611,8 @@ void smmuv3_accel_init(SMMUv3State *s)
                              "smmuv3-accel-sysmem", get_system_memory(), 0,
                              memory_region_size(get_system_memory()));
     memory_region_add_subregion(&s_accel->root, 0, &s_accel->sysmem);
+    address_space_init(&s_accel->as_sysmem, &s_accel->root,
+                       "smmuv3-accel-sysmem");
 }

 static void smmuv3_accel_class_init(ObjectClass *oc, const void *data)
diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
index e1e99598b4..8f04f2fa5d 100644
--- a/hw/arm/smmuv3-accel.h
+++ b/hw/arm/smmuv3-accel.h
@@ -37,7 +37,6 @@ typedef struct SMMUS1Hwpt {

 typedef struct SMMUv3AccelDevice {
     SMMUDevice  sdev;
-    AddressSpace as_sysmem;
     HostIOMMUDeviceIOMMUFD *idev;
     SMMUS1Hwpt  *s1_hwpt;
     SMMUViommu *viommu;
@@ -48,6 +47,7 @@ typedef struct SMMUv3AccelDevice {
 typedef struct SMMUv3AccelState {
     MemoryRegion root;
     MemoryRegion sysmem;
+    AddressSpace as_sysmem;
     SMMUViommu *viommu;
     struct iommu_hw_info_arm_smmuv3 info;
 } SMMUv3AccelState;
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Nicolin Chen 1 month, 4 weeks ago
On Tue, Sep 16, 2025 at 11:33:21AM +0100, Shameer Kolothum wrote:
> > I found a new problem here that we initialize new as_sysmem per
> > VFIO device. So, sdevs would return own individual AS pointers
> > here at this get_address_space function, although they point to
> > the same system address space.
> >
> > Since address space pointers are returned differently for VFIO
> > devices, this fails to reuse ioas_id in iommufd_cdev_attach(),
> > and ends up with allocating a new ioas for each device.
> >
> > I think we can try the following change to make sure all accel
> > linked VFIO devices would share the same ioas_id, though I am
> > not sure if SMMUBaseClass is the right place to go. Please take
> > a look.
> 
> Ok. I think it is better to move that to SMMUv3AccelState and call
> address_space_init() in smmuv3_accel_init() instead. Something like
> below. Please take a look and let me know.

[...]

> @@ -37,7 +37,6 @@ typedef struct SMMUS1Hwpt {
> 
>  typedef struct SMMUv3AccelDevice {
>      SMMUDevice  sdev;
> -    AddressSpace as_sysmem;
>      HostIOMMUDeviceIOMMUFD *idev;
>      SMMUS1Hwpt  *s1_hwpt;
>      SMMUViommu *viommu;
> @@ -48,6 +47,7 @@ typedef struct SMMUv3AccelDevice {
>  typedef struct SMMUv3AccelState {
>      MemoryRegion root;
>      MemoryRegion sysmem;
> +    AddressSpace as_sysmem;
>      SMMUViommu *viommu;
>      struct iommu_hw_info_arm_smmuv3 info;
>  } SMMUv3AccelState;

That's changing from an ioas_id per VFIO device to an ioas_id per
vSMMU instance, right? I think it's still not enough.

All vSMMU instances could share the same ioas_id. That is why I
put it in SMMUBaseClass, as it's a shared structure across vSMMUs.
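
(Illustration of the sharing goal: two accelerated vSMMUs in one VM whose
vfio-pci devices should all end up on a single IOAS. Indicative command
line only; the exact device properties may differ from this series:)

-object iommufd,id=iommufd0 \
-device pxb-pcie,id=pcie.1,bus_nr=8 \
-device arm-smmuv3,primary-bus=pcie.1,accel=on \
-device pcie-root-port,id=rp1,bus=pcie.1 \
-device vfio-pci,host=0000:7d:00.1,bus=rp1,iommufd=iommufd0 \
-device pxb-pcie,id=pcie.2,bus_nr=16 \
-device arm-smmuv3,primary-bus=pcie.2,accel=on \
-device pcie-root-port,id=rp2,bus=pcie.2 \
-device vfio-pci,host=0000:7d:00.2,bus=rp2,iommufd=iommufd0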

Nicolin
RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 1 month, 4 weeks ago

> On Tue, Sep 16, 2025 at 11:33:21AM +0100, Shameer Kolothum wrote:
> > [...]
> 
> That's changing from an ioas_id per VFIO device to an ioas_id per
> vSMMU instance, right? I think it's still not enough.
> 
> All vSMMU instances could share the same ioas_id. That is why I
> put in the SMMUBaseClass as it's shared structure across vSMMUs.

Ah..you mean it is basically per VM then. Got it.

Thanks,
Shameer
RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 1 month, 3 weeks ago

> > [...]
> >
> > That's changing from an ioas_id per VFIO device to an ioas_id per
> > vSMMU instance, right? I think it's still not enough.
> >
> > All vSMMU instances could share the same ioas_id. That is why I put in
> > the SMMUBaseClass as it's shared structure across vSMMUs.
> 
> Ah..you mean it is basically per VM then. Got it.

Regarding using SMMUBaseClass for this, it looks like ObjectClass normally holds
function pointers. Eric has made a similar observation elsewhere in this series.

 @Eric, any suggestions?

Is use of &address_space_memory directly in smmuv3_accel_find_add_as() a complete
no-go, given we are talking about having the system address space for all the SMMUv3 
accelerated devices here?

Thanks,
Shameer
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Nicolin Chen 1 month, 3 weeks ago
On Thu, Sep 18, 2025 at 06:31:43AM -0700, Shameer Kolothum wrote:
> [...]
>
> Regarding using SMMUBaseClass for this, it looks like ObjectClass normally holds
> function pointers. Eric has made a similar observation elsewhere in this series.
> 
>  @Eric, any suggestions?
> 
> Is use of &address_space_memory directly in smmuv3_accel_find_add_as() a complete
> no-go, given we are talking about having the system address space for all the SMMUv3 
> accelerated devices here?

smmuv3_accel_find_add_as() is per instance. So, it wouldn't share.

Even smmuv3_accel_init() wouldn't work.

If SMMUBaseClass isn't a good place, (I haven't tested but..) can
we do address_space_init() in the VMS (VirtMachineState)? I see virt.c
calling memory_region_init(), so perhaps we could init the as_sysmem
there and pass it in to each vSMMU instance.
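
(A rough sketch of that idea, untested. The setter is hypothetical, since
an AddressSpace is not a QOM object and cannot be passed as a link
property:)

/* hw/arm/virt.c (sketch): one shared AS for the whole machine */
static AddressSpace smmu_accel_as_sysmem;

static void create_smmu(const VirtMachineState *vms, PCIBus *bus)
{
    DeviceState *dev = qdev_new(TYPE_ARM_SMMUV3);

    if (!smmu_accel_as_sysmem.root) {
        address_space_init(&smmu_accel_as_sysmem, get_system_memory(),
                           "smmuv3-accel-sysmem");
    }
    /* hypothetical helper: hand the same pointer to every vSMMU */
    smmuv3_accel_set_sysmem_as(ARM_SMMUV3(dev), &smmu_accel_as_sysmem);
    /* ... existing property/realize wiring unchanged ... */
}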

Nicolin
RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 1 month, 3 weeks ago

> On Thu, Sep 18, 2025 at 06:31:43AM -0700, Shameer Kolothum wrote:
> > [...]
> >
> > Regarding using SMMUBaseClass for this, it looks like ObjectClass
> > normally holds function pointers. Eric has made a similar observation
> > elsewhere in this series.
> >
> >  @Eric, any suggestions?
> >
> > Is use of &address_space_memory directly in smmuv3_accel_find_add_as()
> > a complete no-go, given we are talking about having the system address
> > space for all the SMMUv3 accelerated devices here?
> 
> smmuv3_accel_find_add_as() is per instance. So, it wouldn't share.

My suggestion was...

static AddressSpace *smmuv3_accel_find_add_as(..)
{
...
    if (vfio_pci) {
        return &address_space_memory;
    } else {
        return &sdev->as;
    }
}

ie, use the global system memory address space instead of creating an
alias to system memory and a separate address space. This will provide
the same pointer to VFIO/iommufd, and it can then reuse the ioas_id.
I can see that QEMU uses "&address_space_memory" directly in many places
(pci_device_iommu_address_space(), etc). I think the idea behind a separate
address space is probably to have private ownership and lifetime management.
Not sure there are any other concerns here. Please let me know if there are
any.

> Even smmuv3_accel_init() wouldn't work.
> 
> If SMMUBaseClass isn't a good place, (I haven't tested but..) can
> we do address_space_init() in the VMS? I see virt.c calling the
> memory_region_init(), so perhaps we could init the as_sysmem there
> and pass it in to each vSMMU instance.

That will still require you to store the address space pointer somewhere
and, each time an SMMUv3 accel instance is created, set a property on the
instance via the hotplug handler. Doable but still...

Thanks,
Shameer
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Peter Maydell 1 month, 3 weeks ago
On Fri, 19 Sept 2025 at 08:38, Shameer Kolothum <skolothumtho@nvidia.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: 18 September 2025 23:00
> > To: Shameer Kolothum <skolothumtho@nvidia.com>
> > Cc: Shameer Kolothum <shameerkolothum@gmail.com>; qemu-
> > arm@nongnu.org; qemu-devel@nongnu.org; eric.auger@redhat.com;
> > peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>;
> > ddutile@redhat.com; berrange@redhat.com; Nathan Chen
> > <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> > smostafa@google.com; linuxarm@huawei.com; wangzhou1@hisilicon.com;
> > jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> > zhangfei.gao@linaro.org; zhenzhong.duan@intel.com
> > Subject: Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
> > accelerated SMMUv3 to vfio-pci endpoints with iommufd
> >
> > On Thu, Sep 18, 2025 at 06:31:43AM -0700, Shameer Kolothum wrote:
> > > > > > @@ -37,7 +37,6 @@ typedef struct SMMUS1Hwpt {
> > > > > >
> > > > > >  typedef struct SMMUv3AccelDevice {
> > > > > >      SMMUDevice  sdev;
> > > > > > -    AddressSpace as_sysmem;
> > > > > >      HostIOMMUDeviceIOMMUFD *idev;
> > > > > >      SMMUS1Hwpt  *s1_hwpt;
> > > > > >      SMMUViommu *viommu;
> > > > > > @@ -48,6 +47,7 @@ typedef struct SMMUv3AccelDevice {  typedef
> > struct
> > > > > > SMMUv3AccelState {
> > > > > >      MemoryRegion root;
> > > > > >      MemoryRegion sysmem;
> > > > > > +    AddressSpace as_sysmem;
> > > > > >      SMMUViommu *viommu;
> > > > > >      struct iommu_hw_info_arm_smmuv3 info;  } SMMUv3AccelState;
> > > > >
> > > > > That's changing from an ioas_id per VFIO device to an ioas_id per
> > > > > vSMMU instance, right? I think it's still not enough.
> > > > >
> > > > > All vSMMU instances could share the same ioas_id. That is why I put in
> > > > > the SMMUBaseClass as it's shared structure across vSMMUs.
> > > >
> > > > Ah..you mean it is basically per VM then. Got it.
> > >
> > > Regarding using SMMUBaseClass for this, it looks like ObjectClass normally
> > holds
> > > function pointers. Eric has made a similar observation elsewhere in this
> > series.
> > >
> > >  @Eric, any suggestions?
> > >
> > > Is use of &address_space_memory directly in smmuv3_accel_find_add_as()
> > a complete
> > > no-go, given we are talking about having the system address space for all the
> > SMMUv3
> > > accelerated devices here?
> >
> > smmuv3_accel_find_add_as() is per instance. So, it wouldn't share.
>
> My suggestion was...
>
> static AddressSpace *smmuv3_accel_find_add_as(..)
> {
> ...
>     if (vfio_pci) {
>         return &address_space_memory;
>     } else {
>         return &sdev->as;
>     }
> }
>
> ie, use the global to system memory address space instead of creating an
> alias to the system memory and a different address space. This will provide
> the same pointer to VFIO/iommufd and  it can then reuse the ioas_id.
> I can see that QEMU uses "&address_space_memory" directly in many places
> (pci_device_iommu_address_space(), etc). I think the idea behind a separate
> address space is to have private ownership and lifetime management probably.
> Not sure there are any other concerns here. Please let me know if there are
> any.

I don't know about vfio, but why is it special here? Generally
directly doing stuff with address_space_memory is not a good
idea: you should be dealing with whatever address space the
board model handed you as what you do transactions to. We
have a lot of legacy code that assumes it can directly work
with address_space_memory, but we should usually try to avoid
adding new code that does that.
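
(For comparison, the usual pattern for a PCI device model is to DMA through
the address space its bus hands it, e.g. via the stock pci_dma_* helpers:)

    /* Resolves through pci_get_address_space(pdev), i.e. whatever AS the
     * bus/IOMMU gave this device, not address_space_memory directly. */
    MemTxResult ret = pci_dma_read(pdev, addr, buf, len);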

thanks
-- PMM
RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 1 month, 3 weeks ago

> On Fri, 19 Sept 2025 at 08:38, Shameer Kolothum
> <skolothumtho@nvidia.com> wrote:
> [...]
>
> I don't know about vfio, but why is it special here? Generally
> directly doing stuff with address_space_memory is not a good
> idea: you should be dealing with whatever address space the
> board model handed you as what you do transactions to. We
> have a lot of legacy code that assumes it can directly work
> with address_space_memory, but we should usually try to avoid
> adding new code that does that.

The goal is to return the same system address space pointer for all devices
behind the accelerated SMMUv3s in a VM. That way VFIO/iommufd can
reuse a single IOAS ID in iommufd_cdev_attach(), allowing the Stage-2 page
tables to be shared within the VM instead of duplicating them for every
SMMUv3 instance.

If directly using &address_space_memory in SMMUv3 is discouraged, the
other two options considered so far are:

 - Make use of the SMMUv3 ObjectClass to create a separate AddressSpace so
   that all instances can return the same pointer. But it looks like
   ObjectClass is mainly meant to hold function pointers and, AFAICS,
   AddressSpace is not an object type, making it difficult to associate the
   pointer using object_class_property_add/object_class_property_link etc.

 - Same problem if we do it within the virt code. How can we link it in
   virt with the SMMUv3 dev state?

Please let me know if I am missing something and there is a way around this.
Any pointers appreciated.

Thanks,
Shameer
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Nicolin Chen 1 month, 3 weeks ago
On Fri, Sep 19, 2025 at 12:38:45AM -0700, Shameer Kolothum wrote:
> My suggestion was...
> 
> static AddressSpace *smmuv3_accel_find_add_as(..)
> {
> ...
>     if (vfio_pci) {
>         return &address_space_memory;
>     } else {
>         return &sdev->as;
>     }
> }
> 
> ie, use the global to system memory address space instead of creating an
> alias to the system memory and a different address space. This will provide
> the same pointer to VFIO/iommufd and  it can then reuse the ioas_id.
> I can see that QEMU uses "&address_space_memory" directly in many places
> (pci_device_iommu_address_space(), etc). I think the idea behind a separate 
> address space is to have private ownership and lifetime management probably.
> Not sure there are any other concerns here. Please let me know if there are
> any.

Oh, I misunderstood.

Yea, given that address_space_memory comes from system_memory:

system/physmem.c:2824:    address_space_init(&address_space_memory, system_memory, "memory");

I suppose it should work. That way will be cleaner.

Thanks
Nicolin
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Jason Gunthorpe 1 month, 4 weeks ago
On Wed, Sep 17, 2025 at 06:52:31PM +0000, Shameer Kolothum wrote:
> > All vSMMU instances could share the same ioas_id. That is why I
> > put in the SMMUBaseClass as it's shared structure across vSMMUs.
> 
> Ah..you mean it is basically per VM then. Got it.

Yes, this is quite important: each ioas charges memory for what it
pins, so every needless duplicate wrongly inflates the accounting.

We are also working to have the kernel share S2 page tables across
instances, which requires a single ioas.

Jason
RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Duan, Zhenzhong 4 months ago

>
>Accelerated SMMUv3 is only useful when the device can take advantage of
>the host's SMMUv3 in nested mode. To keep things simple and correct, we
>only allow this feature for vfio-pci endpoint devices that use the iommufd
>backend. We also allow non-endpoint emulated devices like PCI bridges and
>root ports, so that users can plug in these vfio-pci devices.
>
>Another reason for this limit is to avoid problems with IOTLB
>invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
>SID, making it difficult to trace the originating device. If we allowed
>emulated endpoint devices, QEMU would have to invalidate both its own
>software IOTLB and the host's hardware IOTLB, which could slow things
>down.
>
>Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
>translation (S1+S2), their get_address_space() callback must return the
>system address space to enable correct S2 mappings of guest RAM.
>
>So in short:
> - vfio-pci devices return the system address space
> - bridges and root ports return the IOMMU address space
>
>Note: On ARM, MSI doorbell addresses are also translated via SMMUv3.

So the translation result is a doorbell addr (GPA) for the guest?
IIUC, there should be a mapping from the guest doorbell addr (GPA) to the
host doorbell addr (HPA) in the stage-2 page table? Where is this mapping
set up?

>Hence, if a vfio-pci device is behind the SMMUv3 with translation enabled,
>it must return the IOMMU address space for MSI. Support for this will be
>added in a follow-up patch.
>
>Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
>---
> hw/arm/smmuv3-accel.c               | 50 ++++++++++++++++++++++++++++-
> hw/arm/smmuv3-accel.h               | 15 +++++++++
> hw/arm/smmuv3.c                     |  4 +++
> hw/pci-bridge/pci_expander_bridge.c |  1 -
> include/hw/arm/smmuv3.h             |  1 +
> include/hw/pci/pci_bridge.h         |  1 +
> 6 files changed, 70 insertions(+), 2 deletions(-)
>
>diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
>index 2eac9c6ff4..0b0ddb03e2 100644
>--- a/hw/arm/smmuv3-accel.c
>+++ b/hw/arm/smmuv3-accel.c
>@@ -7,13 +7,19 @@
>  */
>
> #include "qemu/osdep.h"
>+#include "qemu/error-report.h"
>
> #include "hw/arm/smmuv3.h"
>+#include "hw/pci/pci_bridge.h"
>+#include "hw/pci-host/gpex.h"
>+#include "hw/vfio/pci.h"
>+
> #include "smmuv3-accel.h"
>
> static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
>                                                 PCIBus *bus, int devfn)
> {
>+    SMMUv3State *s = ARM_SMMUV3(bs);
>     SMMUDevice *sdev = sbus->pbdev[devfn];
>     SMMUv3AccelDevice *accel_dev;
>
>@@ -25,30 +31,72 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
>
>         sbus->pbdev[devfn] = sdev;
>         smmu_init_sdev(bs, sdev, bus, devfn);
>+        address_space_init(&accel_dev->as_sysmem, &s->s_accel->root,
>+                           "smmuv3-accel-sysmem");
>     }
>
>     return accel_dev;
> }
>
>+static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
>+{
>+
>+    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
>+        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
>+        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
>+        return true;
>+    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
>+        object_property_find(OBJECT(pdev), "iommufd"))) {

Will this always return true?

>+        *vfio_pci = true;
>+        return true;
>+    }
>+    return false;
>+}
>+
> static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
>                                               int devfn)
> {
>+    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>     SMMUState *bs = opaque;
>+    bool vfio_pci = false;
>     SMMUPciBus *sbus;
>     SMMUv3AccelDevice *accel_dev;
>     SMMUDevice *sdev;
>
>+    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
>+        error_report("Device(%s) not allowed. Only PCIe root complex devices "
>+                     "or PCI bridge devices or vfio-pci endpoint devices with "
>+                     "iommufd as backend is allowed with arm-smmuv3,accel=on",
>+                     pdev->name);
>+        exit(1);

Seems aggressive for a hotplug, could we fail the hotplug instead of killing QEMU?

Thanks
Zhenzhong

>+    }
>     sbus = smmu_get_sbus(bs, bus);
>     accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
>     sdev = &accel_dev->sdev;
>
>-    return &sdev->as;
>+    if (vfio_pci) {
>+        return &accel_dev->as_sysmem;
>+    } else {
>+        return &sdev->as;
>+    }
> }
>
> static const PCIIOMMUOps smmuv3_accel_ops = {
>     .get_address_space = smmuv3_accel_find_add_as,
> };
>
>+void smmuv3_accel_init(SMMUv3State *s)
>+{
>+    SMMUv3AccelState *s_accel;
>+
>+    s->s_accel = s_accel = g_new0(SMMUv3AccelState, 1);
>+    memory_region_init(&s_accel->root, OBJECT(s), "root", UINT64_MAX);
>+    memory_region_init_alias(&s_accel->sysmem, OBJECT(s),
>+                             "smmuv3-accel-sysmem", get_system_memory(), 0,
>+                             memory_region_size(get_system_memory()));
>+    memory_region_add_subregion(&s_accel->root, 0, &s_accel->sysmem);
>+}
>+
> static void smmuv3_accel_class_init(ObjectClass *oc, const void *data)
> {
>     SMMUBaseClass *sbc = ARM_SMMU_CLASS(oc);
>diff --git a/hw/arm/smmuv3-accel.h b/hw/arm/smmuv3-accel.h
>index 4cf30b1291..2cd343103f 100644
>--- a/hw/arm/smmuv3-accel.h
>+++ b/hw/arm/smmuv3-accel.h
>@@ -9,11 +9,26 @@
> #ifndef HW_ARM_SMMUV3_ACCEL_H
> #define HW_ARM_SMMUV3_ACCEL_H
>
>+#include "hw/arm/smmuv3.h"
> #include "hw/arm/smmu-common.h"
> #include CONFIG_DEVICES
>
> typedef struct SMMUv3AccelDevice {
>     SMMUDevice  sdev;
>+    AddressSpace as_sysmem;
> } SMMUv3AccelDevice;
>
>+typedef struct SMMUv3AccelState {
>+    MemoryRegion root;
>+    MemoryRegion sysmem;
>+} SMMUv3AccelState;
>+
>+#if defined(CONFIG_ARM_SMMUV3) && defined(CONFIG_IOMMUFD)
>+void smmuv3_accel_init(SMMUv3State *s);
>+#else
>+static inline void smmuv3_accel_init(SMMUv3State *d)
>+{
>+}
>+#endif
>+
> #endif /* HW_ARM_SMMUV3_ACCEL_H */
>diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
>index bcf8af8dc7..2f5a8157dd 100644
>--- a/hw/arm/smmuv3.c
>+++ b/hw/arm/smmuv3.c
>@@ -32,6 +32,7 @@
> #include "qapi/error.h"
>
> #include "hw/arm/smmuv3.h"
>+#include "smmuv3-accel.h"
> #include "smmuv3-internal.h"
> #include "smmu-internal.h"
>
>@@ -1898,6 +1899,9 @@ static void smmu_realize(DeviceState *d, Error **errp)
>     sysbus_init_mmio(dev, &sys->iomem);
>
>     smmu_init_irq(s, dev);
>+    if (sys->accel) {
>+        smmuv3_accel_init(s);
>+    }
> }
>
> static const VMStateDescription vmstate_smmuv3_queue = {
>diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
>index 1bcceddbc4..a8eb2d2426 100644
>--- a/hw/pci-bridge/pci_expander_bridge.c
>+++ b/hw/pci-bridge/pci_expander_bridge.c
>@@ -48,7 +48,6 @@ struct PXBBus {
>     char bus_path[8];
> };
>
>-#define TYPE_PXB_PCIE_DEV "pxb-pcie"
> OBJECT_DECLARE_SIMPLE_TYPE(PXBPCIEDev, PXB_PCIE_DEV)
>
> static GList *pxb_dev_list;
>diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
>index d183a62766..3bdb92391a 100644
>--- a/include/hw/arm/smmuv3.h
>+++ b/include/hw/arm/smmuv3.h
>@@ -63,6 +63,7 @@ struct SMMUv3State {
>     qemu_irq     irq[4];
>     QemuMutex mutex;
>     char *stage;
>+    struct SMMUv3AccelState  *s_accel;
> };
>
> typedef enum {
>diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
>index a055fd8d32..b61360b900 100644
>--- a/include/hw/pci/pci_bridge.h
>+++ b/include/hw/pci/pci_bridge.h
>@@ -106,6 +106,7 @@ typedef struct PXBPCIEDev {
>
> #define TYPE_PXB_PCIE_BUS "pxb-pcie-bus"
> #define TYPE_PXB_CXL_BUS "pxb-cxl-bus"
>+#define TYPE_PXB_PCIE_DEV "pxb-pcie"
> #define TYPE_PXB_DEV "pxb"
> OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV)
>
>--
>2.34.1
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Nicolin Chen 4 months ago
On Tue, Jul 15, 2025 at 10:53:50AM +0000, Duan, Zhenzhong wrote:
> 
> 
> >-----Original Message-----
> >From: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> >Subject: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated
> >SMMUv3 to vfio-pci endpoints with iommufd
> >
> >Accelerated SMMUv3 is only useful when the device can take advantage of
> >the host's SMMUv3 in nested mode. To keep things simple and correct, we
> >only allow this feature for vfio-pci endpoint devices that use the iommufd
> >backend. We also allow non-endpoint emulated devices like PCI bridges and
> >root ports, so that users can plug in these vfio-pci devices.
> >
> >Another reason for this limit is to avoid problems with IOTLB
> >invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
> >SID, making it difficult to trace the originating device. If we allowed
> >emulated endpoint devices, QEMU would have to invalidate both its own
> >software IOTLB and the host's hardware IOTLB, which could slow things
> >down.
> >
> >Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> >translation (S1+S2), their get_address_space() callback must return the
> >system address space to enable correct S2 mappings of guest RAM.
> >
> >So in short:
> > - vfio-pci devices return the system address space
> > - bridges and root ports return the IOMMU address space
> >
> >Note: On ARM, MSI doorbell addresses are also translated via SMMUv3.
> 
> So the translation result is a doorbell addr(gpa) for guest?
> IIUC, there should be a mapping between guest doorbell addr(gpa) to host
> doorbell addr(hpa) in stage2 page table? Where is this mapping setup?

Yes and yes.

On ARM, MSI is behind the IOMMU. When 2-stage translation is enabled,
it goes through two stages as you understood.

There are a few ways to implement this, though the current kernel
only supports one solution, which is a hard-coded RMR (reserved
memory region).

The solution sets up an RMR region in the ACPI IORT, which maps
stage-1 linearly, i.e. gIOVA=gPA.

The gPA=>hPA mappings in the stage-2 are done by the kernel that
polls an IOMMU_RESV_SW_MSI region defined in the kernel driver.

It's not the ideal solution, but it's the simplest to implement.

There are other ways to support this like a true 2-stage mapping
but they are still on the way.

For more details, please refer to this:
https://lore.kernel.org/all/cover.1740014950.git.nicolinc@nvidia.com/
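
As a toy model of that two-stage doorbell walk (illustrative only; the
real mappings live in the SMMU stage-1/stage-2 page tables):

    #include <assert.h>
    #include <stdint.h>

    typedef struct { uint64_t gpa_base, hpa_base, size; } S2MsiWindow;

    /* With the RMR solution, stage-1 is linear (gIOVA == gPA); stage-2
     * then relocates the gPA into the host doorbell region. */
    static uint64_t msi_doorbell_hpa(uint64_t giova, const S2MsiWindow *s2)
    {
        uint64_t gpa = giova;                       /* stage-1: identity */
        assert(gpa >= s2->gpa_base && gpa - s2->gpa_base < s2->size);
        return s2->hpa_base + (gpa - s2->gpa_base); /* stage-2: gPA -> hPA */
    }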

> >+static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
> >+{
> >+
> >+    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
> >+        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
> >+        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
> >+        return true;
> >+    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
> >+        object_property_find(OBJECT(pdev), "iommufd"))) {
> 
> Will this always return true?

It won't if a vfio-pci device doesn't have the "iommufd" property?

> >+        *vfio_pci = true;
> >+        return true;
> >+    }
> >+    return false;

Then, it returns "false" here.

> > static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void
> >*opaque,
> >                                               int devfn)
> > {
> >+    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
> >     SMMUState *bs = opaque;
> >+    bool vfio_pci = false;
> >     SMMUPciBus *sbus;
> >     SMMUv3AccelDevice *accel_dev;
> >     SMMUDevice *sdev;
> >
> >+    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> >+        error_report("Device(%s) not allowed. Only PCIe root complex
> >devices "
> >+                     "or PCI bridge devices or vfio-pci endpoint devices
> >with "
> >+                     "iommufd as backend is allowed with
> >arm-smmuv3,accel=on",
> >+                     pdev->name);
> >+        exit(1);
> 
> Seems aggressive for a hotplug, could we fail hotplug instead of kill QEMU?

Hotplug is unlikely to be supported well, as it would introduce
too much complication.

With iommufd, a vIOMMU object is allocated per (vfio) device. If
the device fd (cdev) is not yet given to QEMU, it isn't able
to allocate a vIOMMU object when creating a VM.

A vIOMMU object can be allocated at a later stage once the
device is hotplugged, but things like IORT mappings can't be
refreshed since the OS is likely already booted. Even an
IOMMU capability sync via the hw_info ioctl will be difficult to
do at runtime, after the guest iommu driver's initialization.

I am not 100% sure. But I think the Intel model could have a similar
problem if the guest boots with zero cold-plugged devices and then
hot-plugs a PASID-capable device at runtime, when the guest-level
IOMMU driver is already initialized?

FWIW, Shameer's cover-letter has the following line:
 "At least one vfio-pci device must currently be cold-plugged to
  a PCIe root complex associated with arm-smmuv3,accel=on."

Perhaps there should be a similar highlight in this smmuv3-accel
file as well (@Shameer).

Nicolin
RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameerali Kolothum Thodi via 4 months ago

> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, July 15, 2025 6:59 PM
> To: Duan, Zhenzhong <zhenzhong.duan@intel.com>
> Cc: Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>; qemu-arm@nongnu.org;
> qemu-devel@nongnu.org; eric.auger@redhat.com;
> peter.maydell@linaro.org; jgg@nvidia.com; ddutile@redhat.com;
> berrange@redhat.com; nathanc@nvidia.com; mochs@nvidia.com;
> smostafa@google.com; Linuxarm <linuxarm@huawei.com>; Wangzhou (B)
> <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>;
> zhangfei.gao@linaro.org; shameerkolothum@gmail.com
> Subject: Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
> accelerated SMMUv3 to vfio-pci endpoints with iommufd

...

> > >+    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> > >+        error_report("Device(%s) not allowed. Only PCIe root complex
> > >devices "
> > >+                     "or PCI bridge devices or vfio-pci endpoint devices
> > >with "
> > >+                     "iommufd as backend is allowed with
> > >arm-smmuv3,accel=on",
> > >+                     pdev->name);
> > >+        exit(1);
> >
> > Seems aggressive for a hotplug, could we fail hotplug instead of kill
> QEMU?

That's right. I will try to see whether it is possible to do a dev->hotplugged
check here.
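
Something along these lines, perhaps (a sketch only; whether the
get_address_space() caller can tolerate a NULL return here would still
need auditing):

    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
        DeviceState *dev = DEVICE(pdev);

        error_report("Device(%s) not allowed with arm-smmuv3,accel=on",
                     pdev->name);
        if (!dev->hotplugged) {
            exit(1);      /* bad cold-plug configuration: fail at startup */
        }
        return NULL;      /* reject the hotplugged device, keep QEMU alive */
    }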
 
> Hotplug will unlikely be supported well, as it would introduce
> too much complication.
> 
> With iommufd, a vIOMMU object is allocated per device (vfio). If
> the device fd (cdev) is not yet given to the QEMU. It isn't able
> to allocate a vIOMMU object when creating a VM.
> 
> While a vIOMMU object can be allocated at a later stage once the
> device is hotplugged. But things like IORT mappings aren't able
> to get refreshed since the OS is likely already booted.

Why do we need IORT mappings to be refreshed during hotplug?
AFAICS, the mappings are created per host bridge IDs. And how is this
different from a host machine doing hotplug?

 Even an
> IOMMU capability sync via the hw_info ioctl will be difficult to
> do at the runtime post the guest iommu driver's initialization.

We had some discussion on this "at least one vfio-pci" restriction
for accelerated mode previously here.
https://lore.kernel.org/qemu-devel/Z6TtCLQ35UI12T77@redhat.com/#t

I am not sure we reached any consensus on that. The 3 different approaches
discussed were:

1. The current one used here. At least one cold-plugged vfio-pci device
   so that we can retrieve the host SMMUv3 HW_INFO as per the current
   IOMMUFD APIs.

2. A new IOMMUFD API to retrieve HW_INFO without a device.

3. A fully specified vSMMUv3 through the QEMU command line so that we
   don't need HW_INFO from the kernel.

We're going with option one for now, but completely blocking hotplug
because of it feels a bit too restrictive to me.

The real issue (for now), as I see it, is that we need some way to remember
the Guest SMMUv3 <-> Host SMMUv3 mapping after the guest has booted.
That way, even if all devices tied to a Guest SMMUv3 get hot-unplugged,
QEMU can still block attaching a device that belongs to a different Host
SMMUv3.

Thanks,
Shameer

> I am not 100% sure. But I think Intel model could have a similar
> problem if the guest boots with zero cold-plugged device and then
> hot-plugs a PASID-capable device at the runtime, when the guest-
> level IOMMU driver is already inited?
> 
> FWIW, Shameer's cover-letter has the following line:
>  "At least one vfio-pci device must currently be cold-plugged to
>   a PCIe root complex associated with arm-smmuv3,accel=on."
> 
> Perhaps there should be a similar highlight in this smmuv3-accel
> file as well (@Shameer).
> 
> Nicolin
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Eric Auger 2 months, 1 week ago
Hi Shameer,

On 7/16/25 10:06 AM, Shameerali Kolothum Thodi wrote:
>
>> -----Original Message-----
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Sent: Tuesday, July 15, 2025 6:59 PM
>> To: Duan, Zhenzhong <zhenzhong.duan@intel.com>
>> Cc: Shameerali Kolothum Thodi
>> <shameerali.kolothum.thodi@huawei.com>; qemu-arm@nongnu.org;
>> qemu-devel@nongnu.org; eric.auger@redhat.com;
>> peter.maydell@linaro.org; jgg@nvidia.com; ddutile@redhat.com;
>> berrange@redhat.com; nathanc@nvidia.com; mochs@nvidia.com;
>> smostafa@google.com; Linuxarm <linuxarm@huawei.com>; Wangzhou (B)
>> <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>;
>> Jonathan Cameron <jonathan.cameron@huawei.com>;
>> zhangfei.gao@linaro.org; shameerkolothum@gmail.com
>> Subject: Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
>> accelerated SMMUv3 to vfio-pci endpoints with iommufd
> ...
>
>>>> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
>>>> +        error_report("Device(%s) not allowed. Only PCIe root complex
>>>> devices "
>>>> +                     "or PCI bridge devices or vfio-pci endpoint devices
>>>> with "
>>>> +                     "iommufd as backend is allowed with
>>>> arm-smmuv3,accel=on",
>>>> +                     pdev->name);
>>>> +        exit(1);
>>> Seems aggressive for a hotplug, could we fail hotplug instead of kill
>> QEMU?
> That's right. I will try to see whether it is possible to do a dev->hotplugged
> check here.
>  
>> Hotplug will unlikely be supported well, as it would introduce
>> too much complication.
>>
>> With iommufd, a vIOMMU object is allocated per device (vfio). If
>> the device fd (cdev) is not yet given to the QEMU. It isn't able
>> to allocate a vIOMMU object when creating a VM.
>>
>> While a vIOMMU object can be allocated at a later stage once the
>> device is hotplugged. But things like IORT mappings aren't able
>> to get refreshed since the OS is likely already booted.
> Why do we need IORT mappings to be refreshed during hotplug?
> AFAICS, the mappings are created per host bridge Ids. And how is this
> different from a host machine doing hotplug?
>
>  Even an
>> IOMMU capability sync via the hw_info ioctl will be difficult to
>> do at the runtime post the guest iommu driver's initialization.
> We had some discussion on this "at least one vfio-pci" restriction
> for accelerated mode previously here.
> https://lore.kernel.org/qemu-devel/Z6TtCLQ35UI12T77@redhat.com/#t
>
> I am not sure we reached any consensus on that. The 3 different approaches
> discussed were,
>
> 1. The current one used here. At least one cold plugged vfio-pci device
>    so that  we can retrieve the host SMMUV3 HW_INFO as per current
>   IOMMUFD APIs.

I do not get why you can't wait for the 1st device to be attached to
"freeze" the settings. Is it because you may also have some bridges /
root ports attached to the same vIOMMU? As those latter do not have
any adherence to the host SMMU, is that a problem?
>
> 2. A new IOMMUFD API to retrieve HW_INFO without a device. 
This can only be possible if, on the command line, you connect the vSMMU
to a sysfs path for the host IOMMU (or maybe this is what you meant in
3). This would be another option we also evoked in the past. But it is
not very user-friendly: whoever launches the VM has to care about both
the device and the associated physical SMMU. Logically we could build
that relationship automatically.
>
> 3. A fully specified vSMMUv3 through Qemu command line so that we
>    don't need HW_INFO from kernel.

I don't think this is sensible, as there may be plenty of those, each
requiring a libvirt adaptation.

Eric
>
> We're going with option one for now, but completely blocking hotplug
> because of it  feels a bit too restrictive to me.
>
> The real issue (for now), as I see it, is that we need some way to remember
> the Guest SMMUv3 <-> Host SMMUv3 mapping after the guest has booted.
> That way, even if all devices tied to a Guest SMMUv3 get hot-unplugged,
> QEMU can still block attaching a device that belongs to a different Host
> SMMUv3.
>
> Thanks,
> Shameer
>
>> I am not 100% sure. But I think Intel model could have a similar
>> problem if the guest boots with zero cold-plugged device and then
>> hot-plugs a PASID-capable device at the runtime, when the guest-
>> level IOMMU driver is already inited?
>>
>> FWIW, Shameer's cover-letter has the following line:
>>  "At least one vfio-pci device must currently be cold-plugged to
>>   a PCIe root complex associated with arm-smmuv3,accel=on."
>>
>> Perhaps there should be a similar highlight in this smmuv3-accel
>> file as well (@Shameer).
>>
>> Nicolin
RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 2 months, 1 week ago
Hi Eric,

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 05 September 2025 09:30
> To: Nicolin Chen <nicolinc@nvidia.com>; Duan, Zhenzhong
> <zhenzhong.duan@intel.com>; Shameer Kolothum
> <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>;
> ddutile@redhat.com; berrange@redhat.com; Nathan Chen
> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; Linuxarm <linuxarm@huawei.com>; Wangzhou (B)
> <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>;
> zhangfei.gao@linaro.org; shameerkolothum@gmail.com
> Subject: Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
> accelerated SMMUv3 to vfio-pci endpoints with iommufd
> 
> Hi Shameer,
> 
> On 7/16/25 10:06 AM, Shameerali Kolothum Thodi wrote:
> >
> >> -----Original Message-----
> >> From: Nicolin Chen <nicolinc@nvidia.com>
> >> Sent: Tuesday, July 15, 2025 6:59 PM
> >> To: Duan, Zhenzhong <zhenzhong.duan@intel.com>
> >> Cc: Shameerali Kolothum Thodi
> >> <shameerali.kolothum.thodi@huawei.com>; qemu-arm@nongnu.org;
> >> qemu-devel@nongnu.org; eric.auger@redhat.com;
> >> peter.maydell@linaro.org; jgg@nvidia.com; ddutile@redhat.com;
> >> berrange@redhat.com; nathanc@nvidia.com; mochs@nvidia.com;
> >> smostafa@google.com; Linuxarm <linuxarm@huawei.com>; Wangzhou (B)
> >> <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>;
> >> Jonathan Cameron <jonathan.cameron@huawei.com>;
> >> zhangfei.gao@linaro.org; shameerkolothum@gmail.com
> >> Subject: Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
> >> accelerated SMMUv3 to vfio-pci endpoints with iommufd
> > ...
> >
> >>>> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> >>>> +        error_report("Device(%s) not allowed. Only PCIe root complex
> >>>> devices "
> >>>> +                     "or PCI bridge devices or vfio-pci endpoint devices
> >>>> with "
> >>>> +                     "iommufd as backend is allowed with
> >>>> arm-smmuv3,accel=on",
> >>>> +                     pdev->name);
> >>>> +        exit(1);
> >>> Seems aggressive for a hotplug, could we fail hotplug instead of kill
> >> QEMU?
> > That's right. I will try to see whether it is possible to do a dev->hotplugged
> > check here.
> >
> >> Hotplug will unlikely be supported well, as it would introduce
> >> too much complication.
> >>
> >> With iommufd, a vIOMMU object is allocated per device (vfio). If
> >> the device fd (cdev) is not yet given to the QEMU. It isn't able
> >> to allocate a vIOMMU object when creating a VM.
> >>
> >> While a vIOMMU object can be allocated at a later stage once the
> >> device is hotplugged. But things like IORT mappings aren't able
> >> to get refreshed since the OS is likely already booted.
> > Why do we need IORT mappings to be refreshed during hotplug?
> > AFAICS, the mappings are created per host bridge Ids. And how is this
> > different from a host machine doing hotplug?
> >
> >  Even an
> >> IOMMU capability sync via the hw_info ioctl will be difficult to
> >> do at the runtime post the guest iommu driver's initialization.
> > We had some discussion on this "at least one vfio-pci" restriction
> > for accelerated mode previously here.
> > https://lore.kernel.org/qemu-devel/Z6TtCLQ35UI12T77@redhat.com/#t
> >
> > I am not sure we reached any consensus on that. The 3 different approaches
> > discussed were,
> >
> > 1. The current one used here. At least one cold plugged vfio-pci device
> >    so that  we can retrieve the host SMMUV3 HW_INFO as per current
> >   IOMMUFD APIs.
> 
> I do not get why you can't wait for the 1st device to be attached to
> "freeze" the settings. Is it because you may also have some bridges /
> root ports also attached to the same viommu. As those latter do not have
> any adherence to the host SMMU, is that a problem?

We need to initialise the registers for the SMMUv3 before the guest boots,
as the SMMUv3 is a platform device and can't be hot-plugged later.

This is where we do it now:

    smmu_reset_exit()
      --> smmuv3_init_regs(s);
          if (sys->accel) {
              smmuv3_accel_init_regs(s);
          }

I am not sure how we can update the guest SMMUv3 features
after the boot.

And the only way to retrieve the host SMMUv3 HW features is
through a device bound to iommufd (IOMMU_GET_HW_INFO).

> >
> > 2. A new IOMMUFD API to retrieve HW_INFO without a device.
> this can only be possible if, on the command line you connect the vsmmu
> to a sysfs path to the host iommu (or maybe this is what you meant in
> 3). This would be another option we also evoked in the past. But this is
> not very user friendly for the guy who launches the VM to care both the
> device and the associated physical SMMU. Logically we could build that
> relationship automatically.
> >
> > 3. A fully specified vSMMUv3 through Qemu command line so that we
> >    don't need HW_INFO from kernel.
> 
> I don't think this is sensible as there may be plenty of those, each
> requirement a libvirt adaptation

As mentioned in my previous reply, the idea is to initialize the accel SMMUv3
with all the features of the emulated SMMUv3 and add support for any additional
features required for the accel case through the command line (like STALL, PASID, etc.).

Thanks,
Shameer
RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Duan, Zhenzhong 4 months ago

>-----Original Message-----
>From: Nicolin Chen <nicolinc@nvidia.com>
>Subject: Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
>accelerated SMMUv3 to vfio-pci endpoints with iommufd
>
>On Tue, Jul 15, 2025 at 10:53:50AM +0000, Duan, Zhenzhong wrote:
>>
>>
>> >-----Original Message-----
>> >From: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
>> >Subject: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
>accelerated
>> >SMMUv3 to vfio-pci endpoints with iommufd
>> >
>> >Accelerated SMMUv3 is only useful when the device can take advantage of
>> >the host's SMMUv3 in nested mode. To keep things simple and correct, we
>> >only allow this feature for vfio-pci endpoint devices that use the iommufd
>> >backend. We also allow non-endpoint emulated devices like PCI bridges
>and
>> >root ports, so that users can plug in these vfio-pci devices.
>> >
>> >Another reason for this limit is to avoid problems with IOTLB
>> >invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an
>associated
>> >SID, making it difficult to trace the originating device. If we allowed
>> >emulated endpoint devices, QEMU would have to invalidate both its own
>> >software IOTLB and the host's hardware IOTLB, which could slow things
>> >down.
>> >
>> >Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
>> >translation (S1+S2), their get_address_space() callback must return the
>> >system address space to enable correct S2 mappings of guest RAM.
>> >
>> >So in short:
>> > - vfio-pci devices return the system address space
>> > - bridges and root ports return the IOMMU address space
>> >
>> >Note: On ARM, MSI doorbell addresses are also translated via SMMUv3.
>>
>> So the translation result is a doorbell addr(gpa) for guest?
>> IIUC, there should be a mapping between guest doorbell addr(gpa) to host
>> doorbell addr(hpa) in stage2 page table? Where is this mapping setup?
>
>Yes and yes.
>
>On ARM, MSI is behind IOMMU. When 2-stage translation is enabled,
>it goes through two stages as you understood.
>
>There are a few ways to implement this, though the current kernel
>only supports one solution, which is a hard-coded RMR (reserved
>memory region).
>
>The solution sets up a RMR region in the ACPI's IORT, which maps
>the stage1 linearly, i.e. gIOVA=gPA.
>
>The gPA=>hPA mappings in the stage-2 are done by the kernel that
>polls an IOMMU_RESV_SW_MSI region defined in the kernel driver.
>
>It's not the ideal solution, but it's the simplest to implement.
>
>There are other ways to support this like a true 2-stage mapping
>but they are still on the way.
>
>For more details, please refer to this:
>https://lore.kernel.org/all/cover.1740014950.git.nicolinc@nvidia.com/

Thanks for the link, it helps a lot in understanding the ARM SMMU architecture.

>
>> >+static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool
>*vfio_pci)
>> >+{
>> >+
>> >+    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
>> >+        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
>> >+        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
>> >+        return true;
>> >+    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
>> >+        object_property_find(OBJECT(pdev), "iommufd"))) {
>>
>> Will this always return true?
>
>It won't if a vfio-pci device doesn't have the "iommufd" property?

IIUC, the iommufd property is always there, just with its value not filled in the legacy container case.
What about checking VFIOPCIDevice.vbasedev.iommufd?

>
>> >+        *vfio_pci = true;
>> >+        return true;
>> >+    }
>> >+    return false;
>
>Then, it returns "false" here.
>
>> > static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void
>> >*opaque,
>> >                                               int devfn)
>> > {
>> >+    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>> >     SMMUState *bs = opaque;
>> >+    bool vfio_pci = false;
>> >     SMMUPciBus *sbus;
>> >     SMMUv3AccelDevice *accel_dev;
>> >     SMMUDevice *sdev;
>> >
>> >+    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
>> >+        error_report("Device(%s) not allowed. Only PCIe root complex
>> >devices "
>> >+                     "or PCI bridge devices or vfio-pci endpoint
>devices
>> >with "
>> >+                     "iommufd as backend is allowed with
>> >arm-smmuv3,accel=on",
>> >+                     pdev->name);
>> >+        exit(1);
>>
>> Seems aggressive for a hotplug, could we fail hotplug instead of kill QEMU?
>
>Hotplug will unlikely be supported well, as it would introduce
>too much complication.
>
>With iommufd, a vIOMMU object is allocated per device (vfio). If
>the device fd (cdev) is not yet given to the QEMU. It isn't able
>to allocate a vIOMMU object when creating a VM.
>
>While a vIOMMU object can be allocated at a later stage once the
>device is hotplugged. But things like IORT mappings aren't able
>to get refreshed since the OS is likely already booted. Even an
>IOMMU capability sync via the hw_info ioctl will be difficult to
>do at the runtime post the guest iommu driver's initialization.
>
>I am not 100% sure. But I think Intel model could have a similar
>problem if the guest boots with zero cold-plugged device and then
>hot-plugs a PASID-capable device at the runtime, when the guest-
>level IOMMU driver is already inited?

For vtd we define a property for each capability we care about.
When hotplugging a device, we get hw_info through an ioctl and compare
the host's capability with the virtual vtd's property settings; if
incompatible, we fail the hotplug.

In the old implementation we synced host iommu caps into the virtual
vtd's caps, but that was NAKed by the maintainer. The suggested way is
to define a property for each capability we care about and do a
compatibility check.

There is a "pasid" property in virtual vtd; only when it is true can a
PASID-capable device work with pasid.
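
In pseudo-QEMU terms, the check looks roughly like this (names are
illustrative, not the actual intel-iommu code):

    static bool vtd_check_hotplug_compat(bool pasid_prop_enabled,
                                         bool host_has_pasid, Error **errp)
    {
        /* Fail the hotplug rather than QEMU: the device is only allowed
         * when the user-configured virtual capability is backed by the
         * host capability reported via IOMMU_GET_HW_INFO. */
        if (pasid_prop_enabled && !host_has_pasid) {
            error_setg(errp, "host IOMMU lacks PASID support");
            return false;
        }
        return true;
    }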

Zhenzhong

>
>FWIW, Shameer's cover-letter has the following line:
> "At least one vfio-pci device must currently be cold-plugged to
>  a PCIe root complex associated with arm-smmuv3,accel=on."
>
>Perhaps there should be a similar highlight in this smmuv3-accel
>file as well (@Shameer).
>
>Nicolin
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Eric Auger 2 months, 1 week ago
Hi,

On 7/16/25 8:26 AM, Duan, Zhenzhong wrote:
>
>> -----Original Message-----
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Subject: Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
>> accelerated SMMUv3 to vfio-pci endpoints with iommufd
>>
>> On Tue, Jul 15, 2025 at 10:53:50AM +0000, Duan, Zhenzhong wrote:
>>>
>>>> -----Original Message-----
>>>> From: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
>>>> Subject: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
>> accelerated
>>>> SMMUv3 to vfio-pci endpoints with iommufd
>>>>
>>>> Accelerated SMMUv3 is only useful when the device can take advantage of
>>>> the host's SMMUv3 in nested mode. To keep things simple and correct, we
>>>> only allow this feature for vfio-pci endpoint devices that use the iommufd
>>>> backend. We also allow non-endpoint emulated devices like PCI bridges
>> and
>>>> root ports, so that users can plug in these vfio-pci devices.
>>>>
>>>> Another reason for this limit is to avoid problems with IOTLB
>>>> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an
>> associated
>>>> SID, making it difficult to trace the originating device. If we allowed
>>>> emulated endpoint devices, QEMU would have to invalidate both its own
>>>> software IOTLB and the host's hardware IOTLB, which could slow things
>>>> down.
>>>>
>>>> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
>>>> translation (S1+S2), their get_address_space() callback must return the
>>>> system address space to enable correct S2 mappings of guest RAM.
>>>>
>>>> So in short:
>>>> - vfio-pci devices return the system address space
>>>> - bridges and root ports return the IOMMU address space
>>>>
>>>> Note: On ARM, MSI doorbell addresses are also translated via SMMUv3.
>>> So the translation result is a doorbell addr(gpa) for guest?
>>> IIUC, there should be a mapping between guest doorbell addr(gpa) to host
>>> doorbell addr(hpa) in stage2 page table? Where is this mapping setup?
>> Yes and yes.
>>
>> On ARM, MSI is behind IOMMU. When 2-stage translation is enabled,
>> it goes through two stages as you understood.
>>
>> There are a few ways to implement this, though the current kernel
>> only supports one solution, which is a hard-coded RMR (reserved
>> memory region).
>>
>> The solution sets up a RMR region in the ACPI's IORT, which maps
>> the stage1 linearly, i.e. gIOVA=gPA.
>>
>> The gPA=>hPA mappings in the stage-2 are done by the kernel that
>> polls an IOMMU_RESV_SW_MSI region defined in the kernel driver.
>>
>> It's not the ideal solution, but it's the simplest to implement.
>>
>> There are other ways to support this like a true 2-stage mapping
>> but they are still on the way.
>>
>> For more details, please refer to this:
>> https://lore.kernel.org/all/cover.1740014950.git.nicolinc@nvidia.com/
> Thanks for the link, it helps much for understanding arm smmu arch.
>
>>>> +static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool
>> *vfio_pci)
>>>> +{
>>>> +
>>>> +    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
>>>> +        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
>>>> +        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
>>>> +        return true;
>>>> +    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
>>>> +        object_property_find(OBJECT(pdev), "iommufd"))) {
>>> Will this always return true?
>> It won't if a vfio-pci device doesn't have the "iommufd" property?
> IIUC, iommufd property is always there, just value not filled for legacy container case.
> What about checking VFIOPCIDevice.vbasedev.iommufd?
>
>>>> +        *vfio_pci = true;
>>>> +        return true;
>>>> +    }
>>>> +    return false;
>> Then, it returns "false" here.
>>
>>>> static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void
>>>> *opaque,
>>>>                                               int devfn)
>>>> {
>>>> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>>>>     SMMUState *bs = opaque;
>>>> +    bool vfio_pci = false;
>>>>     SMMUPciBus *sbus;
>>>>     SMMUv3AccelDevice *accel_dev;
>>>>     SMMUDevice *sdev;
>>>>
>>>> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
>>>> +        error_report("Device(%s) not allowed. Only PCIe root complex
>>>> devices "
>>>> +                     "or PCI bridge devices or vfio-pci endpoint
>> devices
>>>> with "
>>>> +                     "iommufd as backend is allowed with
>>>> arm-smmuv3,accel=on",
>>>> +                     pdev->name);
>>>> +        exit(1);
>>> Seems aggressive for a hotplug, could we fail hotplug instead of kill QEMU?
>> Hotplug will unlikely be supported well, as it would introduce
>> too much complication.
>>
>> With iommufd, a vIOMMU object is allocated per device (vfio). If
>> the device fd (cdev) is not yet given to the QEMU. It isn't able
>> to allocate a vIOMMU object when creating a VM.
>>
>> While a vIOMMU object can be allocated at a later stage once the
>> device is hotplugged. But things like IORT mappings aren't able
>> to get refreshed since the OS is likely already booted. Even an
>> IOMMU capability sync via the hw_info ioctl will be difficult to
>> do at the runtime post the guest iommu driver's initialization.
>>
>> I am not 100% sure. But I think Intel model could have a similar
>> problem if the guest boots with zero cold-plugged device and then
>> hot-plugs a PASID-capable device at the runtime, when the guest-
>> level IOMMU driver is already inited?
> For vtd we define a property for each capability we care about.
> When hotplug a device, we get hw_info through ioctl and compare
> host's capability with virtual vtd's property setting, if incompatible,
> we fail the hotplug.
>
> In old implementation we sync host iommu caps into virtual vtd's cap,
> but that's Naked by maintainer. The suggested way is to define property
> for each capability we care and do compatibility check.
>
> There is a "pasid" property in virtual vtd, only when it's true, the PASID-capable
> device can work with pasid.
>
> Zhenzhong

I don't think not supporting hotplug is an option. I agree with
Zhenzhong; we shall try to align with the way it is done on intel-iommu
and study whether it also fits the needs of the accelerated SMMU.

Thanks

Eric
>
>> FWIW, Shameer's cover-letter has the following line:
>> "At least one vfio-pci device must currently be cold-plugged to
>>  a PCIe root complex associated with arm-smmuv3,accel=on."
>>
>> Perhaps there should be a similar highlight in this smmuv3-accel
>> file as well (@Shameer).
>>
>> Nicolin
RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 2 months, 1 week ago
Hi Eric,

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 05 September 2025 09:15
> To: Duan, Zhenzhong <zhenzhong.duan@intel.com>; Nicolin Chen
> <nicolinc@nvidia.com>; Shameer Kolothum <skolothumtho@nvidia.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>;
> ddutile@redhat.com; berrange@redhat.com; Nathan Chen
> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; linuxarm@huawei.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; shameerkolothum@gmail.com
> Subject: Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated
> SMMUv3 to vfio-pci endpoints with iommufd
> 

[..]
> >>>> static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void
> >>>> *opaque,
> >>>>                                               int devfn)
> >>>> {
> >>>> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
> >>>>     SMMUState *bs = opaque;
> >>>> +    bool vfio_pci = false;
> >>>>     SMMUPciBus *sbus;
> >>>>     SMMUv3AccelDevice *accel_dev;
> >>>>     SMMUDevice *sdev;
> >>>>
> >>>> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> >>>> +        error_report("Device(%s) not allowed. Only PCIe root complex
> >>>> devices "
> >>>> +                     "or PCI bridge devices or vfio-pci endpoint
> >> devices
> >>>> with "
> >>>> +                     "iommufd as backend is allowed with
> >>>> arm-smmuv3,accel=on",
> >>>> +                     pdev->name);
> >>>> +        exit(1);
> >>> Seems aggressive for a hotplug, could we fail hotplug instead of kill
> QEMU?
> >> Hotplug will unlikely be supported well, as it would introduce
> >> too much complication.
> >>
> >> With iommufd, a vIOMMU object is allocated per device (vfio). If
> >> the device fd (cdev) is not yet given to the QEMU. It isn't able
> >> to allocate a vIOMMU object when creating a VM.
> >>
> >> While a vIOMMU object can be allocated at a later stage once the
> >> device is hotplugged. But things like IORT mappings aren't able
> >> to get refreshed since the OS is likely already booted. Even an
> >> IOMMU capability sync via the hw_info ioctl will be difficult to
> >> do at the runtime post the guest iommu driver's initialization.
> >>
> >> I am not 100% sure. But I think Intel model could have a similar
> >> problem if the guest boots with zero cold-plugged device and then
> >> hot-plugs a PASID-capable device at the runtime, when the guest-
> >> level IOMMU driver is already inited?
> > For vtd we define a property for each capability we care about.
> > When hotplug a device, we get hw_info through ioctl and compare
> > host's capability with virtual vtd's property setting, if incompatible,
> > we fail the hotplug.
> >
> > In old implementation we sync host iommu caps into virtual vtd's cap,
> > but that's Naked by maintainer. The suggested way is to define property
> > for each capability we care and do compatibility check.
> >
> > There is a "pasid" property in virtual vtd, only when it's true, the PASID-
> capable
> > device can work with pasid.
> >
> > Zhenzhong
> 
> I don't think this is an option not to support hotplug. I agree with
> Zhenzhong, we shall try to align with the way it is done on intel-iommu
> and study whether it also fits the needs for accelerated smmu.

Hotplug is supported. The only current requirement is that we need at least
one cold-plugged device to retrieve the host SMMUv3 features. These are
then used to update the vSMMUv3 features during the reset phase so that
the Guest can probe them at boot.

Based on the discussions in this series, we plan to remove that dependency
going forward. Instead, the accelerated vSMMUv3 will be initialized based
on the current emulated SMMUv3 features, plus any additional features
specified by the user on the command line.

During every device attachment to the accelerated SMMUv3, we will
cross-check for any disparities and allow or disallow the attachment
accordingly.
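
The cross-check could be as simple as a feature-mask comparison at
attach time; a sketch, with hypothetical field names:

    static bool smmuv3_accel_check_attach(uint64_t vsmmu_features,
                                          uint64_t host_features,
                                          Error **errp)
    {
        uint64_t missing = vsmmu_features & ~host_features;

        /* Disallow the attachment if the vSMMUv3 advertises anything the
         * host SMMUv3 (per IOMMU_GET_HW_INFO) cannot back. */
        if (missing) {
            error_setg(errp, "vSMMUv3 features 0x%llx not supported by "
                       "host SMMUv3", (unsigned long long)missing);
            return false;
        }
        return true;
    }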

Please let me know if you see any drawbacks with this approach.

Thanks,
Shameer
	



Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Eric Auger 2 months ago
Hi Shameer,

On 9/8/25 9:41 AM, Shameer Kolothum wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 05 September 2025 09:15
>> To: Duan, Zhenzhong <zhenzhong.duan@intel.com>; Nicolin Chen
>> <nicolinc@nvidia.com>; Shameer Kolothum <skolothumtho@nvidia.com>
>> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
>> peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>;
>> ddutile@redhat.com; berrange@redhat.com; Nathan Chen
>> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; linuxarm@huawei.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; shameerkolothum@gmail.com
>> Subject: Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated
>> SMMUv3 to vfio-pci endpoints with iommufd
>>
> [..]
>>>>>> static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void
>>>>>> *opaque,
>>>>>>                                               int devfn)
>>>>>> {
>>>>>> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>>>>>>     SMMUState *bs = opaque;
>>>>>> +    bool vfio_pci = false;
>>>>>>     SMMUPciBus *sbus;
>>>>>>     SMMUv3AccelDevice *accel_dev;
>>>>>>     SMMUDevice *sdev;
>>>>>>
>>>>>> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
>>>>>> +        error_report("Device(%s) not allowed. Only PCIe root complex
>>>>>> devices "
>>>>>> +                     "or PCI bridge devices or vfio-pci endpoint
>>>> devices
>>>>>> with "
>>>>>> +                     "iommufd as backend is allowed with
>>>>>> arm-smmuv3,accel=on",
>>>>>> +                     pdev->name);
>>>>>> +        exit(1);
>>>>> Seems aggressive for a hotplug, could we fail hotplug instead of kill
>> QEMU?
>>>> Hotplug will unlikely be supported well, as it would introduce
>>>> too much complication.
>>>>
>>>> With iommufd, a vIOMMU object is allocated per device (vfio). If
>>>> the device fd (cdev) is not yet given to the QEMU. It isn't able
>>>> to allocate a vIOMMU object when creating a VM.
>>>>
>>>> While a vIOMMU object can be allocated at a later stage once the
>>>> device is hotplugged. But things like IORT mappings aren't able
>>>> to get refreshed since the OS is likely already booted. Even an
>>>> IOMMU capability sync via the hw_info ioctl will be difficult to
>>>> do at the runtime post the guest iommu driver's initialization.
>>>>
>>>> I am not 100% sure. But I think Intel model could have a similar
>>>> problem if the guest boots with zero cold-plugged device and then
>>>> hot-plugs a PASID-capable device at the runtime, when the guest-
>>>> level IOMMU driver is already inited?
>>> For vtd we define a property for each capability we care about.
>>> When hotplug a device, we get hw_info through ioctl and compare
>>> host's capability with virtual vtd's property setting, if incompatible,
>>> we fail the hotplug.
>>>
>>> In old implementation we sync host iommu caps into virtual vtd's cap,
>>> but that's Naked by maintainer. The suggested way is to define property
>>> for each capability we care and do compatibility check.
>>>
>>> There is a "pasid" property in virtual vtd, only when it's true, the PASID-
>> capable
>>> device can work with pasid.
>>>
>>> Zhenzhong
>> I don't think this is an option not to support hotplug. I agree with
>> Zhenzhong, we shall try to align with the way it is done on intel-iommu
>> and study whether it also fits the needs for accelerated smmu.
> Hotplug is supported. The only current requirement is that we need at least
> one cold-plugged device to retrieve the host SMMUv3 features. These are
> then used to update the vSMMUv3 features during the reset phase so that
> the Guest can probe them at boot.
>
> Based on the discussions in this series, we plan to remove that dependency
> going forward. Instead, the accelerated vSMMUv3 will be initialized based
> on the current emulated SMMUv3 features, plus any additional features
> specified by the user on the command line.
>
> During every device attachment to the accelerated SMMUv3, we will
> cross-check for any disparities and allow or disallow the attachment
> accordingly.
>
> Please let me know if you see any drawbacks with this approach.
This looks aligned with the intel-iommu strategy and up to now I don't
see any cons. Let's try to relax the requirement and we will see if it
brings further issues ...

Thanks

Eric
>
> Thanks,
> Shameer
> 	
>
>
>
RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameerali Kolothum Thodi via 4 months ago

> -----Original Message-----
> From: Duan, Zhenzhong <zhenzhong.duan@intel.com>
> Sent: Wednesday, July 16, 2025 7:26 AM
> To: Nicolin Chen <nicolinc@nvidia.com>
> Cc: Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>; qemu-arm@nongnu.org;
> qemu-devel@nongnu.org; eric.auger@redhat.com;
> peter.maydell@linaro.org; jgg@nvidia.com; ddutile@redhat.com;
> berrange@redhat.com; nathanc@nvidia.com; mochs@nvidia.com;
> smostafa@google.com; Linuxarm <linuxarm@huawei.com>; Wangzhou (B)
> <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>;
> zhangfei.gao@linaro.org; shameerkolothum@gmail.com
> Subject: RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
> accelerated SMMUv3 to vfio-pci endpoints with iommufd
> 

...

> >> >+static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool
> >*vfio_pci)
> >> >+{
> >> >+
> >> >+    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
> >> >+        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
> >> >+        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
> >> >+        return true;
> >> >+    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
> >> >+        object_property_find(OBJECT(pdev), "iommufd"))) {
> >>
> >> Will this always return true?
> >
> >It won't if a vfio-pci device doesn't have the "iommufd" property?
> 
> IIUC, iommufd property is always there, just value not filled for legacy
> container case.
> What about checking VFIOPCIDevice.vbasedev.iommufd?

That's right. The property is always there. But instead of accessing VFIOPCIDevice
in SMMUv3 code, I think we can use object_property_get_link(obj, "iommufd", &error_abort)
instead?
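
That would make the check look something like this (a sketch of the
suggested change, not tested):

    static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
    {
        if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
            object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
            object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
            return true;
        }
        /* Resolve the link value rather than just testing that the
         * property exists, so a vfio-pci device using the legacy
         * container (link left unset) is rejected. */
        if (object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
            object_property_get_link(OBJECT(pdev), "iommufd", &error_abort)) {
            *vfio_pci = true;
            return true;
        }
        return false;
    }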
 
> >
> >> >+        *vfio_pci = true;
> >> >+        return true;
> >> >+    }
> >> >+    return false;
> >
> >Then, it returns "false" here.
> >
> >> > static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void
> >> >*opaque,
> >> >                                               int devfn)
> >> > {
> >> >+    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
> >> >     SMMUState *bs = opaque;
> >> >+    bool vfio_pci = false;
> >> >     SMMUPciBus *sbus;
> >> >     SMMUv3AccelDevice *accel_dev;
> >> >     SMMUDevice *sdev;
> >> >
> >> >+    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> >> >+        error_report("Device(%s) not allowed. Only PCIe root complex
> >> >devices "
> >> >+                     "or PCI bridge devices or vfio-pci endpoint
> >devices
> >> >with "
> >> >+                     "iommufd as backend is allowed with
> >> >arm-smmuv3,accel=on",
> >> >+                     pdev->name);
> >> >+        exit(1);
> >>
> >> Seems aggressive for a hotplug, could we fail hotplug instead of kill
> QEMU?
> >
> >Hotplug will unlikely be supported well, as it would introduce
> >too much complication.
> >
> >With iommufd, a vIOMMU object is allocated per device (vfio). If
> >the device fd (cdev) is not yet given to the QEMU. It isn't able
> >to allocate a vIOMMU object when creating a VM.
> >
> >While a vIOMMU object can be allocated at a later stage once the
> >device is hotplugged. But things like IORT mappings aren't able
> >to get refreshed since the OS is likely already booted. Even an
> >IOMMU capability sync via the hw_info ioctl will be difficult to
> >do at the runtime post the guest iommu driver's initialization.
> >
> >I am not 100% sure. But I think Intel model could have a similar
> >problem if the guest boots with zero cold-plugged device and then
> >hot-plugs a PASID-capable device at the runtime, when the guest-
> >level IOMMU driver is already inited?
> 
> For vtd we define a property for each capability we care about.
> When hotplug a device, we get hw_info through ioctl and compare
> host's capability with virtual vtd's property setting, if incompatible,
> we fail the hotplug.
> 
> In old implementation we sync host iommu caps into virtual vtd's cap,
> but that's Naked by maintainer. The suggested way is to define property
> for each capability we care and do compatibility check.
> 
> There is a "pasid" property in virtual vtd, only when it's true, the PASID-
> capable
> device can work with pasid.

Thanks for this information. I think we probably need to take a look at this,
as it doesn't have a dependency on a cold-plugged device being present for
SMMUv3. Will go through the intel vtd implementation.

Thanks,
Shameer
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Nicolin Chen 4 months ago
On Wed, Jul 16, 2025 at 09:34:04AM +0000, Shameerali Kolothum Thodi wrote:
> > >> Seems aggressive for a hotplug, could we fail hotplug instead of kill
> > QEMU?
> > >
> > >Hotplug will unlikely be supported well, as it would introduce
> > >too much complication.
> > >
> > >With iommufd, a vIOMMU object is allocated per device (vfio). If
> > >the device fd (cdev) is not yet given to the QEMU. It isn't able
> > >to allocate a vIOMMU object when creating a VM.
> > >
> > >While a vIOMMU object can be allocated at a later stage once the
> > >device is hotplugged. But things like IORT mappings aren't able
> > >to get refreshed since the OS is likely already booted. Even an
> > >IOMMU capability sync via the hw_info ioctl will be difficult to
> > >do at the runtime post the guest iommu driver's initialization.
> > >
> > >I am not 100% sure. But I think Intel model could have a similar
> > >problem if the guest boots with zero cold-plugged device and then
> > >hot-plugs a PASID-capable device at the runtime, when the guest-
> > >level IOMMU driver is already inited?
> > 
> > For vtd we define a property for each capability we care about.
> > When hotplug a device, we get hw_info through ioctl and compare
> > host's capability with virtual vtd's property setting, if incompatible,
> > we fail the hotplug.
> > 
> > In old implementation we sync host iommu caps into virtual vtd's cap,
> > but that's Naked by maintainer. The suggested way is to define property
> > for each capability we care and do compatibility check.
> > 
> > There is a "pasid" property in virtual vtd, only when it's true, the PASID-
> > capable
> > device can work with pasid.
> 
> Thanks for this information. I think probably we need to take a look at this as
> this doesn't have a dependency on cold-plug device to be present for SMMUv3.
> Will go through intel vtd implementation.

I see. A compatibility test sounds promising.

It still feels tricky when dealing with multiple vSMMU instances, if
some instances don't have a cold-plugged device to poll hw_info from. We
would need to pre-define all the feature bits. Then, run the test
on every hotplug device attached later to the vSMMU instance.

Maybe we could do something clever:
The sysfs node provides all the IOMMU nodes. So, we could compare
the node names to see if they are likely symmetric or not. Nodes
sharing the same naming pattern are more likely created by the
same IOMMU driver. So, as a speculation, a vSMMU instance with no
cold-plugged device could borrow the bits from a vSMMU instance with
a device?

Sure, individual IOMMU instances could differ in specific fields
despite using the same node name. This would unfortunately lead
to hotplug failure upon the compatibility check.
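
A toy version of that name-pattern comparison (purely illustrative):

    #include <stdbool.h>
    #include <string.h>

    /* Treat two sysfs IOMMU node names as "likely symmetric" when they
     * share the same non-numeric prefix, e.g. "smmu3.0x...". */
    static bool iommu_nodes_look_symmetric(const char *a, const char *b)
    {
        size_t n = strcspn(a, "0123456789");
        return n == strcspn(b, "0123456789") && strncmp(a, b, n) == 0;
    }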

Thanks
Nicolin
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Eric Auger 2 months, 1 week ago
Hi Nicolin

On 7/16/25 7:51 PM, Nicolin Chen wrote:
> On Wed, Jul 16, 2025 at 09:34:04AM +0000, Shameerali Kolothum Thodi wrote:
>>>>> Seems aggressive for a hotplug, could we fail hotplug instead of kill
>>> QEMU?
>>>> Hotplug will unlikely be supported well, as it would introduce
>>>> too much complication.
>>>>
>>>> With iommufd, a vIOMMU object is allocated per device (vfio). If
>>>> the device fd (cdev) is not yet given to the QEMU. It isn't able
>>>> to allocate a vIOMMU object when creating a VM.
>>>>
>>>> While a vIOMMU object can be allocated at a later stage once the
>>>> device is hotplugged. But things like IORT mappings aren't able
>>>> to get refreshed since the OS is likely already booted. Even an
>>>> IOMMU capability sync via the hw_info ioctl will be difficult to
>>>> do at the runtime post the guest iommu driver's initialization.
>>>>
>>>> I am not 100% sure. But I think Intel model could have a similar
>>>> problem if the guest boots with zero cold-plugged device and then
>>>> hot-plugs a PASID-capable device at the runtime, when the guest-
>>>> level IOMMU driver is already inited?
>>> For vtd we define a property for each capability we care about.
>>> When hotplug a device, we get hw_info through ioctl and compare
>>> host's capability with virtual vtd's property setting, if incompatible,
>>> we fail the hotplug.
>>>
>>> In old implementation we sync host iommu caps into virtual vtd's cap,
>>> but that's Naked by maintainer. The suggested way is to define property
>>> for each capability we care and do compatibility check.
>>>
>>> There is a "pasid" property in virtual vtd, only when it's true, the PASID-
>>> capable
>>> device can work with pasid.
>> Thanks for this information. I think we probably need to take a look at
>> this, as it doesn't depend on a cold-plug device being present for
>> SMMUv3. Will go through the intel vtd implementation.
> I see. A compatibility test sounds promising.
>
> It still feels tricky when dealing with multiple vSMMU instances, if
> some instances don't have a cold-plug device to poll hw_info from. We
> would need to pre-define all the feature bits, then run the test on
> every hotplug device attached later to the vSMMU instance.

This is what looks the most sensible to me.
>
> Maybe we could do something clever:
> The sysfs node provides all the IOMMU nodes. So, we could compare
> the node names to see if they are likely symmetric or not. Nodes
> sharing the same naming pattern are more likely created by the
> same IOMMU driver. So, speculatively, a vSMMU instance with no
> coldplug device could borrow the bits from a vSMMU instance with
> a device?
Then, instead of trying to match names, I think it would be cleaner to
pass the sysfs path. But I would rather explore the "collect info as
they come" way.

Thanks

Eric
>
> Sure, individual IOMMU instances could differ in specific fields
> despite using the same node name. This would unfortunately lead
> to hotplug failure upon the compatibility check.
>
> Thanks
> Nicolin
>
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Nicolin Chen 4 months ago
On Wed, Jul 16, 2025 at 10:51:23AM -0700, Nicolin Chen wrote:
> On Wed, Jul 16, 2025 at 09:34:04AM +0000, Shameerali Kolothum Thodi wrote:
> > > >> Seems aggressive for a hotplug, could we fail hotplug instead of
> > > >> killing QEMU?
> > > >
> > > >Hotplug is unlikely to be supported well, as it would introduce
> > > >too much complication.
> > > >
> > > >With iommufd, a vIOMMU object is allocated per (vfio) device. If
> > > >the device fd (cdev) is not yet given to QEMU, it isn't able
> > > >to allocate a vIOMMU object when creating a VM.
> > > >
> > > >While a vIOMMU object can be allocated at a later stage once the
> > > >device is hotplugged, things like IORT mappings can't be refreshed
> > > >since the OS has likely already booted. Even an IOMMU capability
> > > >sync via the hw_info ioctl will be difficult to do at runtime,
> > > >after the guest iommu driver's initialization.
> > > >
> > > >I am not 100% sure, but I think the Intel model could have a
> > > >similar problem if the guest boots with zero cold-plugged devices
> > > >and then hot-plugs a PASID-capable device at runtime, when the
> > > >guest-level IOMMU driver is already initialized?
> > > 
> > > For vtd we define a property for each capability we care about.
> > > When hotplugging a device, we get hw_info through an ioctl and
> > > compare the host's capabilities with the virtual vtd's property
> > > settings; if they are incompatible, we fail the hotplug.
> > > 
> > > In the old implementation we synced the host iommu caps into the
> > > virtual vtd's caps, but that was NAKed by the maintainer. The
> > > suggested way is to define a property for each capability we care
> > > about and do a compatibility check.
> > > 
> > > There is a "pasid" property in the virtual vtd; only when it's true
> > > can a PASID-capable device work with PASID.
> > 
> > Thanks for this information. I think we probably need to take a look
> > at this, as it doesn't depend on a cold-plug device being present for
> > SMMUv3. Will go through the intel vtd implementation.
> 
> I see. A compatibility test sounds promising.
> 
> It still feels tricky when dealing with multiple vSMMU instances, if
> some instances don't have a cold-plug device to poll hw_info from. We
> would need to pre-define all the feature bits, then run the test on
> every hotplug device attached later to the vSMMU instance.
> 
> Maybe we could do something clever:
> The sysfs node provides all the IOMMU nodes. So, we could compare
> the node names to see if they are likely symmetric or not. Nodes
> sharing the same naming pattern are more likely created by the
> same IOMMU driver. So, speculatively, a vSMMU instance with no
> coldplug device could borrow the bits from a vSMMU instance with
> a device?
> 
> Sure, individual IOMMU instances could differ in specific fields
> despite using the same node name. This would unfortunately lead
> to hotplug failure upon the compatibility check.

Hmm, forget what I said here. Each vSMMU instance should be
pre-defined with a list of parameters. So, we will need to run the
compatibility test not only for hotplug devices, but for coldplug
ones too.

Thanks
Nicolin
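
Continuing the hypothetical sketch from earlier in the thread, the
conclusion above simply means the same check runs on both attach paths;
viommu_attach_device() is likewise an illustrative name, not from the
series:

/* Hypothetical: one test shared by the coldplug and hotplug paths */
static bool viommu_attach_device(const VIOMMUProps *prop,
                                 const HostIOMMUCaps *host)
{
    /* Fail the attach of this one device only, never the whole VM */
    return viommu_check_compat(prop, host);
}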
RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Duan, Zhenzhong 4 months ago

>-----Original Message-----
>From: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
>Subject: RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
>accelerated SMMUv3 to vfio-pci endpoints with iommufd
>
>
>
>> -----Original Message-----
>> From: Duan, Zhenzhong <zhenzhong.duan@intel.com>
>> Sent: Wednesday, July 16, 2025 7:26 AM
>> To: Nicolin Chen <nicolinc@nvidia.com>
>> Cc: Shameerali Kolothum Thodi
>> <shameerali.kolothum.thodi@huawei.com>; qemu-arm@nongnu.org;
>> qemu-devel@nongnu.org; eric.auger@redhat.com;
>> peter.maydell@linaro.org; jgg@nvidia.com; ddutile@redhat.com;
>> berrange@redhat.com; nathanc@nvidia.com; mochs@nvidia.com;
>> smostafa@google.com; Linuxarm <linuxarm@huawei.com>; Wangzhou (B)
>> <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>;
>> Jonathan Cameron <jonathan.cameron@huawei.com>;
>> zhangfei.gao@linaro.org; shameerkolothum@gmail.com
>> Subject: RE: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict
>> accelerated SMMUv3 to vfio-pci endpoints with iommufd
>>
>
>...
>
>> >> >+static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
>> >> >+{
>> >> >+
>> >> >+    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
>> >> >+        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
>> >> >+        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
>> >> >+        return true;
>> >> >+    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
>> >> >+        object_property_find(OBJECT(pdev), "iommufd"))) {
>> >>
>> >> Will this always return true?
>> >
>> >It won't if a vfio-pci device doesn't have the "iommufd" property?
>>
>> IIUC, the iommufd property is always there, just with its value left
>> unset for the legacy container case.
>> What about checking VFIOPCIDevice.vbasedev.iommufd?
>
>That's right. The property is always there. But instead of accessing
>VFIOPCIDevice in SMMUv3 code, I think we can use
>object_property_get_link(obj, "iommufd", &error_abort) instead?

Yes, looks better.
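
For reference, a sketch of how the helper from the patch might look with
that suggestion folded in (untested; everything except the
object_property_get_link() call is unchanged from the RFC):

static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
{
    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {
        return true;
    } else if (object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
               object_property_get_link(OBJECT(pdev), "iommufd",
                                        &error_abort)) {
        /* The "iommufd" link is only set when the iommufd backend is used */
        *vfio_pci = true;
        return true;
    }
    return false;
}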
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Jonathan Cameron via 4 months ago
On Mon, 14 Jul 2025 16:59:32 +0100
Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> wrote:

> Accelerated SMMUv3 is only useful when the device can take advantage of
> the host's SMMUv3 in nested mode. To keep things simple and correct, we
> only allow this feature for vfio-pci endpoint devices that use the iommufd
> backend. We also allow non-endpoint emulated devices like PCI bridges and
> root ports, so that users can plug in these vfio-pci devices.
> 
> Another reason for this limit is to avoid problems with IOTLB
> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
> SID, making it difficult to trace the originating device. If we allowed
> emulated endpoint devices, QEMU would have to invalidate both its own
> software IOTLB and the host's hardware IOTLB, which could slow things
> down.
> 
> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> translation (S1+S2), their get_address_space() callback must return the
> system address space to enable correct S2 mappings of guest RAM.
> 
> So in short:
>  - vfio-pci devices return the system address space
>  - bridges and root ports return the IOMMU address space
> 
> Note: On ARM, MSI doorbell addresses are also translated via SMMUv3.
> Hence, if a vfio-pci device is behind the SMMuv3 with translation enabled,
> it must return the IOMMU address space for MSI. Support for this will be
> added in a follow-up patch.
> 
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> ---
>  hw/arm/smmuv3-accel.c               | 50 ++++++++++++++++++++++++++++-
>  hw/arm/smmuv3-accel.h               | 15 +++++++++
>  hw/arm/smmuv3.c                     |  4 +++
>  hw/pci-bridge/pci_expander_bridge.c |  1 -
>  include/hw/arm/smmuv3.h             |  1 +
>  include/hw/pci/pci_bridge.h         |  1 +
>  6 files changed, 70 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 2eac9c6ff4..0b0ddb03e2 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -7,13 +7,19 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/error-report.h"
>  
>  #include "hw/arm/smmuv3.h"
> +#include "hw/pci/pci_bridge.h"
> +#include "hw/pci-host/gpex.h"
> +#include "hw/vfio/pci.h"
> +
>  #include "smmuv3-accel.h"
>  
>  static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
>                                                  PCIBus *bus, int devfn)
>  {
> +    SMMUv3State *s = ARM_SMMUV3(bs);
>      SMMUDevice *sdev = sbus->pbdev[devfn];
>      SMMUv3AccelDevice *accel_dev;
>  
> @@ -25,30 +31,72 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
>  
>          sbus->pbdev[devfn] = sdev;
>          smmu_init_sdev(bs, sdev, bus, devfn);
> +        address_space_init(&accel_dev->as_sysmem, &s->s_accel->root,
> +                           "smmuv3-accel-sysmem");
>      }
>  
>      return accel_dev;
>  }
>  
> +static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
> +{
> +
> +    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
> +        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||
> +        object_dynamic_cast(OBJECT(pdev), "gpex-root")) {

Include gpex.h and TYPE_GPEX_ROOT_DEVICE
TYPE_IOMMUFD_BACKEND in iommufd.h

etc.



> +        return true;
> +    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI) &&
> +        object_property_find(OBJECT(pdev), "iommufd"))) {
> +        *vfio_pci = true;
> +        return true;
> +    }
> +    return false;
> +}
> +
>  static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
>                                                int devfn)
>  {
> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>      SMMUState *bs = opaque;
> +    bool vfio_pci = false;
>      SMMUPciBus *sbus;
>      SMMUv3AccelDevice *accel_dev;
>      SMMUDevice *sdev;
>  
> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> +        error_report("Device(%s) not allowed. Only PCIe root complex devices "
> +                     "or PCI bridge devices or vfio-pci endpoint devices with "
> +                     "iommufd as backend is allowed with arm-smmuv3,accel=on",
> +                     pdev->name);
> +        exit(1);
> +    }
>      sbus = smmu_get_sbus(bs, bus);
>      accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
>      sdev = &accel_dev->sdev;
>  
> -    return &sdev->as;
> +    if (vfio_pci) {
> +        return &accel_dev->as_sysmem;
> +    } else {
> +        return &sdev->as;
> +    }
>  }
>  
>  static const PCIIOMMUOps smmuv3_accel_ops = {
>      .get_address_space = smmuv3_accel_find_add_as,
>  };
>  
> +void smmuv3_accel_init(SMMUv3State *s)
> +{
> +    SMMUv3AccelState *s_accel;
> +
> +    s->s_accel = s_accel = g_new0(SMMUv3AccelState, 1);
> +    memory_region_init(&s_accel->root, OBJECT(s), "root", UINT64_MAX);
> +    memory_region_init_alias(&s_accel->sysmem, OBJECT(s),
> +                             "smmuv3-accel-sysmem", get_system_memory(), 0,
> +                             memory_region_size(get_system_memory()));
> +    memory_region_add_subregion(&s_accel->root, 0, &s_accel->sysmem);
> +}
> +
>  static void smmuv3_accel_class_init(ObjectClass *oc, const void *data)
>  {
>      SMMUBaseClass *sbc = ARM_SMMU_CLASS(oc);

> diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> index d183a62766..3bdb92391a 100644
> --- a/include/hw/arm/smmuv3.h
> +++ b/include/hw/arm/smmuv3.h
> @@ -63,6 +63,7 @@ struct SMMUv3State {
>      qemu_irq     irq[4];
>      QemuMutex mutex;
>      char *stage;
> +    struct SMMUv3AccelState  *s_accel;

bonus space.

>  };
>  
>  typedef enum {
Re: [RFC PATCH v3 06/15] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Nicolin Chen 4 months ago
On Mon, Jul 14, 2025 at 04:59:32PM +0100, Shameer Kolothum wrote:
> Accelerated SMMUv3 is only useful when the device can take advantage of
> the host's SMMUv3 in nested mode. To keep things simple and correct, we
> only allow this feature for vfio-pci endpoint devices that use the iommufd
> backend. We also allow non-endpoint emulated devices like PCI bridges and
> root ports, so that users can plug in these vfio-pci devices.
> 
> Another reason for this limit is to avoid problems with IOTLB
> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
> SID, making it difficult to trace the originating device. If we allowed
> emulated endpoint devices, QEMU would have to invalidate both its own
> software IOTLB and the host's hardware IOTLB, which could slow things
> down.

Change looks good to me,

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

Some nits:

> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> translation (S1+S2), their get_address_space() callback must return the
> system address space to enable correct S2 mappings of guest RAM.
>
> So in short:
>  - vfio-pci devices return the system address space
>  - bridges and root ports return the IOMMU address space

I think we can spare a few more words here and in the code too:
(a) A vfio-pci device in accelerated mode doesn't need any of
    the features from the iommu address space, since translation
    and IOTLB maintenance will be handled by the real hardware.
(b) In the HW accelerated mode, the VFIO core will allocate the
    S2 nesting parent HWPT on top of a core-managed IOAS for S2
    mappings. So, returning the system address space allows QEMU
    to take advantage of that.
(c) The reason why bridges and root ports can't use the system
    address space.

Feel free to organize these in your preferred words.

> Note: On ARM, MSI doorbell addresses are also translated via SMMUv3.
> Hence, if a vfio-pci device is behind the SMMuv3 with translation enabled,

s/SMMuv3/SMMUv3
  
> +static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
> +{
> +
> +    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
> +        object_dynamic_cast(OBJECT(pdev), "pxb-pcie") ||

"TYPE_PXB_PCIE_DEV" since we moved it to the header in this patch.