[PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd

Shameer Kolothum posted 27 patches 1 month, 2 weeks ago
There is a newer version of this series
[PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 1 month, 2 weeks ago
Accelerated SMMUv3 is only useful when the device can take advantage of
the host's SMMUv3 in nested mode. To keep things simple and correct, we
only allow this feature for vfio-pci endpoint devices that use the iommufd
backend. We also allow non-endpoint emulated devices like PCI bridges and
root ports, so that users can plug in these vfio-pci devices. We can only
enforce this if devices are cold plugged. For hotplug cases, give appropriate
warnings.

Another reason for this limit is to avoid problems with IOTLB
invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
SID, making it difficult to trace the originating device. If we allowed
emulated endpoint devices, QEMU would have to invalidate both its own
software IOTLB and the host's hardware IOTLB, which could slow things
down.

Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
translation (S1+S2), their get_address_space() callback must return the
system address space so that the VFIO core can set up correct S2 mappings
for guest RAM.

So in short:
 - vfio-pci devices (with iommufd as backend) return the system address
   space.
 - bridges and root ports return the IOMMU address space.

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/arm/smmuv3-accel.c               | 68 ++++++++++++++++++++++++++++-
 hw/pci-bridge/pci_expander_bridge.c |  1 -
 include/hw/pci/pci_bridge.h         |  1 +
 3 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index 79f1713be6..44410cfb2a 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -7,8 +7,13 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/error-report.h"
 
 #include "hw/arm/smmuv3.h"
+#include "hw/pci/pci_bridge.h"
+#include "hw/pci-host/gpex.h"
+#include "hw/vfio/pci.h"
+
 #include "smmuv3-accel.h"
 
 static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
@@ -29,15 +34,76 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
     return accel_dev;
 }
 
+static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
+{
+
+    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
+        object_dynamic_cast(OBJECT(pdev), TYPE_PXB_PCIE_DEV) ||
+        object_dynamic_cast(OBJECT(pdev), TYPE_GPEX_ROOT_DEVICE)) {
+        return true;
+    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI))) {
+        *vfio_pci = true;
+        if (object_property_get_link(OBJECT(pdev), "iommufd", NULL)) {
+            return true;
+        }
+    }
+    return false;
+}
+
 static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
                                               int devfn)
 {
+    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
     SMMUState *bs = opaque;
     SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
     SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
     SMMUDevice *sdev = &accel_dev->sdev;
+    bool vfio_pci = false;
+
+    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
+        if (DEVICE(pdev)->hotplugged) {
+            if (vfio_pci) {
+                warn_report("Hot plugging a vfio-pci device (%s) without "
+                            "iommufd as backend is not supported", pdev->name);
+            } else {
+                warn_report("Hot plugging an emulated device %s with "
+                            "accelerated SMMUv3. This will bring down "
+                            "performace", pdev->name);
+            }
+            /*
+             * Both cases, we will return IOMMU address space. For hotplugged
+             * vfio-pci dev without iommufd as backend, it will fail later in
+             * smmuv3_notify_flag_changed() with "requires iommu MAP notifier"
+             * error message.
+             */
+             return &sdev->as;
+        } else {
+            error_report("Device(%s) not allowed. Only PCIe root complex "
+                         "devices or PCI bridge devices or vfio-pci endpoint "
+                         "devices with iommufd as backend is allowed with "
+                         "arm-smmuv3,accel=on", pdev->name);
+            exit(1);
+        }
+    }
 
-    return &sdev->as;
+    /*
+     * We return the system address for vfio-pci devices(with iommufd as
+     * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
+     * guest RAM. This is needed because, in the accelerated SMMUv3 case,
+     * the host SMMUv3 runs in nested (S1 + S2)  mode where the guest
+     * manages its own S1 page tables while the host manages S2.
+     *
+     * We are using the global &address_space_memory here, as this will ensure
+     * same system address space pointer for all devices behind the accelerated
+     * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
+     * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
+     * within the VM instead of duplicating them for every SMMUv3 instance.
+     */
+    if (vfio_pci) {
+        return &address_space_memory;
+    } else {
+        return &sdev->as;
+    }
 }
 
 static const PCIIOMMUOps smmuv3_accel_ops = {
diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
index 1bcceddbc4..a8eb2d2426 100644
--- a/hw/pci-bridge/pci_expander_bridge.c
+++ b/hw/pci-bridge/pci_expander_bridge.c
@@ -48,7 +48,6 @@ struct PXBBus {
     char bus_path[8];
 };
 
-#define TYPE_PXB_PCIE_DEV "pxb-pcie"
 OBJECT_DECLARE_SIMPLE_TYPE(PXBPCIEDev, PXB_PCIE_DEV)
 
 static GList *pxb_dev_list;
diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
index a055fd8d32..b61360b900 100644
--- a/include/hw/pci/pci_bridge.h
+++ b/include/hw/pci/pci_bridge.h
@@ -106,6 +106,7 @@ typedef struct PXBPCIEDev {
 
 #define TYPE_PXB_PCIE_BUS "pxb-pcie-bus"
 #define TYPE_PXB_CXL_BUS "pxb-cxl-bus"
+#define TYPE_PXB_PCIE_DEV "pxb-pcie"
 #define TYPE_PXB_DEV "pxb"
 OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV)
 
-- 
2.43.0
Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Eric Auger 3 weeks, 4 days ago

On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> Accelerated SMMUv3 is only useful when the device can take advantage of
> the host's SMMUv3 in nested mode. To keep things simple and correct, we
> only allow this feature for vfio-pci endpoint devices that use the iommufd
> backend. We also allow non-endpoint emulated devices like PCI bridges and
> root ports, so that users can plug in these vfio-pci devices. We can only
> enforce this if devices are cold plugged. For hotplug cases, give appropriate
> warnings.
>
> Another reason for this limit is to avoid problems with IOTLB
> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
> SID, making it difficult to trace the originating device. If we allowed
> emulated endpoint devices, QEMU would have to invalidate both its own
> software IOTLB and the host's hardware IOTLB, which could slow things
> down.
>
> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> translation (S1+S2), their get_address_space() callback must return the
> system address space so that VFIO core can setup correct S2 mappings
> for guest RAM.
>
> So in short:
>  - vfio-pci devices(with iommufd as backend) return the system address
>    space.
>  - bridges and root ports return the IOMMU address space.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/arm/smmuv3-accel.c               | 68 ++++++++++++++++++++++++++++-
>  hw/pci-bridge/pci_expander_bridge.c |  1 -
>  include/hw/pci/pci_bridge.h         |  1 +
>  3 files changed, 68 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 79f1713be6..44410cfb2a 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -7,8 +7,13 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/error-report.h"
>  
>  #include "hw/arm/smmuv3.h"
> +#include "hw/pci/pci_bridge.h"
> +#include "hw/pci-host/gpex.h"
> +#include "hw/vfio/pci.h"
> +
>  #include "smmuv3-accel.h"
>  
>  static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
> @@ -29,15 +34,76 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
>      return accel_dev;
>  }
>  
> +static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
> +{
> +
> +    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
> +        object_dynamic_cast(OBJECT(pdev), TYPE_PXB_PCIE_DEV) ||
> +        object_dynamic_cast(OBJECT(pdev), TYPE_GPEX_ROOT_DEVICE)) {
> +        return true;
> +    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI))) {
> +        *vfio_pci = true;
> +        if (object_property_get_link(OBJECT(pdev), "iommufd", NULL)) {
> +            return true;
> +        }
> +    }
> +    return false;
> +}
> +
>  static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
>                                                int devfn)
>  {
> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>      SMMUState *bs = opaque;
>      SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
>      SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
>      SMMUDevice *sdev = &accel_dev->sdev;
> +    bool vfio_pci = false;
> +
> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> +        if (DEVICE(pdev)->hotplugged) {
> +            if (vfio_pci) {
> +                warn_report("Hot plugging a vfio-pci device (%s) without "
> +                            "iommufd as backend is not supported", pdev->name);
> +            } else {
> +                warn_report("Hot plugging an emulated device %s with "
> +                            "accelerated SMMUv3. This will bring down "
> +                            "performace", pdev->name);
> +            }
> +            /*
> +             * Both cases, we will return IOMMU address space. For hotplugged
> +             * vfio-pci dev without iommufd as backend, it will fail later in
> +             * smmuv3_notify_flag_changed() with "requires iommu MAP notifier"
> +             * error message.
> +             */
> +             return &sdev->as;
> +        } else {
> +            error_report("Device(%s) not allowed. Only PCIe root complex "
> +                         "devices or PCI bridge devices or vfio-pci endpoint "
> +                         "devices with iommufd as backend is allowed with "
> +                         "arm-smmuv3,accel=on", pdev->name);
> +            exit(1);
> +        }
> +    }
>  
> -    return &sdev->as;
> +    /*
> +     * We return the system address for vfio-pci devices(with iommufd as
> +     * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> +     * guest RAM. This is needed because, in the accelerated SMMUv3 case,
> +     * the host SMMUv3 runs in nested (S1 + S2)  mode where the guest
> +     * manages its own S1 page tables while the host manages S2.
> +     *
> +     * We are using the global &address_space_memory here, as this will ensure
> +     * same system address space pointer for all devices behind the accelerated
> +     * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
> +     * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
> +     * within the VM instead of duplicating them for every SMMUv3 instance.
> +     */
> +    if (vfio_pci) {
> +        return &address_space_memory;
From that comment one understands the need of a single and common AS.
However it is not obvious why it shall be &address_space_memory and not
an AS created on purpose.

Eric

> +    } else {
> +        return &sdev->as;
> +    }
>  }
>  
>  static const PCIIOMMUOps smmuv3_accel_ops = {
> diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
> index 1bcceddbc4..a8eb2d2426 100644
> --- a/hw/pci-bridge/pci_expander_bridge.c
> +++ b/hw/pci-bridge/pci_expander_bridge.c
> @@ -48,7 +48,6 @@ struct PXBBus {
>      char bus_path[8];
>  };
>  
> -#define TYPE_PXB_PCIE_DEV "pxb-pcie"
>  OBJECT_DECLARE_SIMPLE_TYPE(PXBPCIEDev, PXB_PCIE_DEV)
>  
>  static GList *pxb_dev_list;
> diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
> index a055fd8d32..b61360b900 100644
> --- a/include/hw/pci/pci_bridge.h
> +++ b/include/hw/pci/pci_bridge.h
> @@ -106,6 +106,7 @@ typedef struct PXBPCIEDev {
>  
>  #define TYPE_PXB_PCIE_BUS "pxb-pcie-bus"
>  #define TYPE_PXB_CXL_BUS "pxb-cxl-bus"
> +#define TYPE_PXB_PCIE_DEV "pxb-pcie"
>  #define TYPE_PXB_DEV "pxb"
>  OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV)
>
Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Nicolin Chen 3 weeks, 4 days ago
On Mon, Oct 20, 2025 at 06:31:38PM +0200, Eric Auger wrote:
> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> > +    /*
> > +     * We return the system address for vfio-pci devices(with iommufd as
> > +     * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> > +     * guest RAM. This is needed because, in the accelerated SMMUv3 case,
> > +     * the host SMMUv3 runs in nested (S1 + S2)  mode where the guest
> > +     * manages its own S1 page tables while the host manages S2.
> > +     *
> > +     * We are using the global &address_space_memory here, as this will ensure
> > +     * same system address space pointer for all devices behind the accelerated
> > +     * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
> > +     * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
> > +     * within the VM instead of duplicating them for every SMMUv3 instance.
> > +     */
> > +    if (vfio_pci) {
> > +        return &address_space_memory;
> From that comment one understands the need of a single and common AS.
> However it is not obvious why it shall be
> 
> &address_space_memory and not an AS created on purpose.

We tried creating an AS, but it was not straightforward to share
across vSMMU instances, as most of the structures are per vSMMU.

Only SMMUv3Class seems to be shared across vSMMU instances, but
it doesn't seem to be a good place to hold an AS pointer either.

The global @address_space_memory is provisioned as the system AS,
so it's easy to use.

Perhaps we could add a couple of lines to the comments.

Thanks
Nicolin
RE: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 3 weeks, 4 days ago

> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: 20 October 2025 19:25
> 
> On Mon, Oct 20, 2025 at 06:31:38PM +0200, Eric Auger wrote:
> > On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> > > +    /*
> > > +     * We return the system address for vfio-pci devices(with iommufd as
> > > +     * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> > > +     * guest RAM. This is needed because, in the accelerated SMMUv3
> case,
> > > +     * the host SMMUv3 runs in nested (S1 + S2)  mode where the guest
> > > +     * manages its own S1 page tables while the host manages S2.
> > > +     *
> > > +     * We are using the global &address_space_memory here, as this will
> ensure
> > > +     * same system address space pointer for all devices behind the
> accelerated
> > > +     * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS
> ID in
> > > +     * iommufd_cdev_attach(), allowing the Stage-2 page tables to be
> shared
> > > +     * within the VM instead of duplicating them for every SMMUv3
> instance.
> > > +     */
> > > +    if (vfio_pci) {
> > > +        return &address_space_memory;
> > From that comment one understands the need of a single and common AS.
> > However it is not obvious why it shall be
> >
> > &address_space_memory and not an AS created on purpose.
> 
> We tried creating an AS, but it was not straightforward to share across vSMMU
> instances, as most of the structures are per vSMMU.
> 
> Only SMMUv3Class seems to be shared across vSMMU instances, but it
> doesn't seem to be the good place to hold an AS pointer either.
> 
> The global @address_space_memory is provisioned as the system AS, so it's
> easy to use.

We had discussed this previously here,
https://lore.kernel.org/qemu-devel/aJKn650gOGQh2whD@Asurada-Nvidia/

Thanks,
Shameer
Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Eric Auger 3 weeks, 3 days ago

On 10/20/25 8:59 PM, Shameer Kolothum wrote:
>
>> -----Original Message-----
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Sent: 20 October 2025 19:25
>>
>> On Mon, Oct 20, 2025 at 06:31:38PM +0200, Eric Auger wrote:
>>> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
>>>> +    /*
>>>> +     * We return the system address for vfio-pci devices(with iommufd as
>>>> +     * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
>>>> +     * guest RAM. This is needed because, in the accelerated SMMUv3
>> case,
>>>> +     * the host SMMUv3 runs in nested (S1 + S2)  mode where the guest
>>>> +     * manages its own S1 page tables while the host manages S2.
>>>> +     *
>>>> +     * We are using the global &address_space_memory here, as this will
>> ensure
>>>> +     * same system address space pointer for all devices behind the
>> accelerated
>>>> +     * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS
>> ID in
>>>> +     * iommufd_cdev_attach(), allowing the Stage-2 page tables to be
>> shared
>>>> +     * within the VM instead of duplicating them for every SMMUv3
>> instance.
>>>> +     */
>>>> +    if (vfio_pci) {
>>>> +        return &address_space_memory;
>>> From that comment one understands the need of a single and common AS.
>>> However it is not obvious why it shall be
>>>
>>> &address_space_memory and not an AS created on purpose.
>> We tried creating an AS, but it was not straightforward to share across vSMMU
>> instances, as most of the structures are per vSMMU.
>>
>> Only SMMUv3Class seems to be shared across vSMMU instances, but it
>> doesn't seem to be the good place to hold an AS pointer either.
>>
>> The global @address_space_memory is provisioned as the system AS, so it's
>> easy to use.
> We had discussed this previously here,
> https://lore.kernel.org/qemu-devel/aJKn650gOGQh2whD@Asurada-Nvidia/

Thank you for the pointer. I definitely missed that thread. It seems
Peter was not very keen on using address_space_memory either. Why
can't we use a global variable in smmuv3-accel.c? In hw/vfio/container.c
there is a list of VFIOAddressSpace, for instance.
Is there anything wrong with doing that?
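
Something along these lines, as a completely untested sketch just to
illustrate the idea (all names below are made up):

static MemoryRegion smmuv3_accel_shared_mr;
static AddressSpace smmuv3_accel_shared_as;

/*
 * Untested sketch: one AS private to smmuv3-accel.c, aliasing system
 * memory and lazily initialized on first use, so every vSMMU instance
 * hands out the same pointer.
 */
static AddressSpace *smmuv3_accel_get_shared_as(void)
{
    static bool initialized;

    if (!initialized) {
        memory_region_init_alias(&smmuv3_accel_shared_mr, NULL,
                                 "smmuv3-accel-sysmem",
                                 get_system_memory(), 0, UINT64_MAX);
        address_space_init(&smmuv3_accel_shared_as, &smmuv3_accel_shared_mr,
                           "smmuv3-accel-as");
        initialized = true;
    }
    return &smmuv3_accel_shared_as;
}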

Thanks

Eric
>
> Thanks,
> Shameer
>
Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Eric Auger 1 month, 2 weeks ago
Hi Shameer,

On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> Accelerated SMMUv3 is only useful when the device can take advantage of
> the host's SMMUv3 in nested mode. To keep things simple and correct, we
> only allow this feature for vfio-pci endpoint devices that use the iommufd
> backend. We also allow non-endpoint emulated devices like PCI bridges and
> root ports, so that users can plug in these vfio-pci devices. We can only
> enforce this if devices are cold plugged. For hotplug cases, give appropriate

"We can only enforce this if devices are cold plugged": I don't really understand that statement. you do checks when the device is hotplugged too. For emulated device you eventually allow them but you could decide to reject them? 

> warnings.
>
> Another reason for this limit is to avoid problems with IOTLB
> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
> SID, making it difficult to trace the originating device. If we allowed
> emulated endpoint devices, QEMU would have to invalidate both its own
> software IOTLB and the host's hardware IOTLB, which could slow things
> down.
>
> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> translation (S1+S2), their get_address_space() callback must return the
> system address space so that VFIO core can setup correct S2 mappings
> for guest RAM.
>
> So in short:
>  - vfio-pci devices(with iommufd as backend) return the system address
>    space.
>  - bridges and root ports return the IOMMU address space.
>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/arm/smmuv3-accel.c               | 68 ++++++++++++++++++++++++++++-
>  hw/pci-bridge/pci_expander_bridge.c |  1 -
>  include/hw/pci/pci_bridge.h         |  1 +
>  3 files changed, 68 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 79f1713be6..44410cfb2a 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -7,8 +7,13 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/error-report.h"
>  
>  #include "hw/arm/smmuv3.h"
> +#include "hw/pci/pci_bridge.h"
> +#include "hw/pci-host/gpex.h"
> +#include "hw/vfio/pci.h"
> +
>  #include "smmuv3-accel.h"
>  
>  static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
> @@ -29,15 +34,76 @@ static SMMUv3AccelDevice *smmuv3_accel_get_dev(SMMUState *bs, SMMUPciBus *sbus,
>      return accel_dev;
>  }
>  
> +static bool smmuv3_accel_pdev_allowed(PCIDevice *pdev, bool *vfio_pci)
> +{
> +
> +    if (object_dynamic_cast(OBJECT(pdev), TYPE_PCI_BRIDGE) ||
> +        object_dynamic_cast(OBJECT(pdev), TYPE_PXB_PCIE_DEV) ||
> +        object_dynamic_cast(OBJECT(pdev), TYPE_GPEX_ROOT_DEVICE)) {
> +        return true;
> +    } else if ((object_dynamic_cast(OBJECT(pdev), TYPE_VFIO_PCI))) {
> +        *vfio_pci = true;
> +        if (object_property_get_link(OBJECT(pdev), "iommufd", NULL)) {
> +            return true;
> +        }
> +    }
> +    return false;
> +}
> +
>  static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
>                                                int devfn)
>  {
> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>      SMMUState *bs = opaque;
>      SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
>      SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
>      SMMUDevice *sdev = &accel_dev->sdev;
> +    bool vfio_pci = false;
> +
> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> +        if (DEVICE(pdev)->hotplugged) {
> +            if (vfio_pci) {
> +                warn_report("Hot plugging a vfio-pci device (%s) without "
> +                            "iommufd as backend is not supported", pdev->name);
with accelerated SMMUv3.

Why don't we return NULL and properly handle this in the caller? It may be
worth adding an errp to get_address_space(). I know this is cumbersome
though.
> +            } else {
> +                warn_report("Hot plugging an emulated device %s with "
> +                            "accelerated SMMUv3. This will bring down "
> +                            "performace", pdev->name);
performance
> +            }
> +            /*
> +             * Both cases, we will return IOMMU address space. For hotplugged
In both cases?
> +             * vfio-pci dev without iommufd as backend, it will fail later in
> +             * smmuv3_notify_flag_changed() with "requires iommu MAP notifier"
> +             * error message.
> +             */
> +             return &sdev->as;
> +        } else {
> +            error_report("Device(%s) not allowed. Only PCIe root complex "
device (%s)
> +                         "devices or PCI bridge devices or vfio-pci endpoint "
> +                         "devices with iommufd as backend is allowed with "
> +                         "arm-smmuv3,accel=on", pdev->name);
> +            exit(1);
> +        }
> +    }
>  
> -    return &sdev->as;
> +    /*
> +     * We return the system address for vfio-pci devices(with iommufd as
> +     * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> +     * guest RAM. This is needed because, in the accelerated SMMUv3 case,
> +     * the host SMMUv3 runs in nested (S1 + S2)  mode where the guest
> +     * manages its own S1 page tables while the host manages S2.
> +     *
> +     * We are using the global &address_space_memory here, as this will ensure
> +     * same system address space pointer for all devices behind the accelerated
> +     * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
> +     * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
> +     * within the VM instead of duplicating them for every SMMUv3 instance.
> +     */
> +    if (vfio_pci) {
> +        return &address_space_memory;
> +    } else {
> +        return &sdev->as;
> +    }
>  }
>  
>  static const PCIIOMMUOps smmuv3_accel_ops = {
> diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
> index 1bcceddbc4..a8eb2d2426 100644
> --- a/hw/pci-bridge/pci_expander_bridge.c
> +++ b/hw/pci-bridge/pci_expander_bridge.c
> @@ -48,7 +48,6 @@ struct PXBBus {
>      char bus_path[8];
>  };
>  
> -#define TYPE_PXB_PCIE_DEV "pxb-pcie"
>  OBJECT_DECLARE_SIMPLE_TYPE(PXBPCIEDev, PXB_PCIE_DEV)
>  
>  static GList *pxb_dev_list;
> diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
> index a055fd8d32..b61360b900 100644
> --- a/include/hw/pci/pci_bridge.h
> +++ b/include/hw/pci/pci_bridge.h
> @@ -106,6 +106,7 @@ typedef struct PXBPCIEDev {
>  
>  #define TYPE_PXB_PCIE_BUS "pxb-pcie-bus"
>  #define TYPE_PXB_CXL_BUS "pxb-cxl-bus"
> +#define TYPE_PXB_PCIE_DEV "pxb-pcie"
I agree with Nicolin, you should rather move that change into a separate patch.
>  #define TYPE_PXB_DEV "pxb"
>  OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV)
>  
Thanks

Eric
RE: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 1 month, 1 week ago
Hi Eric,

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 01 October 2025 18:32
> 
> Hi Shameer,
> 
> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> > Accelerated SMMUv3 is only useful when the device can take advantage
> > of the host's SMMUv3 in nested mode. To keep things simple and
> > correct, we only allow this feature for vfio-pci endpoint devices that
> > use the iommufd backend. We also allow non-endpoint emulated devices
> > like PCI bridges and root ports, so that users can plug in these
> > vfio-pci devices. We can only enforce this if devices are cold
> > plugged. For hotplug cases, give appropriate
> 
> "We can only enforce this if devices are cold plugged": I don't really
> understand that statement.

By "enforce" here I meant, we can prevent user from starting a Guest 
with a non "vfio-pci/iommufd dev" with accel=one case.  

> you do checks when the device is hotplugged too.
> For emulated device you eventually allow them but you could decide to reject
> them?

Currently get_address_space() is a " Mandatory callback which returns a pointer
to an #AddressSpace". Changing that and propagating an error all the way, as 
you said below, is not that straightforward. At present we warn the user
appropriately for both vfio-pci without iommufd and emulated device hot plug
cases. Perhaps, if required, the error handling can be taken up as a clean-up series
later?

Also, I think I need to explain the emulated device hotplug case a bit more. This
is something I realised later during the tests.

Unfortunately, the hotplug scenario for emulated devices behaves differently.
What I’ve noticed is that the hotplug handler’s call path to get_address_space()
differs from cold-plug cases.

In the emulated device hotplug case, pdev is NULL in the lookup below:
PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);

Here’s what seems to be happening:

do_pci_register_device() {
   ....
    if (phase_check(PHASE_MACHINE_READY)) {
        pci_init_bus_master(pci_dev);
            pci_device_iommu_address_space()  --> get_address_space()
    }
    ....
    bus->devices[devfn] = pci_dev;   //happens only after the above call.
}

For vfio-pci hotplug, we’re fine, since the vfio layer calls get_address_space()
again, with a valid pdev.

For cold-plug cases, the if (phase_check(PHASE_MACHINE_READY)) check is
false, and the call path looks like this:

pcibus_machine_done()
   pci_init_bus_master(pci_dev);
       pci_device_iommu_address_space()  --> get_address_space()

By then we have a valid pdev.

I’m not sure there’s an easy fix here. One option could be to modify
get_address_space() to take pci_dev as input. Or we could change the 
call path order above.
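
To illustrate the first option, here is a rough sketch (untested, and not
what this series does) of how the callback could take the PCIDevice
directly, so the pci_find_device() lookup above is no longer needed:

struct PCIIOMMUOps {
    /*
     * Sketch only: pass the PCIDevice in, so we don't depend on
     * bus->devices[devfn] being populated yet at hotplug time.
     */
    AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque,
                                        int devfn, PCIDevice *pdev);
    /* ... other callbacks unchanged ... */
};

static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
                                              int devfn, PCIDevice *pdev)
{
    /* pdev would be valid here even in the hotplug path */
}

But every get_address_space() implementation and caller would need the same
update, so changing the call path order looks simpler.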

(See my reply below on the emulated dev warn_report() case as well.)

Please let me know your thoughts.

> > warnings.
> >

[...]

> > +
> > +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> > +        if (DEVICE(pdev)->hotplugged) {
> > +            if (vfio_pci) {
> > +                warn_report("Hot plugging a vfio-pci device (%s) without "
> > +                            "iommufd as backend is not supported",
> > + pdev->name);
> with accelerated SMMUv3.
> 
> why don't we return NULL and properly handle this in the caller. May be worth
> adding an errp to get_address_space(). I know this is cumbersome though.

See above reply on propagating err from this callback.

> > +            } else {
> > +                warn_report("Hot plugging an emulated device %s with "
> > +                            "accelerated SMMUv3. This will bring down "
> > +                            "performace", pdev->name);
> performance
> > +            }

As I mentioned above, since pdev is NULL in the emulated dev hotplug case,
we will not hit the above warning. 

> > +            /*
> > +             * Both cases, we will return IOMMU address space. For
> > + hotplugged
> In both cases?

Yes, since we can't return NULL here. However, as done here, we will inform
the user appropriately.

> > +             * vfio-pci dev without iommufd as backend, it will fail later in
> > +             * smmuv3_notify_flag_changed() with "requires iommu MAP
> notifier"

[...]

> > +#define TYPE_PXB_PCIE_DEV "pxb-pcie"
> I agree with Nicolin, you shall rather move that change in a seperate patch.

I thought of mentioning this change in the commit log (which I missed) and
avoiding a separate patch just for this. But if you guys feel strongly, I will
have a separate one.

Thanks,
Shameer
Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Eric Auger 4 weeks ago
Hi Shameer,

On 10/2/25 11:30 AM, Shameer Kolothum wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 01 October 2025 18:32
>>
>> Hi Shameer,
>>
>> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
>>> Accelerated SMMUv3 is only useful when the device can take advantage
>>> of the host's SMMUv3 in nested mode. To keep things simple and
>>> correct, we only allow this feature for vfio-pci endpoint devices that
>>> use the iommufd backend. We also allow non-endpoint emulated devices
>>> like PCI bridges and root ports, so that users can plug in these
>>> vfio-pci devices. We can only enforce this if devices are cold
>>> plugged. For hotplug cases, give appropriate
>> "We can only enforce this if devices are cold plugged": I don't really
>> understand that statement.
> By "enforce" here I meant, we can prevent user from starting a Guest 
> with a non "vfio-pci/iommufd dev" with accel=one case.  
Ah OK, I misread the code. I thought you were also exiting in case of
hotplug, but you only issue a warn_report.
From a user point of view, the assigned device will attach successfully
but won't work. Will we get subsequent messages? I understand the pain
of propagating the error, but if the user experience is bad I think that
should take precedence?
>
>  you do checks when the device is hotplugged too.
>> For emulated device you eventually allow them but you could decide to reject
>> them?
> Currently get_address_space() is a " Mandatory callback which returns a pointer
> to an #AddressSpace". Changing that and propagating an error all the way, as 
> you said below, is not that straightforward. At present we warn the user
> appropriately for both vfio-pci without iommufd and emulated device hot plug
> cases. Perhaps, if required, the error handling can be taken up as a clean-up series
> later?
>
> Also, I think I need to explain the emulated device hotplug case a bit more. This
> is something I realised later during the tests.
>
> Unfortunately, the hotplug scenario for emulated devices behaves differently.
> What I’ve noticed is that the hotplug handler’s call path to get_address_space()
> differs from cold-plug cases.
>
> In the emulated device hotplug case, the pdev is NULL for below:
> PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>
> Here’s what seems to be happening:
>
> do_pci_register_device() {
>    ....
>     if (phase_check(PHASE_MACHINE_READY)) {
>         pci_init_bus_master(pci_dev);
>             pci_device_iommu_address_space()  --> get_address_space()
>     }
>     ....
>     bus->devices[devfn] = pci_dev;   //happens only after the above call.
> }
>
> For vfio-pci hotplug, we’re fine, since the vfio layer calls get_address_space()
> again, with a valid pdev.
>
> For cold-plug cases, the if (phase_check(PHASE_MACHINE_READY)) check is
> false, and the call path looks like this:
>
> pcibus_machine_done()
>    pci_init_bus_master(pci_dev);
>        pci_device_iommu_address_space()  --> get_address_space()
>
> By then we have a valid pdev.
>
> I’m not sure there’s an easy fix here. One option could be to modify
> get_address_space() to take pci_dev as input. Or we could change the 
> call path order above.
>
> (See my below reply to emulated dev warn_report() case as well)
>
> Please let me know your thoughts.
Can't you move the assignment of bus->devices[devfn] before the call and
unset it in case of failure?

Or if you propagate errors from get_address_space(), you could retry the
call later?

Eric

>
>>> warnings.
>>>
> [...]
>
>>> +
>>> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
>>> +        if (DEVICE(pdev)->hotplugged) {
>>> +            if (vfio_pci) {
>>> +                warn_report("Hot plugging a vfio-pci device (%s) without "
>>> +                            "iommufd as backend is not supported",
>>> + pdev->name);
>> with accelerated SMMUv3.
>>
>> why don't we return NULL and properly handle this in the caller. May be worth
>> adding an errp to get_address_space(). I know this is cumbersome though.
> See above reply on propagating err from this callback.
>
>>> +            } else {
>>> +                warn_report("Hot plugging an emulated device %s with "
>>> +                            "accelerated SMMUv3. This will bring down "
>>> +                            "performace", pdev->name);
>> performance
>>> +            }
> As I mentioned above, since the pdev for emulated dev hotplug case is NULL,
> we will not hit the above warning. 
>
>>> +            /*
>>> +             * Both cases, we will return IOMMU address space. For
>>> + hotplugged
>> In both cases?
> Yes, since we can't return NULL here. However, as done here, we will inform
> the user appropriately.
>
>>> +             * vfio-pci dev without iommufd as backend, it will fail later in
>>> +             * smmuv3_notify_flag_changed() with "requires iommu MAP
>> notifier"
> [...]
>
>>> +#define TYPE_PXB_PCIE_DEV "pxb-pcie"
>> I agree with Nicolin, you shall rather move that change in a seperate patch.
> I thought of mentioning this change in the commit log(which I missed) and
> avoiding a separate patch just for this. But if you guys feel strongly, I will
> have a separate one.
>
> Thanks,
> Shameer


RE: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 4 weeks ago
Hi Eric,

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 17 October 2025 13:47
> 
> Hi Shameer,
> 
> On 10/2/25 11:30 AM, Shameer Kolothum wrote:
> > Hi Eric,
> >
> >> -----Original Message-----
> >> From: Eric Auger <eric.auger@redhat.com>
> >> Sent: 01 October 2025 18:32
> >>
> >> Hi Shameer,
> >>
> >> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
> >>> Accelerated SMMUv3 is only useful when the device can take advantage
> >>> of the host's SMMUv3 in nested mode. To keep things simple and
> >>> correct, we only allow this feature for vfio-pci endpoint devices that
> >>> use the iommufd backend. We also allow non-endpoint emulated devices
> >>> like PCI bridges and root ports, so that users can plug in these
> >>> vfio-pci devices. We can only enforce this if devices are cold
> >>> plugged. For hotplug cases, give appropriate
> >> "We can only enforce this if devices are cold plugged": I don't really
> >> understand that statement.
> > By "enforce" here I meant, we can prevent user from starting a Guest
> > with a non "vfio-pci/iommufd dev" with accel=one case.
> Ah OK I misread the code. I thought you were also exiting in case of
> hotplug but you only issue a warn_report.
> From a user point of view, the assigned device will succeed attachment
> but won't work. Will we get subsequent messages? 

It will work. But as the warning says, it may degrade performance, especially
if the SMMUv3 has other vfio-pci devices, because the TLB invalidations
from the emulated one will be issued to the host SMMUv3 as well.

> I understand the pain
> of propagating the error but if the user experience is bad I think it
> should weight over ?

I am not against it, but it can be taken up as a separate one if required.

> >
> > Please let me know your thoughts.
> Can't you move the assignment of bus->devices[devfn] before the call and
> unset it in case of failure?
> 
> Or if you propagate errors from
> 
> get_address_space() you could retry the call later?

For now, I have a fix like below that seems to do the job.

diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index c9932c87e3..9693d7f10c 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -1370,9 +1370,6 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
     pci_dev->bus_master_as.max_bounce_buffer_size =
         pci_dev->max_bounce_buffer_size;

-    if (phase_check(PHASE_MACHINE_READY)) {
-        pci_init_bus_master(pci_dev);
-    }
     pci_dev->irq_state = 0;
     pci_config_alloc(pci_dev);

@@ -1416,6 +1413,9 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
     pci_dev->config_write = config_write;
     bus->devices[devfn] = pci_dev;
     pci_dev->version_id = 2; /* Current pci device vmstate version */
+    if (phase_check(PHASE_MACHINE_READY)) {
+        pci_init_bus_master(pci_dev);
+    }
     return pci_dev;
 }

Thanks,
Shameer
Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Eric Auger 4 weeks ago

On 10/17/25 3:15 PM, Shameer Kolothum wrote:
> Hi Eric,
>
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 17 October 2025 13:47
>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org
>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> shameerkolothum@gmail.com
>> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
>> SMMUv3 to vfio-pci endpoints with iommufd
>>
>> External email: Use caution opening links or attachments
>>
>>
>> Hi Shameer,
>>
>> On 10/2/25 11:30 AM, Shameer Kolothum wrote:
>>> Hi Eric,
>>>
>>>> -----Original Message-----
>>>> From: Eric Auger <eric.auger@redhat.com>
>>>> Sent: 01 October 2025 18:32
>>>> External email: Use caution opening links or attachments
>>>>
>>>> Hi Shameer,
>>>>
>>>> On 9/29/25 3:36 PM, Shameer Kolothum wrote:
>>>>> Accelerated SMMUv3 is only useful when the device can take advantage
>>>>> of the host's SMMUv3 in nested mode. To keep things simple and
>>>>> correct, we only allow this feature for vfio-pci endpoint devices that
>>>>> use the iommufd backend. We also allow non-endpoint emulated devices
>>>>> like PCI bridges and root ports, so that users can plug in these
>>>>> vfio-pci devices. We can only enforce this if devices are cold
>>>>> plugged. For hotplug cases, give appropriate
>>>> "We can only enforce this if devices are cold plugged": I don't really
>>>> understand that statement.
>>> By "enforce" here I meant, we can prevent user from starting a Guest
>>> with a non "vfio-pci/iommufd dev" with accel=one case.
>> Ah OK I misread the code. I thought you were also exiting in case of
>> hotplug but you only issue a warn_report.
>> From a user point of view, the assigned device will succeed attachment
>> but won't work. Will we get subsequent messages? 
> It will work. But as the warning says, it may degrade the performance especially
> If the SMMUv3 has other vfio-pci devices. Because the TLB invalidations
> from emulated one will be issued to host SMMUv3 as well.
>
>  I understand the pain
>> of propagating the error but if the user experience is bad I think it
>> should weight over ?
> I am not against it. But can be taken up as a separate one if required.
>
>>> Please let me know your thoughts.
>> Can't you move the assignment of bus->devices[devfn] before the call and
>> unset it in case of failure?
>>
>> Or if you propagate errors from
>>
>> get_address_space() you could retry the call later?
> For now, I have a fix like below that seems to do the
> Job.
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index c9932c87e3..9693d7f10c 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -1370,9 +1370,6 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
>      pci_dev->bus_master_as.max_bounce_buffer_size =
>          pci_dev->max_bounce_buffer_size;
>
> -    if (phase_check(PHASE_MACHINE_READY)) {
> -        pci_init_bus_master(pci_dev);
> -    }
>      pci_dev->irq_state = 0;
>      pci_config_alloc(pci_dev);
>
> @@ -1416,6 +1413,9 @@ static PCIDevice *do_pci_register_device(PCIDevice *pci_dev,
>      pci_dev->config_write = config_write;
>      bus->devices[devfn] = pci_dev;
>      pci_dev->version_id = 2; /* Current pci device vmstate version */
> +    if (phase_check(PHASE_MACHINE_READY)) {
> +        pci_init_bus_master(pci_dev);
> +    }
>      return pci_dev;

OK, worth putting it in a separate patch to allow finer review by the PCI
maintainers.

Thanks

Eric
>  }
>
> Thanks,
> Shameer
Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Nicolin Chen 1 month, 2 weeks ago
On Mon, Sep 29, 2025 at 02:36:22PM +0100, Shameer Kolothum wrote:
> Accelerated SMMUv3 is only useful when the device can take advantage of
> the host's SMMUv3 in nested mode. To keep things simple and correct, we
> only allow this feature for vfio-pci endpoint devices that use the iommufd
> backend. We also allow non-endpoint emulated devices like PCI bridges and
> root ports, so that users can plug in these vfio-pci devices. We can only
> enforce this if devices are cold plugged. For hotplug cases, give appropriate
> warnings.
> 
> Another reason for this limit is to avoid problems with IOTLB
> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
> SID, making it difficult to trace the originating device. If we allowed
> emulated endpoint devices, QEMU would have to invalidate both its own
> software IOTLB and the host's hardware IOTLB, which could slow things
> down.
> 
> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> translation (S1+S2), their get_address_space() callback must return the
> system address space so that VFIO core can setup correct S2 mappings
> for guest RAM.
> 
> So in short:
>  - vfio-pci devices(with iommufd as backend) return the system address
>    space.
>  - bridges and root ports return the IOMMU address space.
> 
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>

With some nits:

> +    /*
> +     * We return the system address for vfio-pci devices(with iommufd as
> +     * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> +     * guest RAM. This is needed because, in the accelerated SMMUv3 case,
> +     * the host SMMUv3 runs in nested (S1 + S2)  mode where the guest
> +     * manages its own S1 page tables while the host manages S2.
> +     *
> +     * We are using the global &address_space_memory here, as this will ensure
> +     * same system address space pointer for all devices behind the accelerated
> +     * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
> +     * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
> +     * within the VM instead of duplicating them for every SMMUv3 instance.
> +     */
> +    if (vfio_pci) {
> +        return &address_space_memory;

How about:

    /*
     * In the accelerated case, a vfio-pci device passed through via the iommufd
     * backend must stay in the system address space, as it is always translated
     * by its physical SMMU (using a stage-2-only STE or a nested STE), in which
     * case the stage-2 nesting parent page table is allocated by the vfio core,
     * backing up the system address space.
     *
     * So, return the system address space via the global address_space_memory.
     * The shared address_space_memory also allows devices under different vSMMU
     * instances in a VM to reuse a single nesting parent HWPT in the vfio core.
     */
?

And I think this would be clearer by having get_viommu_flags() in
this patch.

> diff --git a/hw/pci-bridge/pci_expander_bridge.c b/hw/pci-bridge/pci_expander_bridge.c
> index 1bcceddbc4..a8eb2d2426 100644
> --- a/hw/pci-bridge/pci_expander_bridge.c
> +++ b/hw/pci-bridge/pci_expander_bridge.c
> @@ -48,7 +48,6 @@ struct PXBBus {
>      char bus_path[8];
>  };
>  
> -#define TYPE_PXB_PCIE_DEV "pxb-pcie"
>  OBJECT_DECLARE_SIMPLE_TYPE(PXBPCIEDev, PXB_PCIE_DEV)
>  
>  static GList *pxb_dev_list;
> diff --git a/include/hw/pci/pci_bridge.h b/include/hw/pci/pci_bridge.h
> index a055fd8d32..b61360b900 100644
> --- a/include/hw/pci/pci_bridge.h
> +++ b/include/hw/pci/pci_bridge.h
> @@ -106,6 +106,7 @@ typedef struct PXBPCIEDev {
>  
>  #define TYPE_PXB_PCIE_BUS "pxb-pcie-bus"
>  #define TYPE_PXB_CXL_BUS "pxb-cxl-bus"
> +#define TYPE_PXB_PCIE_DEV "pxb-pcie"
>  #define TYPE_PXB_DEV "pxb"
>  OBJECT_DECLARE_SIMPLE_TYPE(PXBDev, PXB_DEV)

Maybe this can be a patch itself.

Nicolin
RE: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 1 month, 1 week ago
Hi Nicolin,

> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: 30 September 2025 01:11
> 
> On Mon, Sep 29, 2025 at 02:36:22PM +0100, Shameer Kolothum wrote:
> > Accelerated SMMUv3 is only useful when the device can take advantage of
> > the host's SMMUv3 in nested mode. To keep things simple and correct, we
> > only allow this feature for vfio-pci endpoint devices that use the iommufd
> > backend. We also allow non-endpoint emulated devices like PCI bridges and
> > root ports, so that users can plug in these vfio-pci devices. We can only
> > enforce this if devices are cold plugged. For hotplug cases, give appropriate
> > warnings.
> >
> > Another reason for this limit is to avoid problems with IOTLB
> > invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an
> associated
> > SID, making it difficult to trace the originating device. If we allowed
> > emulated endpoint devices, QEMU would have to invalidate both its own
> > software IOTLB and the host's hardware IOTLB, which could slow things
> > down.
> >
> > Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> > translation (S1+S2), their get_address_space() callback must return the
> > system address space so that VFIO core can setup correct S2 mappings
> > for guest RAM.
> >
> > So in short:
> >  - vfio-pci devices(with iommufd as backend) return the system address
> >    space.
> >  - bridges and root ports return the IOMMU address space.
> >
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> 
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> 
> With some nits:
> 
> > +    /*
> > +     * We return the system address for vfio-pci devices(with iommufd as
> > +     * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> > +     * guest RAM. This is needed because, in the accelerated SMMUv3 case,
> > +     * the host SMMUv3 runs in nested (S1 + S2)  mode where the guest
> > +     * manages its own S1 page tables while the host manages S2.
> > +     *
> > +     * We are using the global &address_space_memory here, as this will
> ensure
> > +     * same system address space pointer for all devices behind the
> accelerated
> > +     * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID
> in
> > +     * iommufd_cdev_attach(), allowing the Stage-2 page tables to be
> shared
> > +     * within the VM instead of duplicating them for every SMMUv3
> instance.
> > +     */
> > +    if (vfio_pci) {
> > +        return &address_space_memory;
> 
> How about:
> 
>     /*
>      * In the accelerated case, a vfio-pci device passed through via the iommufd
>      * backend must stay in the system address space, as it is always translated
>      * by its physical SMMU (using a stage-2-only STE or a nested STE), in which
>      * case the stage-2 nesting parent page table is allocated by the vfio core,
>      * backing up the system address space.
>      *
>      * So, return the system address space via the global
> address_space_memory.
>      * The shared address_space_memory also allows devices under different
> vSMMU
>      * instances in a VM to reuse a single nesting parent HWPT in the vfio core.
>      */
> ?

Ok. I will go through the descriptions and comments in this series again and 
will try to improve it.

Thanks,
Shameer
Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Jonathan Cameron via 1 month, 2 weeks ago
On Mon, 29 Sep 2025 14:36:22 +0100
Shameer Kolothum <skolothumtho@nvidia.com> wrote:

> Accelerated SMMUv3 is only useful when the device can take advantage of
> the host's SMMUv3 in nested mode. To keep things simple and correct, we
> only allow this feature for vfio-pci endpoint devices that use the iommufd
> backend. We also allow non-endpoint emulated devices like PCI bridges and
> root ports, so that users can plug in these vfio-pci devices. We can only
> enforce this if devices are cold plugged. For hotplug cases, give appropriate
> warnings.
> 
> Another reason for this limit is to avoid problems with IOTLB
> invalidations. Some commands (e.g., CMD_TLBI_NH_ASID) lack an associated
> SID, making it difficult to trace the originating device. If we allowed
> emulated endpoint devices, QEMU would have to invalidate both its own
> software IOTLB and the host's hardware IOTLB, which could slow things
> down.
> 
> Since vfio-pci devices in nested mode rely on the host SMMUv3's nested
> translation (S1+S2), their get_address_space() callback must return the
> system address space so that VFIO core can setup correct S2 mappings
> for guest RAM.
> 
> So in short:
>  - vfio-pci devices(with iommufd as backend) return the system address
>    space.
>  - bridges and root ports return the IOMMU address space.
> 
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
One question that really applies to an earlier patch, and an even more trivial
comment on a comment than the earlier ones ;)

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> ---
>  hw/arm/smmuv3-accel.c               | 68 ++++++++++++++++++++++++++++-
>  hw/pci-bridge/pci_expander_bridge.c |  1 -
>  include/hw/pci/pci_bridge.h         |  1 +
>  3 files changed, 68 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index 79f1713be6..44410cfb2a 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c

>  static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,

I should have noticed this in the previous patch...
What does "add" stand for here? This name is not particularly clear to me.

>                                                int devfn)
>  {
> +    PCIDevice *pdev = pci_find_device(bus, pci_bus_num(bus), devfn);
>      SMMUState *bs = opaque;
>      SMMUPciBus *sbus = smmu_get_sbus(bs, bus);
>      SMMUv3AccelDevice *accel_dev = smmuv3_accel_get_dev(bs, sbus, bus, devfn);
>      SMMUDevice *sdev = &accel_dev->sdev;
> +    bool vfio_pci = false;
> +
> +    if (pdev && !smmuv3_accel_pdev_allowed(pdev, &vfio_pci)) {
> +        if (DEVICE(pdev)->hotplugged) {
> +            if (vfio_pci) {
> +                warn_report("Hot plugging a vfio-pci device (%s) without "
> +                            "iommufd as backend is not supported", pdev->name);
> +            } else {
> +                warn_report("Hot plugging an emulated device %s with "
> +                            "accelerated SMMUv3. This will bring down "
> +                            "performace", pdev->name);
> +            }
> +            /*
> +             * In both cases, we will return the IOMMU address space. For a
> +             * hotplugged vfio-pci dev without iommufd as backend, it will fail
> +             * later in smmuv3_notify_flag_changed() with a "requires iommu MAP
> +             * notifier" error message.
> +             */
> +             return &sdev->as;
> +        } else {
> +            error_report("Device(%s) not allowed. Only PCIe root complex "
> +                         "devices or PCI bridge devices or vfio-pci endpoint "
> +                         "devices with iommufd as backend is allowed with "
> +                         "arm-smmuv3,accel=on", pdev->name);
> +            exit(1);
> +        }
> +    }
>  
> -    return &sdev->as;
> +    /*
> +     * We return the system address for vfio-pci devices(with iommufd as
> +     * backend) so that the VFIO core can set up Stage-2 (S2) mappings for
> +     * guest RAM. This is needed because, in the accelerated SMMUv3 case,
> +     * the host SMMUv3 runs in nested (S1 + S2)  mode where the guest
> +     * manages its own S1 page tables while the host manages S2.
> +     *
> +     * We are using the global &address_space_memory here, as this will ensure
> +     * same system address space pointer for all devices behind the accelerated
> +     * SMMUv3s in a VM. That way VFIO/iommufd can reuse a single IOAS ID in
> +     * iommufd_cdev_attach(), allowing the Stage-2 page tables to be shared
> +     * within the VM instead of duplicating them for every SMMUv3 instance.

These two paragraphs are definitely not wrapped to the same line length.
It would be nice to tidy that up.

> +     */
> +    if (vfio_pci) {
> +        return &address_space_memory;
> +    } else {
> +        return &sdev->as;
> +    }
>  }
>  
>  static const PCIIOMMUOps smmuv3_accel_ops = {

>
RE: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 1 month, 2 weeks ago

> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: 29 September 2025 17:09
> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
> SMMUv3 to vfio-pci endpoints with iommufd
> 
> On Mon, 29 Sep 2025 14:36:22 +0100
> Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> 
> > [...]
> One question that really applies to an earlier patch, and an even more trivial
> comment on a comment than the earlier ones ;)
> 
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> 
> > [...]
> 
> >  static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
> 
> I should have noticed this in the previous patch...
> What does "add" stand for here? This name is not particularly clear to me.

Good question 😊.

I believe the name comes from the smmu-common.c implementation of
get_address_space:

static const PCIIOMMUOps smmu_ops = {
    .get_address_space = smmu_find_add_as,
};
Looking at it again, that version allocates a new MR and creates a
new address space per sdev, so perhaps "add" referred to the address
space creation.

This callback here originally did something similar but no longer does. 
So, I think it’s better to just rename it to smmuv3_accel_get_as()

Thanks,
Shameer
Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Eric Auger 1 month, 2 weeks ago

On 9/30/25 10:03 AM, Shameer Kolothum wrote:
>
>> -----Original Message-----
>> From: Jonathan Cameron <jonathan.cameron@huawei.com>
>> Sent: 29 September 2025 17:09
>> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
>> SMMUv3 to vfio-pci endpoints with iommufd
>>
>> On Mon, 29 Sep 2025 14:36:22 +0100
>> Shameer Kolothum <skolothumtho@nvidia.com> wrote:
>>
>>> [...]
>> One question that really applies to an earlier patch, and an even more trivial
>> comment on a comment than the earlier ones ;)
>>
>> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>>
>>> [...]
>>>  static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
>>
>> I should have noticed this in the previous patch...
>> What does "add" stand for here? This name is not particularly clear to me.
> Good question 😊.
>
> I believe the name comes from the smmu-common.c implementation of
> get_address_space:
>
> static const PCIIOMMUOps smmu_ops = {
>     .get_address_space = smmu_find_add_as,
> };
> Looking at it again, that version allocates a new MR and creates a
> new address space per sdev, so perhaps "add" referred to the address
> space creation.
This stems from the original terminology used in intel-iommu.c
(vtd_find_add_as).

The smmu-common code looks for a registered device corresponding to @bus
and @devfn (this is the 'find'). If it exists, it is returned; otherwise
it allocates a bus and SMMUDevice object according to what exists and
initializes the AddressSpace (this is the 'add').
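
Roughly, the pattern looks like this (a from-memory sketch loosely after
smmu_find_add_as() in smmu-common.c; field and helper names are approximate,
not a verbatim copy):

static AddressSpace *smmu_find_add_as(PCIBus *bus, void *opaque, int devfn)
{
    SMMUState *bs = opaque;
    SMMUPciBus *sbus = smmu_get_sbus(bs, bus);   /* look up or create the bus wrapper */
    SMMUDevice *sdev = sbus->pbdev[devfn];       /* the "find" part */

    if (!sdev) {                                 /* the "add" part: first use of this devfn */
        sdev = sbus->pbdev[devfn] = g_new0(SMMUDevice, 1);
        sdev->smmu = bs;
        sdev->bus = bus;
        sdev->devfn = devfn;
        /* the per-device IOMMU memory region and AddressSpace are created once, here */
        memory_region_init_iommu(&sdev->iommu, sizeof(sdev->iommu),
                                 bs->mrtypename, OBJECT(bs),
                                 "smmu-iommu-mr", UINT64_MAX);
        address_space_init(&sdev->as, MEMORY_REGION(&sdev->iommu), "smmu-as");
    }
    return &sdev->as;
}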


>
> This callback here originally did something similar but no longer does. 
I don't get why it does not do something similar anymore?
> So, I think it’s better to just rename it to smmuv3_accel_get_as()
Well I would prefer we keep the original terminology to match other
viommu code. Except of course if I misunderstood the existing code.

Thanks

Eric
>
> Thanks,
> Shameer


RE: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated SMMUv3 to vfio-pci endpoints with iommufd
Posted by Shameer Kolothum 1 month, 1 week ago

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 01 October 2025 17:39
> Subject: Re: [PATCH v4 06/27] hw/arm/smmuv3-accel: Restrict accelerated
> SMMUv3 to vfio-pci endpoints with iommufd
> 
> On 9/30/25 10:03 AM, Shameer Kolothum wrote:
> >
> >> [...]
> >> On Mon, 29 Sep 2025 14:36:22 +0100
> >> Shameer Kolothum <skolothumtho@nvidia.com> wrote:
> >>
> >>> [...]
> >> One question that really applies to an earlier patch, and an even more trivial
> >> comment on a comment than the earlier ones ;)
> >>
> >> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> >>
> >>> [...]
> >>>  static AddressSpace *smmuv3_accel_find_add_as(PCIBus *bus, void *opaque,
> >>
> >> I should have noticed this in the previous patch...
> >> What does "add" stand for here? This name is not particularly clear to me.
> > Good question 😊.
> >
> > I believe the name comes from the smmu-common.c implementation of
> > get_address_space:
> >
> > static const PCIIOMMUOps smmu_ops = {
> >     .get_address_space = smmu_find_add_as,
> > };
> > Looking at it again, that version allocates a new MR and creates a
> > new address space per sdev, so perhaps "add" referred to the address
> > space creation.
> This stems from the original terminology used in intel-iommu.c
> (vtd_find_add_as).
> 
> The smmu-common code looks for a registered device corresponding to @bus
> and @devfn (this is the 'find'). If it exists, it is returned; otherwise
> it allocates a bus and SMMUDevice object according to what exists and
> initializes the AddressSpace (this is the 'add').

Agree.

> 
> >
> > This callback here originally did something similar but no longer does.
> I don't get why it does not do something similar anymore?

Ok. It does all of the "find" and "add" handling described above.
But for a vfio-pci dev with the IOMMUFD backend, it now returns the global
&address_space_memory. Previously we were creating a separate address
space pointing to system memory for each such device.

We could argue that, in general, what the function does is "get" the
appropriate address space for the device, so we could simply call it
get_dev_address_space().
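
Put differently, with this change every vfio-pci device using iommufd behind
any accelerated vSMMU resolves to the very same AddressSpace object, so
something like the following would hold (vfio_dev_a/vfio_dev_b are
hypothetical PCIDevice pointers for two such devices):

    AddressSpace *as_a = pci_device_iommu_address_space(vfio_dev_a);
    AddressSpace *as_b = pci_device_iommu_address_space(vfio_dev_b);

    /* same pointer, i.e. the single global system address space */
    assert(as_a == as_b && as_a == &address_space_memory);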

> > So, I think it’s better to just rename it to smmuv3_accel_get_as()
> Well I would prefer we keep the original terminology to match other
> viommu code. Except of course if I misunderstood the existing code.

Ok. I will keep the same name then, with a comment to explain the "find"
and "add" parts.

Thanks,
Shameer