On ARM, devices behind an IOMMU have their MSI doorbell addresses
translated by the IOMMU. In nested mode, this translation happens in
two stages (gIOVA → gPA → ITS page).
In accelerated SMMUv3 mode, both stages are handled by hardware, so
get_address_space() returns the system address space so that VFIO
can set up the stage-2 mappings against it.
However, QEMU/KVM also calls this callback when resolving
MSI doorbells:
kvm_irqchip_add_msi_route()
kvm_arch_fixup_msi_route()
pci_device_iommu_address_space()
get_address_space()
A VFIO device in a guest with an SMMUv3 is programmed with a gIOVA for
the MSI doorbell. This gIOVA can't be used to set up the MSI doorbell
directly; it first needs to be translated to the vITS gPA, and doing
that doorbell translation requires the IOMMU address space.
Add an optional get_msi_address_space() callback and use it in this
path to return the correct address space for such cases.
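For example, a vIOMMU that lets hardware handle both stages could wire
this up roughly as below. This is only an illustrative sketch, not part
of this patch; the MyVIOMMU* names and the my_viommu_find_as() helper
are placeholders:

/* Illustrative only: names below are hypothetical */
static AddressSpace *my_viommu_get_msi_address_space(PCIBus *bus,
                                                     void *opaque,
                                                     int devfn)
{
    MyVIOMMUState *s = opaque;

    /* gIOVA -> vITS gPA resolution needs the per-device IOMMU AS */
    return my_viommu_find_as(s, bus, devfn);
}

static const PCIIOMMUOps my_viommu_ops = {
    /* DMA keeps using the system AS so VFIO maps stage-2 */
    .get_address_space     = my_viommu_get_address_space,
    /* MSI doorbell resolution goes through the IOMMU AS */
    .get_msi_address_space = my_viommu_get_msi_address_space,
};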
Cc: Michael S. Tsirkin <mst@redhat.com>
Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/pci/pci.c | 18 ++++++++++++++++++
include/hw/pci/pci.h | 16 ++++++++++++++++
target/arm/kvm.c | 2 +-
3 files changed, 35 insertions(+), 1 deletion(-)
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index fa9cf5dab2..1edd711247 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2982,6 +2982,24 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
return &address_space_memory;
}
+AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
+{
+ PCIBus *bus;
+ PCIBus *iommu_bus;
+ int devfn;
+
+ pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
+ if (iommu_bus) {
+ if (iommu_bus->iommu_ops->get_msi_address_space) {
+ return iommu_bus->iommu_ops->get_msi_address_space(bus,
+ iommu_bus->iommu_opaque, devfn);
+ }
+ return iommu_bus->iommu_ops->get_address_space(bus,
+ iommu_bus->iommu_opaque, devfn);
+ }
+ return &address_space_memory;
+}
+
int pci_iommu_init_iotlb_notifier(PCIDevice *dev, IOMMUNotifier *n,
IOMMUNotify fn, void *opaque)
{
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index dfeba8c9bd..b731443c67 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -664,6 +664,21 @@ typedef struct PCIIOMMUOps {
uint32_t pasid, bool priv_req, bool exec_req,
hwaddr addr, bool lpig, uint16_t prgi, bool is_read,
bool is_write);
+ /**
+ * @get_msi_address_space: get the address space for MSI doorbell address
+ * for devices
+ *
+ * Optional callback which returns a pointer to an #AddressSpace. This is
+ * required if the MSI doorbell is also translated through the vIOMMU (e.g. ARM).
+ *
+ * @bus: the #PCIBus being accessed.
+ *
+ * @opaque: the data passed to pci_setup_iommu().
+ *
+ * @devfn: device and function number
+ */
+ AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
+ int devfn);
} PCIIOMMUOps;
bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
@@ -672,6 +687,7 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
Error **errp);
void pci_device_unset_iommu_device(PCIDevice *dev);
+AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
/**
* pci_device_get_viommu_flags: get vIOMMU flags.
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 0d57081e69..0df41128d0 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1611,7 +1611,7 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
uint64_t address, uint32_t data, PCIDevice *dev)
{
- AddressSpace *as = pci_device_iommu_address_space(dev);
+ AddressSpace *as = pci_device_iommu_msi_address_space(dev);
hwaddr xlat, len, doorbell_gpa;
MemoryRegionSection mrs;
MemoryRegion *mr;
--
2.43.0
Hi Shameer, Nicolin,
On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> On ARM, devices behind an IOMMU have their MSI doorbell addresses
> translated by the IOMMU. In nested mode, this translation happens in
> two stages (gIOVA → gPA → ITS page).
>
> In accelerated SMMUv3 mode, both stages are handled by hardware, so
> get_address_space() returns the system address space so that VFIO
> can set up the stage-2 mappings against it.
Sorry but I still don't follow the above. Can you explain (most probably
again) why it is a requirement to return the system AS so that VFIO
can set up stage-2 mappings for the system address space? I am sorry for
insisting (at the risk of being stubborn or dumb) but I fail to
understand the requirement. As far as I remember, the way I integrated it
at the old times did not require that change:
https://lore.kernel.org/all/20210411120912.15770-1-eric.auger@redhat.com/
I used a vfio_prereg_listener to force the S2 mapping.
What has changed that forces us now to do this gymnastics?
>
> However, QEMU/KVM also calls this callback when resolving
> MSI doorbells:
>
> kvm_irqchip_add_msi_route()
> kvm_arch_fixup_msi_route()
> pci_device_iommu_address_space()
> get_address_space()
>
> A VFIO device in a guest with an SMMUv3 is programmed with a gIOVA for
> the MSI doorbell. This gIOVA can't be used to set up the MSI doorbell
> directly; it first needs to be translated to the vITS gPA, and doing
> that doorbell translation requires the IOMMU address space.
>
> Add an optional get_msi_address_space() callback and use it in this
> path to return the correct address space for such cases.
>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> hw/pci/pci.c | 18 ++++++++++++++++++
> include/hw/pci/pci.h | 16 ++++++++++++++++
> target/arm/kvm.c | 2 +-
> 3 files changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index fa9cf5dab2..1edd711247 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2982,6 +2982,24 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> return &address_space_memory;
> }
>
> +AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
> +{
> + PCIBus *bus;
> + PCIBus *iommu_bus;
> + int devfn;
> +
> + pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
> + if (iommu_bus) {
> + if (iommu_bus->iommu_ops->get_msi_address_space) {
> + return iommu_bus->iommu_ops->get_msi_address_space(bus,
> + iommu_bus->iommu_opaque, devfn);
> + }
> + return iommu_bus->iommu_ops->get_address_space(bus,
> + iommu_bus->iommu_opaque, devfn);
> + }
> + return &address_space_memory;
> +}
> +
> int pci_iommu_init_iotlb_notifier(PCIDevice *dev, IOMMUNotifier *n,
> IOMMUNotify fn, void *opaque)
> {
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index dfeba8c9bd..b731443c67 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -664,6 +664,21 @@ typedef struct PCIIOMMUOps {
> uint32_t pasid, bool priv_req, bool exec_req,
> hwaddr addr, bool lpig, uint16_t prgi, bool is_read,
> bool is_write);
> + /**
> + * @get_msi_address_space: get the address space for MSI doorbell address
> + * for devices
> + *
> + * Optional callback which returns a pointer to an #AddressSpace. This is
> + * required if the MSI doorbell is also translated through the vIOMMU (e.g. ARM).
> + *
> + * @bus: the #PCIBus being accessed.
> + *
> + * @opaque: the data passed to pci_setup_iommu().
> + *
> + * @devfn: device and function number
> + */
> + AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
> + int devfn);
> } PCIIOMMUOps;
>
> bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
> @@ -672,6 +687,7 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
> Error **errp);
> void pci_device_unset_iommu_device(PCIDevice *dev);
> +AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
>
> /**
> * pci_device_get_viommu_flags: get vIOMMU flags.
> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index 0d57081e69..0df41128d0 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -1611,7 +1611,7 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
> int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
> uint64_t address, uint32_t data, PCIDevice *dev)
> {
> - AddressSpace *as = pci_device_iommu_address_space(dev);
> + AddressSpace *as = pci_device_iommu_msi_address_space(dev);
> hwaddr xlat, len, doorbell_gpa;
> MemoryRegionSection mrs;
> MemoryRegion *mr;
Eric
Hi Eric,

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 04 November 2025 14:12
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
[...]
> Sorry but I still don't follow the above. Can you explain (most probably
> again) why it is a requirement to return the system AS so that VFIO
> can set up stage-2 mappings for the system address space? I am sorry for
> insisting (at the risk of being stubborn or dumb) but I fail to
> understand the requirement. As far as I remember, the way I integrated it
> at the old times did not require that change:
> https://lore.kernel.org/all/20210411120912.15770-1-eric.auger@redhat.com/
> I used a vfio_prereg_listener to force the S2 mapping.

Yes I remember that.

> What has changed that forces us now to do this gymnastics?

This approach achieves the same outcome, but through a different
mechanism. Returning the system address space here ensures that VFIO
sets up the Stage-2 mappings for devices behind the accelerated SMMUv3.

I think this makes sense because, in the accelerated case, the device is
no longer managed by QEMU's SMMUv3 model. The guest owns the Stage-1
context, and the host (VFIO) is responsible for establishing the Stage-2
mappings accordingly.

Do you see any issues with this approach?

Thanks,
Shameer
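To make that mechanism concrete, the accelerated path essentially does
the following for the device's regular DMA address space. This is a
minimal sketch of the behaviour described above, not the actual series
code (which, as discussed below, returns an alias of the system AS
rather than the bare pointer), and the function name is invented:

static AddressSpace *accel_get_address_space(PCIBus *bus, void *opaque,
                                             int devfn)
{
    /*
     * Hand VFIO the guest-PA view: its MemoryListener then sees guest
     * RAM directly and installs the stage-2 mappings, while the guest
     * keeps ownership of stage-1.
     */
    return &address_space_memory;
}

With the emulated (non-accelerated) SMMUv3 the same callback returns the
per-device IOMMU address space instead, and VFIO then has to rely on the
vIOMMU notifier path rather than mapping all of guest RAM.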
On 11/4/25 3:37 PM, Shameer Kolothum wrote:
> Hi Eric,
[...]
>> What has changed that forces us now to do this gymnastics?
> This approach achieves the same outcome, but through a different
> mechanism. Returning the system address space here ensures that VFIO
> sets up the Stage-2 mappings for devices behind the accelerated SMMUv3.
>
> I think this makes sense because, in the accelerated case, the device is
> no longer managed by QEMU's SMMUv3 model. The guest owns the Stage-1
> context, and the host (VFIO) is responsible for establishing the Stage-2
> mappings accordingly.

On the other hand, as we discussed on v4, by returning the system AS you
pretend there is no translation in place, which is not true. Now we use
an alias for it but it has not really removed its usage. Also it forces
us to hack around the MSI mapping and introduce new PCIIOMMUOps. Have
you assessed the feasibility of using vfio_prereg_listener to force the
S2 mapping? Is it simply not relevant anymore or could it be used also
with the iommufd backend integration?

Eric
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 04 November 2025 14:44
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
[...]
> On the other hand, as we discussed on v4, by returning the system AS you
> pretend there is no translation in place, which is not true. Now we use
> an alias for it but it has not really removed its usage. Also it forces
> us to hack around the MSI mapping and introduce new PCIIOMMUOps. Have
> you assessed the feasibility of using vfio_prereg_listener to force the
> S2 mapping? Is it simply not relevant anymore or could it be used also
> with the iommufd backend integration?

IIUC, the prereg_listener mechanism just enables us to set up the S2
mappings. For MSI, in your version, I see that smmu_find_add_as() always
returns the IOMMU AS. How is that supposed to work if the guest has an
S1 bypass mode STE for the device?

Thanks,
Shameer
Hi Shameer,

On 11/4/25 4:14 PM, Shameer Kolothum wrote:
[...]
> IIUC, the prereg_listener mechanism just enables us to set up the S2
> mappings. For MSI, in your version, I see that smmu_find_add_as() always
> returns the IOMMU AS. How is that supposed to work if the guest has an
> S1 bypass mode STE for the device?

In kvm_arch_fixup_msi_route(), as we have as != &address_space_memory in
my case, we proceed with the actual translation for the doorbell gIOVA
using address_space_translate(). I guess if the S1 is in bypass mode
you get the flat translation, no?

Eric
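For context, the "flat translation" mentioned above is roughly what the
emulated SMMUv3's translate callback hands back for an S1-bypass STE. A
simplified sketch, not the exact smmuv3.c code:

IOMMUTLBEntry entry = {
    .target_as = &address_space_memory,
    .iova = addr & ~(hwaddr)0xfff,
    .translated_addr = addr & ~(hwaddr)0xfff, /* bypass: identity map */
    .addr_mask = 0xfff,
    .perm = IOMMU_RW,
};

address_space_translate() on the IOMMU AS therefore resolves the gIOVA to
the same gPA, and memory_region_find() then lands on the vITS doorbell
region.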
Hi Eric,

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 05 November 2025 08:57
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
[...]
>> IIUC, the prereg_listener mechanism just enables us to set up the S2
>> mappings. For MSI, in your version, I see that smmu_find_add_as() always
>> returns the IOMMU AS. How is that supposed to work if the guest has an
>> S1 bypass mode STE for the device?
> In kvm_arch_fixup_msi_route(), as we have as != &address_space_memory in
> my case, we proceed with the actual translation for the doorbell gIOVA
> using address_space_translate(). I guess if the S1 is in bypass mode
> you get the flat translation, no?

Yes, I noted that and replied as well.

Again, coming back to kvm_arch_fixup_msi_route(), I see that this was
introduced as part of your "ARM SMMUv3 Emulation Support" here:
https://lore.kernel.org/qemu-devel/1523518688-26674-12-git-send-email-eric.auger@redhat.com/

The VFIO support was not there at that time. I am trying to understand
why we need this MSI translation for vfio-pci in this accelerated case.
My understanding was that this is to set up the KVM MSI routings via the
KVM_SET_GSI_ROUTING ioctl. Is that right?

Thanks,
Shameer
On 11/5/25 12:41 PM, Shameer Kolothum wrote:
[...]
> The VFIO support was not there at that time. I am trying to understand
> why we need this MSI translation for vfio-pci in this accelerated case.
> My understanding was that this is to set up the KVM MSI routings via the
> KVM_SET_GSI_ROUTING ioctl. Is that right?

Yes, that's correct. This was first needed for vhost integration, and
obviously this is also needed for VFIO. It allows the vhost irqfd to
trigger a GSI that will be routed by KVM to the actual guest doorbell.
On top of that it registers the guest PCI BDF for GICv2m or GICv3 MSI
translation setup.

if the guest doorbell address is wrong because not properly translated,
vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
vgic_its_inject_msi

Eric
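For context, the doorbell address discussed above ends up in the KVM MSI
routing entry roughly like this; a simplified sketch of what QEMU's
kvm_irqchip_add_msi_route() builds, not the exact code:

struct kvm_irq_routing_entry kroute = {
    .gsi = virq,
    .type = KVM_IRQ_ROUTING_MSI,
    .u.msi.address_lo = (uint32_t)msg.address,
    .u.msi.address_hi = msg.address >> 32,
    .u.msi.data = le32_to_cpu(msg.data),
};
/* On ARM this rewrites address_lo/hi with the doorbell gPA */
kvm_arch_fixup_msi_route(&kroute, msg.address, msg.data, dev);

The route is then installed with KVM_SET_GSI_ROUTING, and it is that gPA
which vgic_msi_to_its() in the kernel uses to pick the ITS to inject into.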
On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> if the guest doorbell address is wrong because not properly translated,
> vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> vgic_its_inject_msi

Which has been exactly my point to Nicolin. There is no way to
"properly translate" the vMSI address in a HW accelerated SMMU
emulation.

The vMSI address must only be used for some future non-RMR HW only
path.

To keep this flow working qemu must ignore the IOVA from the guest and
always replace it with its own idea of what the correct ITS address is
for KVM to work. It means we don't correctly emulate guest
misconfiguration of the MSI address.

Thus it should never be "translated" in this configuration, that's a
broken idea when working with the HW accelerated vSMMU.

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: 05 November 2025 18:11
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
[...]
> To keep this flow working qemu must ignore the IOVA from the guest and
> always replace it with its own idea of what the correct ITS address is
> for KVM to work. It means we don't correctly emulate guest
> misconfiguration of the MSI address.
>
> Thus it should never be "translated" in this configuration, that's a
> broken idea when working with the HW accelerated vSMMU.

Ah.. I get it now. You are not questioning the flow here but the
"translate" part. Agree, it is not safe to use smmuv3_translate() in an
HW accelerated case. We need somehow to hook into this path and provide
a correct ITS address for KVM. Hmm.... need to see how to do that in the
least invasive way.

Thanks,
Shameer
On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> > if the guest doorbell address is wrong because not properly translated,
> > vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> > See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> > vgic_its_inject_msi
>
> Which has been exactly my point to Nicolin. There is no way to
> "properly translate" the vMSI address in a HW accelerated SMMU
> emulation.

Hmm, I still can't connect the dots here. QEMU knows where the
guest CD table is to get the stage-1 translation table to walk
through. We could choose to not let it walk through. Yet, why?

Asking this to know what we should justify for the patch in a
different direction.

> The vMSI address must only be used for some future non-RMR HW only
> path.
>
> To keep this flow working qemu must ignore the IOVA from the guest and
> always replace it with its own idea of what the correct ITS address is
> for KVM to work. It means we don't correctly emulate guest
> misconfiguration of the MSI address.

That is something alternative in my mind, to simplify things, especially
as we are having a discussion, on the other side, for selecting a correct
(QEMU) address space depending on whether the vIOMMU needs a stage-1
translation or not. This MSI translate thing makes the whole narrative
more complicated indeed.

We could use a different PCI op to forward the vITS physical address to
the KVM layer, bypassing the translation pathway.

Thanks
Nicolin
On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
> Hmm, I still can't connect the dots here. QEMU knows where the
> guest CD table is to get the stage-1 translation table to walk
> through. We could choose to not let it walk through. Yet, why?

You cannot walk any tables in guest memory without fully trapping all
invalidation on all command queues. Like real HW qemu needs to fence
its walks with any concurrent invalidate & sync to ensure it doesn't
walk into a UAF situation.

Since we can't trap or mediate vCMDQ the walking simply cannot be
done.

Thus, the general principle of the HW accelerated vSMMU is that it
NEVER walks any of these guest tables for any reason.

Thus, we cannot do anything with vMSI address beyond program it
directly into a real PCI device so it undergoes real HW translation.

Jason
On 11/5/25 7:58 PM, Jason Gunthorpe wrote:
> You cannot walk any tables in guest memory without fully trapping all
> invalidation on all command queues. Like real HW qemu needs to fence
> its walks with any concurrent invalidate & sync to ensure it doesn't
> walk into a UAF situation.

But at the moment we do trap IOTLB invalidates so logically we can still
do the translate in that config. The problem you describe will show up
with vCMDQ which is not part of this series.

> Since we can't trap or mediate vCMDQ the walking simply cannot be
> done.
>
> Thus, the general principle of the HW accelerated vSMMU is that it
> NEVER walks any of these guest tables for any reason.
>
> Thus, we cannot do anything with vMSI address beyond program it
> directly into a real PCI device so it undergoes real HW translation.

But anyway you need to provide KVM a valid info about the guest doorbell
for this latter to setup irqfd gsi routing and also program ITS
translation tables. At the moment we have a single vITS in qemu so maybe
we can cheat.

Eric
On Thu, Nov 06, 2025 at 08:42:31AM +0100, Eric Auger wrote:
> But at the moment we do trap IOTLB invalidates so logically we can still
> do the translate in that config. The problem you describe will show up
> with vCMDQ which is not part of this series.

This is why I said:

> > Thus, the general principle of the HW accelerated vSMMU is that it
> > NEVER walks any of these guest tables for any reason.

It would make no sense to add table walking then have to figure out
how to rip it out.

> But anyway you need to provide KVM a valid info about the guest doorbell
> for this latter to setup irqfd gsi routing and also program ITS
> translation tables. At the moment we have a single vITS in qemu so maybe
> we can cheat.

qemu should always know what VITS is linked to a pci device to tell
kvm whatever it needs, even if there are more than one.

Jason
On 11/6/25 3:32 PM, Jason Gunthorpe wrote:
> This is why I said:
>
> >> Thus, the general principle of the HW accelerated vSMMU is that it
> >> NEVER walks any of these guest tables for any reason.
>
> It would make no sense to add table walking then have to figure out
> how to rip it out.

understood. Though strictly speaking you are not adding it as it is
already there ;-)

> qemu should always know what VITS is linked to a pci device to tell
> kvm whatever it needs, even if there are more than one.

Yeah we can work in that direction instead. But this could be worked on
later on along with vcmdq series as well ;-)

Eric
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 06 November 2025 07:43
> To: Jason Gunthorpe <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>
> Cc: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
> ddutile@redhat.com; berrange@redhat.com; Nathan Chen
> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
> External email: Use caution opening links or attachments
>
>
> On 11/5/25 7:58 PM, Jason Gunthorpe wrote:
> > On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
> >> On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
> >>> On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> >>>> if the guest doorbell address is wrong because not properly translated,
> >>>> vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> >>>> See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> >>>> vgic_its_inject_msi
> >>> Which has been exactly my point to Nicolin. There is no way to
> >>> "properly translate" the vMSI address in a HW accelerated SMMU
> >>> emulation.
> >> Hmm, I still can't connect the dots here. QEMU knows where the
> >> guest CD table is to get the stage-1 translation table to walk
> >> through. We could choose to not let it walk through. Yet, why?
> > You cannot walk any tables in guest memory without fully trapping all
> > invalidation on all command queues. Like real HW qemu needs to fence
> > its walks with any concurrent invalidate & sync to ensure it doesn't
> > walk into a UAF situation.
> But at the moment we do trap IOTLB invalidates so logically we can still
> do the translate in that config. The problem you describe will show up
> with vCMDQ which is not part of this series.
> >
> > Since we can't trap or mediate vCMDQ the walking simply cannot be
> > done.
> >
> > Thus, the general principle of the HW accelerated vSMMU is that it
> > NEVER walks any of these guest tables for any reason.
> >
> > Thus, we cannot do anything with vMSI address beyond program it
> > directly into a real PCI device so it undergoes real HW translation.
> But anyway you need to provide KVM a valid info about the guest doorbell
> for this latter to setup irqfd gsi routing and also program ITS
> translation tables. At the moment we have a single vITS in qemu so maybe
> we can cheat.
I have tried to address the “translate” issue below. This introduces a new
get_msi_address() callback to retrieve the MSI doorbell address directly
from the vIOMMU, so we can drop the existing get_msi_address_space() logic.
Please take a look and let me know your thoughts.
Thanks,
Shameer
---
hw/arm/smmuv3-accel.c | 10 ++++++++++
hw/arm/smmuv3.c | 1 +
hw/arm/virt.c | 4 ++++
hw/pci/pci.c | 17 +++++++++++++++++
include/hw/arm/smmuv3.h | 1 +
include/hw/pci/pci.h | 15 +++++++++++++++
target/arm/kvm.c | 14 ++++++++++++--
7 files changed, 60 insertions(+), 2 deletions(-)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index e6c81c4786..8b2a45a915 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -667,6 +667,15 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
}
}
+static uint64_t smmuv3_accel_get_msi_address(PCIBus *bus, void *opaque,
+ int devfn)
+{
+ SMMUState *bs = opaque;
+ SMMUv3State *s = ARM_SMMUV3(bs);
+
+ g_assert(s->msi_doorbell);
+ return s->msi_doorbell;
+}
static AddressSpace *smmuv3_accel_get_msi_as(PCIBus *bus, void *opaque,
int devfn)
{
@@ -788,6 +797,7 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
.set_iommu_device = smmuv3_accel_set_iommu_device,
.unset_iommu_device = smmuv3_accel_unset_iommu_device,
.get_msi_address_space = smmuv3_accel_get_msi_as,
+ .get_msi_address = smmuv3_accel_get_msi_address,
};
void smmuv3_accel_idr_override(SMMUv3State *s)
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 43d297698b..3f2ee8bcce 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -2120,6 +2120,7 @@ static const Property smmuv3_properties[] = {
DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
DEFINE_PROP_BOOL("pasid", SMMUv3State, pasid, false),
+ DEFINE_PROP_UINT64("msi-doorbell", SMMUv3State, msi_doorbell, 0),
};
static void smmuv3_instance_init(Object *obj)
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 2498e3beff..d2dcb89235 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -3097,6 +3097,8 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
create_smmuv3_dev_dtb(vms, dev, bus, errp);
if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
+ hwaddr db_start = base_memmap[VIRT_GIC_ITS].base +
+ ITS_TRANS_SIZE + GITS_TRANSLATER;
char *stage;
stage = object_property_get_str(OBJECT(dev), "stage",
&error_fatal);
@@ -3107,6 +3109,8 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
return;
}
vms->pci_preserve_config = true;
+ object_property_set_uint(OBJECT(dev), "msi-doorbell", db_start,
+ &error_abort);
}
}
}
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 1edd711247..45e79a3c23 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2982,6 +2982,23 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
return &address_space_memory;
}
+bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr *out_doorbell)
+{
+ PCIBus *bus;
+ PCIBus *iommu_bus;
+ int devfn;
+
+ pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
+ if (iommu_bus) {
+ if (iommu_bus->iommu_ops->get_msi_address) {
+ *out_doorbell = iommu_bus->iommu_ops->get_msi_address(bus,
+ iommu_bus->iommu_opaque, devfn);
+ return true;
+ }
+ }
+ return false;
+}
+
AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
{
PCIBus *bus;
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index ee0b5ed74f..f50d8c72bd 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -72,6 +72,7 @@ struct SMMUv3State {
bool ats;
uint8_t oas;
bool pasid;
+ uint64_t msi_doorbell;
};
typedef enum {
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index b731443c67..e1709b0bfe 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -679,6 +679,20 @@ typedef struct PCIIOMMUOps {
*/
AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
int devfn);
+ /**
+ * @get_msi_address: get the address of MSI doorbell for the device
+ * on a PCI bus.
+ *
+ * Optional callback, if implemented must return a valid MSI doorbell
+ * address.
+ *
+ * @bus: the #PCIBus being accessed.
+ *
+ * @opaque: the data passed to pci_setup_iommu().
+ *
+ * @devfn: device and function number
+ */
+ uint64_t (*get_msi_address)(PCIBus *bus, void *opaque, int devfn);
} PCIIOMMUOps;
bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
@@ -688,6 +702,7 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
Error **errp);
void pci_device_unset_iommu_device(PCIDevice *dev);
AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
+bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr *out_doorbell);
/**
* pci_device_get_viommu_flags: get vIOMMU flags.
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 0df41128d0..8d4d2be0bc 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1611,35 +1611,45 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
uint64_t address, uint32_t data, PCIDevice *dev)
{
- AddressSpace *as = pci_device_iommu_msi_address_space(dev);
+ AddressSpace *as;
hwaddr xlat, len, doorbell_gpa;
MemoryRegionSection mrs;
MemoryRegion *mr;
+ /* Check if there is a direct msi address available */
+ if (pci_device_iommu_msi_direct_address(dev, &doorbell_gpa)) {
+ goto set_doorbell;
+ }
+
+ as = pci_device_iommu_msi_address_space(dev);
if (as == &address_space_memory) {
return 0;
}
/* MSI doorbell address is translated by an IOMMU */
- RCU_READ_LOCK_GUARD();
+ rcu_read_lock();
mr = address_space_translate(as, address, &xlat, &len, true,
MEMTXATTRS_UNSPECIFIED);
if (!mr) {
+ rcu_read_unlock();
return 1;
}
mrs = memory_region_find(mr, xlat, 1);
if (!mrs.mr) {
+ rcu_read_unlock();
return 1;
}
doorbell_gpa = mrs.offset_within_address_space;
memory_region_unref(mrs.mr);
+ rcu_read_unlock();
+set_doorbell:
route->u.msi.address_lo = doorbell_gpa;
route->u.msi.address_hi = doorbell_gpa >> 32;
--
Hi Shameer,
On 11/6/25 12:48 PM, Shameer Kolothum wrote:
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 06 November 2025 07:43
>> To: Jason Gunthorpe <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>
>> Cc: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
>> ddutile@redhat.com; berrange@redhat.com; Nathan Chen
>> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> Krishnakant Jaju <kjaju@nvidia.com>
>> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
>> get_msi_address_space() callback
>>
>> External email: Use caution opening links or attachments
>>
>>
>> On 11/5/25 7:58 PM, Jason Gunthorpe wrote:
>>> On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
>>>> On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
>>>>> On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
>>>>>> if the guest doorbell address is wrong because not properly translated,
>>>>>> vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
>>>>>> See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
>>>>>> vgic_its_inject_msi
>>>>> Which has been exactly my point to Nicolin. There is no way to
>>>>> "properly translate" the vMSI address in a HW accelerated SMMU
>>>>> emulation.
>>>> Hmm, I still can't connect the dots here. QEMU knows where the
>>>> guest CD table is to get the stage-1 translation table to walk
>>>> through. We could choose to not let it walk through. Yet, why?
>>> You cannot walk any tables in guest memory without fully trapping all
>>> invalidation on all command queues. Like real HW qemu needs to fence
>>> its walks with any concurrent invalidate & sync to ensure it doesn't
>>> walk into a UAF situation.
>> But at the moment we do trap IOTLB invalidates so logically we can still
>> do the translate in that config. The problem you describe will show up
>> with vCMDQ which is not part of this series.
>>> Since we can't trap or mediate vCMDQ the walking simply cannot be
>>> done.
>>>
>>> Thus, the general principle of the HW accelerated vSMMU is that it
>>> NEVER walks any of these guest tables for any reason.
>>>
>>> Thus, we cannot do anything with vMSI address beyond program it
>>> directly into a real PCI device so it undergoes real HW translation.
>> But anyway you need to provide KVM a valid info about the guest doorbell
>> for this latter to setup irqfd gsi routing and also program ITS
>> translation tables. At the moment we have a single vITS in qemu so maybe
>> we can cheat.
> I have tried to address the “translate” issue below. This introduces a new
> get_msi_address() callback to retrieve the MSI doorbell address directly
> from the vIOMMU, so we can drop the existing get_msi_address_space() logic.
> Please take a look and let me know your thoughts.
>
> Thanks,
> Shameer
>
> ---
> hw/arm/smmuv3-accel.c | 10 ++++++++++
> hw/arm/smmuv3.c | 1 +
> hw/arm/virt.c | 4 ++++
> hw/pci/pci.c | 17 +++++++++++++++++
> include/hw/arm/smmuv3.h | 1 +
> include/hw/pci/pci.h | 15 +++++++++++++++
> target/arm/kvm.c | 14 ++++++++++++--
> 7 files changed, 60 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index e6c81c4786..8b2a45a915 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -667,6 +667,15 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
> }
> }
>
> +static uint64_t smmuv3_accel_get_msi_address(PCIBus *bus, void *opaque,
> + int devfn)
> +{
> + SMMUState *bs = opaque;
> + SMMUv3State *s = ARM_SMMUV3(bs);
> +
> + g_assert(s->msi_doorbell);
> + return s->msi_doorbell;
> +}
> static AddressSpace *smmuv3_accel_get_msi_as(PCIBus *bus, void *opaque,
> int devfn)
> {
> @@ -788,6 +797,7 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
> .set_iommu_device = smmuv3_accel_set_iommu_device,
> .unset_iommu_device = smmuv3_accel_unset_iommu_device,
> .get_msi_address_space = smmuv3_accel_get_msi_as,
to be removed then
> + .get_msi_address = smmuv3_accel_get_msi_address,
> };
>
> void smmuv3_accel_idr_override(SMMUv3State *s)
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index 43d297698b..3f2ee8bcce 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -2120,6 +2120,7 @@ static const Property smmuv3_properties[] = {
> DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
> DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
> DEFINE_PROP_BOOL("pasid", SMMUv3State, pasid, false),
> + DEFINE_PROP_UINT64("msi-doorbell", SMMUv3State, msi_doorbell, 0),
> };
>
> static void smmuv3_instance_init(Object *obj)
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 2498e3beff..d2dcb89235 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -3097,6 +3097,8 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
>
> create_smmuv3_dev_dtb(vms, dev, bus, errp);
> if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
> + hwaddr db_start = base_memmap[VIRT_GIC_ITS].base +
> + ITS_TRANS_SIZE + GITS_TRANSLATER;
there are still use cases where you can target the GICv2M doorbell, so at
least you would need to add some logic to switch between both
> char *stage;
> stage = object_property_get_str(OBJECT(dev), "stage",
> &error_fatal);
> @@ -3107,6 +3109,8 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
> return;
> }
> vms->pci_preserve_config = true;
> + object_property_set_uint(OBJECT(dev), "msi-doorbell", db_start,
> + &error_abort);
> }
> }
> }
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 1edd711247..45e79a3c23 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2982,6 +2982,23 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> return &address_space_memory;
> }
>
> +bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr *out_doorbell)
> +{
> + PCIBus *bus;
> + PCIBus *iommu_bus;
> + int devfn;
> +
> + pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
> + if (iommu_bus) {
> + if (iommu_bus->iommu_ops->get_msi_address) {
> + *out_doorbell = iommu_bus->iommu_ops->get_msi_address(bus,
> + iommu_bus->iommu_opaque, devfn);
> + return true;
> + }
> + }
> + return false;
> +}
> +
> AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
> {
> PCIBus *bus;
> diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> index ee0b5ed74f..f50d8c72bd 100644
> --- a/include/hw/arm/smmuv3.h
> +++ b/include/hw/arm/smmuv3.h
> @@ -72,6 +72,7 @@ struct SMMUv3State {
> bool ats;
> uint8_t oas;
> bool pasid;
> + uint64_t msi_doorbell;
> };
>
> typedef enum {
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index b731443c67..e1709b0bfe 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -679,6 +679,20 @@ typedef struct PCIIOMMUOps {
> */
> AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
> int devfn);
> + /**
> + * @get_msi_address: get the address of MSI doorbell for the device
(gpa) address
> + * on a PCI bus.
> + *
> + * Optional callback, if implemented must return a valid MSI doorbell
> + * address.
> + *
> + * @bus: the #PCIBus being accessed.
> + *
> + * @opaque: the data passed to pci_setup_iommu().
> + *
> + * @devfn: device and function number
> + */
> + uint64_t (*get_msi_address)(PCIBus *bus, void *opaque, int devfn);
> } PCIIOMMUOps;
>
> bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
> @@ -688,6 +702,7 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
> Error **errp);
> void pci_device_unset_iommu_device(PCIDevice *dev);
> AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
> +bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr *out_doorbell);
>
> /**
> * pci_device_get_viommu_flags: get vIOMMU flags.
> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index 0df41128d0..8d4d2be0bc 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -1611,35 +1611,45 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
> int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
> uint64_t address, uint32_t data, PCIDevice *dev)
> {
> - AddressSpace *as = pci_device_iommu_msi_address_space(dev);
> + AddressSpace *as;
> hwaddr xlat, len, doorbell_gpa;
> MemoryRegionSection mrs;
> MemoryRegion *mr;
>
> + /* Check if there is a direct msi address available */
> + if (pci_device_iommu_msi_direct_address(dev, &doorbell_gpa)) {
> + goto set_doorbell;
> + }
> +
> + as = pci_device_iommu_msi_address_space(dev);
logically this should be after the test below (i.e. meaning we have an
IOMMU). But this means that you would use an AS which is not
address_space_memory.
This works but it is not neat either, because it totally ignores the
@address. So you have to build a solid commit msg to explain to readers
why this is needed ;-)
> if (as == &address_space_memory) {
> return 0;
> }
>
> /* MSI doorbell address is translated by an IOMMU */
>
> - RCU_READ_LOCK_GUARD();
> + rcu_read_lock();
>
> mr = address_space_translate(as, address, &xlat, &len, true,
> MEMTXATTRS_UNSPECIFIED);
>
> if (!mr) {
> + rcu_read_unlock();
> return 1;
> }
>
> mrs = memory_region_find(mr, xlat, 1);
>
> if (!mrs.mr) {
> + rcu_read_unlock();
> return 1;
> }
>
> doorbell_gpa = mrs.offset_within_address_space;
> memory_region_unref(mrs.mr);
> + rcu_read_unlock();
>
> +set_doorbell:
> route->u.msi.address_lo = doorbell_gpa;
> route->u.msi.address_hi = doorbell_gpa >> 32;
>
> --
>
>
>
>
>
>
Thanks
Eric
Hi Eric,
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 06 November 2025 17:05
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
[...]
>
> > I have tried to address the “translate” issue below. This introduces a new
> > get_msi_address() callback to retrieve the MSI doorbell address directly
> > from the vIOMMU, so we can drop the existing get_msi_address_space()
> logic.
> > Please take a look and let me know your thoughts.
> >
> > Thanks,
> > Shameer
> >
> > ---
> > hw/arm/smmuv3-accel.c | 10 ++++++++++
> > hw/arm/smmuv3.c | 1 +
> > hw/arm/virt.c | 4 ++++
> > hw/pci/pci.c | 17 +++++++++++++++++
> > include/hw/arm/smmuv3.h | 1 +
> > include/hw/pci/pci.h | 15 +++++++++++++++
> > target/arm/kvm.c | 14 ++++++++++++--
> > 7 files changed, 60 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> > index e6c81c4786..8b2a45a915 100644
> > --- a/hw/arm/smmuv3-accel.c
> > +++ b/hw/arm/smmuv3-accel.c
> > @@ -667,6 +667,15 @@ static void
> smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
> > }
> > }
> >
> > +static uint64_t smmuv3_accel_get_msi_address(PCIBus *bus, void
> *opaque,
> > + int devfn)
> > +{
> > + SMMUState *bs = opaque;
> > + SMMUv3State *s = ARM_SMMUV3(bs);
> > +
> > + g_assert(s->msi_doorbell);
> > + return s->msi_doorbell;
> > +}
> > static AddressSpace *smmuv3_accel_get_msi_as(PCIBus *bus, void
> *opaque,
> > int devfn)
> > {
> > @@ -788,6 +797,7 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
> > .set_iommu_device = smmuv3_accel_set_iommu_device,
> > .unset_iommu_device = smmuv3_accel_unset_iommu_device,
> > .get_msi_address_space = smmuv3_accel_get_msi_as,
> to be removed then
Yes, of course. Will drop that.
> > + .get_msi_address = smmuv3_accel_get_msi_address,
> > };
> >
> > void smmuv3_accel_idr_override(SMMUv3State *s)
> > diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> > index 43d297698b..3f2ee8bcce 100644
> > --- a/hw/arm/smmuv3.c
> > +++ b/hw/arm/smmuv3.c
> > @@ -2120,6 +2120,7 @@ static const Property smmuv3_properties[] = {
> > DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
> > DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
> > DEFINE_PROP_BOOL("pasid", SMMUv3State, pasid, false),
> > + DEFINE_PROP_UINT64("msi-doorbell", SMMUv3State, msi_doorbell, 0),
> > };
> >
> > static void smmuv3_instance_init(Object *obj)
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index 2498e3beff..d2dcb89235 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -3097,6 +3097,8 @@ static void
> virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
> >
> > create_smmuv3_dev_dtb(vms, dev, bus, errp);
> > if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
> > + hwaddr db_start = base_memmap[VIRT_GIC_ITS].base +
> > + ITS_TRANS_SIZE + GITS_TRANSLATER;
> there are still use cases where you could target the GICv2M doorbell, so at
> least you would need to add some logic to switch between both
But with KVM, virt doesn't support GICv2, right?
That reminds me that we should probably add a check that KVM is enabled
for the SMMUv3 accel=on case.
> > char *stage;
> > stage = object_property_get_str(OBJECT(dev), "stage",
> > &error_fatal);
> > @@ -3107,6 +3109,8 @@ static void
> virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
> > return;
> > }
> > vms->pci_preserve_config = true;
> > + object_property_set_uint(OBJECT(dev), "msi-doorbell", db_start,
> > + &error_abort);
> > }
> > }
> > }
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 1edd711247..45e79a3c23 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -2982,6 +2982,23 @@ AddressSpace
> *pci_device_iommu_address_space(PCIDevice *dev)
> > return &address_space_memory;
> > }
> >
> > +bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr
> *out_doorbell)
> > +{
> > + PCIBus *bus;
> > + PCIBus *iommu_bus;
> > + int devfn;
> > +
> > + pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
> > + if (iommu_bus) {
> > + if (iommu_bus->iommu_ops->get_msi_address) {
> > + *out_doorbell = iommu_bus->iommu_ops->get_msi_address(bus,
> > + iommu_bus->iommu_opaque, devfn);
> > + return true;
> > + }
> > + }
> > + return false;
> > +}
> > +
> > AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
> > {
> > PCIBus *bus;
> > diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> > index ee0b5ed74f..f50d8c72bd 100644
> > --- a/include/hw/arm/smmuv3.h
> > +++ b/include/hw/arm/smmuv3.h
> > @@ -72,6 +72,7 @@ struct SMMUv3State {
> > bool ats;
> > uint8_t oas;
> > bool pasid;
> > + uint64_t msi_doorbell;
> > };
> >
> > typedef enum {
> > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> > index b731443c67..e1709b0bfe 100644
> > --- a/include/hw/pci/pci.h
> > +++ b/include/hw/pci/pci.h
> > @@ -679,6 +679,20 @@ typedef struct PCIIOMMUOps {
> > */
> > AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
> > int devfn);
> > + /**
> > + * @get_msi_address: get the address of MSI doorbell for the device
> (gpa) address
> > + * on a PCI bus.
> > + *
> > + * Optional callback, if implemented must return a valid MSI doorbell
> > + * address.
> > + *
> > + * @bus: the #PCIBus being accessed.
> > + *
> > + * @opaque: the data passed to pci_setup_iommu().
> > + *
> > + * @devfn: device and function number
> > + */
> > + uint64_t (*get_msi_address)(PCIBus *bus, void *opaque, int devfn);
> > } PCIIOMMUOps;
> >
> > bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus
> **piommu_bus,
> > @@ -688,6 +702,7 @@ bool pci_device_set_iommu_device(PCIDevice *dev,
> HostIOMMUDevice *hiod,
> > Error **errp);
> > void pci_device_unset_iommu_device(PCIDevice *dev);
> > AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
> > +bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr
> *out_doorbell);
> >
> > /**
> > * pci_device_get_viommu_flags: get vIOMMU flags.
> > diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> > index 0df41128d0..8d4d2be0bc 100644
> > --- a/target/arm/kvm.c
> > +++ b/target/arm/kvm.c
> > @@ -1611,35 +1611,45 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq,
> int level)
> > int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
> > uint64_t address, uint32_t data, PCIDevice *dev)
> > {
> > - AddressSpace *as = pci_device_iommu_msi_address_space(dev);
> > + AddressSpace *as;
> > hwaddr xlat, len, doorbell_gpa;
> > MemoryRegionSection mrs;
> > MemoryRegion *mr;
> >
> > + /* Check if there is a direct msi address available */
> > + if (pci_device_iommu_msi_direct_address(dev, &doorbell_gpa)) {
> > + goto set_doorbell;
> > + }
> > +
> > + as = pci_device_iommu_msi_address_space(dev);
> Logically this should come after the test below (i.e. once we know we have
> an IOMMU). But that means you would be using an AS which is not
> address_space_memory.
Ok. I will move it then.
>
> This works, but it is not neat either because it totally ignores the
> @address. So you will have to write a solid commit msg to explain to readers
> why this is needed ;-)
Sure. I will try to do a solid one explaining why we don’t need @address for
this path😊.
Thanks,
Shameer
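
As an aside to Eric's GICv2M remark above, here is a minimal illustrative
sketch (not part of the posted series) of how the doorbell base could be
selected depending on whether the machine instantiates an ITS. vms->its,
base_memmap and the VIRT_GIC_* indices exist in hw/arm/virt.c, while
V2M_MSI_SETSPI_NS_OFFSET is an assumed name for the GICv2M doorbell
register offset:

/*
 * Illustrative sketch only, not posted code: pick the MSI doorbell base
 * depending on whether the virt machine uses an ITS or a GICv2M.
 */
#define V2M_MSI_SETSPI_NS_OFFSET 0x40    /* assumed doorbell offset */

static hwaddr virt_msi_doorbell(VirtMachineState *vms)
{
    if (vms->its) {
        return base_memmap[VIRT_GIC_ITS].base + ITS_TRANS_SIZE +
               GITS_TRANSLATER;
    }
    return base_memmap[VIRT_GIC_V2M].base + V2M_MSI_SETSPI_NS_OFFSET;
}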
On Wed, Nov 05, 2025 at 02:58:16PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
> > On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
> > > On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> > > > if the guest doorbell address is wrong because not properly translated,
> > > > vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> > > > See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> > > > vgic_its_inject_msi
> > >
> > > Which has been exactly my point to Nicolin. There is no way to
> > > "properly translate" the vMSI address in a HW accelerated SMMU
> > > emulation.
> >
> > Hmm, I still can't connect the dots here. QEMU knows where the
> > guest CD table is to get the stage-1 translation table to walk
> > through. We could choose to not let it walk through. Yet, why?
>
> You cannot walk any tables in guest memory without fully trapping all
> invalidation on all command queues. Like real HW qemu needs to fence
> its walks with any concurrent invalidate & sync to ensure it doesn't
> walk into a UAF situation.
>
> Since we can't trap or mediate vCMDQ the walking simply cannot be
> done.
>
> Thus, the general principle of the HW accelerated vSMMU is that it
> NEVER walks any of these guest tables for any reason.
>
> Thus, we cannot do anything with vMSI address beyond program it
> directly into a real PCI device so it undergoes real HW translation.

It's clear to me now. Thanks for the elaboration!

Nicolin
On 11/4/25 4:14 PM, Shameer Kolothum wrote:
> [...]
>> On the other hand, as we discussed on v4 by returning system as you
>> pretend there is no translation in place which is not true. Now we use
>> an alias for it but it has not really removed its usage. Also it forces
>> use to hack around the MSI mapping and introduce new PCIIOMMUOps.
>> Have you assessed the feasability of using vfio_prereg_listener to force
>> the S2 mapping. Is it simply not relevant anymore or could it be used
>> also with the iommufd be integration? Eric
> IIUC, the prereg_listener mechanism just enables us to setup the s2
> mappings. For MSI, In your version, I see that smmu_find_add_as()
> always returns IOMMU as. How is that supposed to work if the Guest
> has s1 bypass mode STE for the device?

I need to delve into it again as I forgot the details. Will come back to
you ...

Eric
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 04 November 2025 16:02
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
> [...]
> > IIUC, the prereg_listener mechanism just enables us to setup the s2
> > mappings. For MSI, In your version, I see that smmu_find_add_as()
> > always returns IOMMU as. How is that supposed to work if the Guest has
> > s1 bypass mode STE for the device?
>
> I need to delve into it again as I forgot the details. Will come back to you ...

I think the BYPASS case will work anyway as in smmuv3_translate() fn we are
checking the ste config (SMMU_TRANS_BYPASS) and it will just return the
same address back. So we can do the same here in get_msi_address_space()
and return IOMMU as always. And that completely avoids
&address_space_memory from SMMUv3-accel if that’s the concern.

Thanks,
Shameer
On Tue, Nov 04, 2025 at 05:01:57PM +0100, Eric Auger wrote:
> >>>> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> >>>>> On ARM, devices behind an IOMMU have their MSI doorbell addresses
> >>>>> translated by the IOMMU. In nested mode, this translation happens in
> >>>>> two stages (gIOVA → gPA → ITS page).
> >>>>>
> >>>>> In accelerated SMMUv3 mode, both stages are handled by hardware, so
> >>>>> get_address_space() returns the system address space so that VFIO
> >>>>> can setup stage-2 mappings for system address space.
> >>>> Sorry but I still don't catch the above. Can you explain (most probably
> >>>> again) why this is a requirement to return the system as so that VFIO
> >>>> can setup stage-2 mappings for system address space. I am sorry for
> >>>> insisting (at the risk of being stubborn or dumb) but I fail to
> >>>> understand the requirement. As far as I remember the way I integrated it
> >>>> at the old times did not require that change:
> >>>> https://lore.kernel.org/all/20210411120912.15770-1-
> >>>> eric.auger@redhat.com/
> >>>> I used a vfio_prereg_listener to force the S2 mapping.
> >>> Yes I remember that.
> >>>
> >>>> What has changed that forces us now to have this gym
> >>> This approach achieves the same outcome, but through a
> >>> different mechanism. Returning the system address space
> >>> here ensures that VFIO sets up the Stage-2 mappings for
> >>> devices behind the accelerated SMMUv3.
> >>>
> >>> I think, this makes sense because, in the accelerated case, the
> >>> device is no longer managed by QEMU’s SMMUv3 model. The
> >> On the other hand, as we discussed on v4 by returning system as you
> >> pretend there is no translation in place which is not true. Now we use
> >> an alias for it but it has not really removed its usage. Also it forces
> >> use to hack around the MSI mapping and introduce new PCIIOMMUOps.
> >> Have
> >> you assessed the feasability of using vfio_prereg_listener to force the
> >> S2 mapping. Is it simply not relevant anymore or could it be used also
> >> with the iommufd be integration? Eric
> > IIUC, the prereg_listener mechanism just enables us to setup the s2
> > mappings. For MSI, In your version, I see that smmu_find_add_as()
> > always returns IOMMU as. How is that supposed to work if the Guest
> > has s1 bypass mode STE for the device?
>
> I need to delve into it again as I forgot the details. Will come back to
> you ...
We aligned with Intel previously about this system address space.
You might know these very well, yet here are the breakdowns:
1. VFIO core has a container that manages an HWPT. By default, it
allocates a stage-1 normal HWPT, unless vIOMMU requests for a
nesting parent HWPT for accelerated cases.
2. VFIO core adds a listener for that HWPT and sets up a handler
vfio_container_region_add() where it checks the memory region
whether it is iommu or not.
a. In case of !IOMMU as (i.e. system address space), it treats
the address space as a RAM region, and handles all stage-2
mappings for the core allocated nesting parent HWPT.
b. In case of IOMMU as (i.e. a translation type) it sets up
the IOTLB notifier and translation replay while bypassing
the listener for RAM region.
In an accelerated case, we need stage-2 mappings to match with the
nesting parent HWPT. So, returning system address space or an alias
of that notifies the vfio core to take the 2.a path.
If we take 2.b path by returning IOMMU as in smmu_find_add_as, the
VFIO core would no longer listen to the RAM region for us, i.e. no
stage-2 HWPT nor mappings. vIOMMU would have to allocate a nesting
parent and manage the stage-2 mappings by adding a listener in its
own code, which is largely duplicated with the core code.
-------------- so far this works for Intel and ARM--------------
3. On ARM, vPCI device is programmed with gIOVA, so KVM has to
follow what the vPCI is told to inject vIRQs. This requires
a translation at the nested stage-1 address space. Note that
vSMMU in this case doesn't manage translation as it doesn't
need to. But there is no other sane way for KVM to know the
vITS page corresponding to the given gIOVA. So, we invented
the get_msi_address_space op.
(3) makes sense because there is a complication in the MSI that
does a 2-stage translation on ARM and KVM must follow the stage-1
input address, leaving us no choice to have two address spaces.
Thanks
Nicolin
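
To make the 2.a/2.b split above concrete, here is a minimal sketch
(hypothetical names, not the posted code) of a get_address_space() callback
choosing between the two paths; returning the system address space steers
VFIO onto the 2.a path (stage-2 mappings for the nesting-parent HWPT), while
returning an IOMMU address space steers it onto the 2.b path (IOTLB
notifiers and replay):

static AddressSpace *sketch_get_address_space(PCIBus *bus, void *opaque,
                                              int devfn)
{
    SketchIOMMUState *s = opaque;          /* hypothetical vIOMMU state */

    if (s->accel) {
        /* HW-accelerated nesting: let VFIO map guest RAM at stage 2 (2.a) */
        return &address_space_memory;
    }
    /* Emulated translation: per-device IOMMU address space (2.b) */
    return sketch_find_add_as(s, bus, devfn);   /* hypothetical helper */
}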
Hi Nicolin,

On 11/4/25 6:47 PM, Nicolin Chen wrote:
> [...]
> 1. VFIO core has a container that manages an HWPT. By default, it
> allocates a stage-1 normal HWPT, unless vIOMMU requests for a

You may want to specify that this stage-1 normal HWPT is used to map GPA to
HPA (so it eventually implements stage 2).

> nesting parent HWPT for accelerated cases.
> [...]
> b. In case of IOMMU as (i.e. a translation type) it sets up
> the IOTLB notifier and translation replay while bypassing
> the listener for RAM region.

yes, S1+S2 are combined through vfio_iommu_map_notify()

> [...]
> If we take 2.b path by returning IOMMU as in smmu_find_add_as, the
> VFIO core would no longer listen to the RAM region for us, i.e. no
> stage-2 HWPT nor mappings. vIOMMU would have to allocate a nesting

except if you change the VFIO common.c as I did in the past to force the S2
mapping in the nested config. See
https://lore.kernel.org/all/20210411120912.15770-16-eric.auger@redhat.com/
and vfio_prereg_listener()

Again I do not say this is the right way to do it, but using the system
address space is not the "only" implementation choice I think, and it needs
to be properly justified, especially as it has at least 2 side effects:
- somehow abusing the semantic of the returned address space, pretending
  there is no IOMMU translation in place, and
- also impacting the way MSIs are handled (introduction of a new
  PCIIOMMUOps).

This kind of explanation you wrote is absolutely needed in the commit
msg for reviewers to understand the design choice I think.

Eric

> parent and manage the stage-2 mappings by adding a listener in its
> own code, which is largely duplicated with the core code.
> [...]
Hi Eric,

On Wed, Nov 05, 2025 at 08:47:56AM +0100, Eric Auger wrote:
> > 1. VFIO core has a container that manages an HWPT. By default, it
> > allocates a stage-1 normal HWPT, unless vIOMMU requests for a
> You may want to specify that this stage-1 normal HWPT is used to map GPA to
> HPA (so it eventually implements stage 2).

Functional-wise, that would work. But not as clean as creating an S2
nesting-parent HWPT from the beginning, right?

> yes, S1+S2 are combined through vfio_iommu_map_notify()

But that map/unmap notifier is useless in the accelerated mode: we don't
need that emulated-mode translation code (MSI is likely to bypass
translation as well), and we don't need the emulated IOTLB either, since
there is no page table walk-through.

Also, S1 and S2 are separated following the iommufd design. In this regard,
letting the core manage the S2 hwpt and mappings while the vIOMMU handles
the S1 hwpt allocation/attach/invalidation can look much cleaner.

> except if you change the VFIO common.c as I did in the past to force the S2
> mapping in the nested config. See
> https://lore.kernel.org/all/20210411120912.15770-16-eric.auger@redhat.com/
> and vfio_prereg_listener()

Yea, I remember that. But that's somewhat duplicated IMHO. The VFIO core
already registers a listener on guest RAM for the system address space.
Having another set of vfio_prereg_listener does not feel optimal.

> Again I do not say this is the right way to do it, but using the system
> address space is not the "only" implementation choice I think

Oh, neither do I mean that's the "only" way. Sorry I did not make this
clear. I had studied your vfio_prereg_listener approach and Intel's approach
using the system address space, and concluded this "cleaner" way works for
both architectures.

> and it needs to be properly justified, especially as it has at least
> 2 side effects:
> - somehow abusing the semantic of the returned address space, pretending
>   there is no IOMMU translation in place, and

Perhaps we shall say "there is no emulated translation" :)

> - also impacting the way MSIs are handled (introduction of a new
>   PCIIOMMUOps).

That is a solid point. Yet I think it's less confusing now per Jason's
remarks -- we will bypass the translation pathway for MSI in accelerated
mode.

> This kind of explanation you wrote is absolutely needed in the commit
> msg for reviewers to understand the design choice I think.

Sure. My bad that I didn't explain it well in the first place.

Thanks
Nicolin
On Tue, Nov 04, 2025 at 03:11:55PM +0100, Eric Auger wrote:
> > However, QEMU/KVM also calls this callback when resolving
> > MSI doorbells:
> >
> > kvm_irqchip_add_msi_route()
> >  kvm_arch_fixup_msi_route()
> >   pci_device_iommu_address_space()
> >    get_address_space()
> >
> > VFIO device in the guest with a SMMUv3 is programmed with a gIOVA for
> > MSI doorbell. This gIOVA can't be used to setup the MSI doorbell
> > directly. This needs to be translated to vITS gPA. In order to do the
> > doorbell transalation it needs IOMMU address space.

Why does qemu do anything with the msi address? It is opaque and qemu
cannot determine anything meaningful from it. I expect it to ignore it?

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: 04 November 2025 14:21
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
> [...]
>
> Why does qemu do anything with the msi address? It is opaque and qemu
> cannot determine anything meaningful from it. I expect it to ignore it?

I am afraid not. Guest MSI table write gets trapped and it then configures the
doorbell (this is where this patch comes in handy) and sets up the KVM
routing etc.

Thanks,
Shameer
On Tue, Nov 04, 2025 at 02:42:57PM +0000, Shameer Kolothum wrote:
> I am afraid not. Guest MSI table write gets trapped and it then configures the
> doorbell (this is where this patch comes in handy) and sets up the KVM
> routing etc.

Sure it is trapped, but nothing should be looking at the MSI address
from the guest, it is meaningless and wrong information. Just ignore
it.

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: 04 November 2025 14:52
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
> [...]
>
> Sure it is trapped, but nothing should be looking at the MSI address
> from the guest, it is meaningless and wrong information. Just ignore
> it.

Hmm.. we need to setup the doorbell address correctly. If we don't do the
translation here, it will use the Guest IOVA address. Remember, we are using
the IORT RMR identity mapping to get MSI working. See this discussion here,
https://lore.kernel.org/qemu-devel/CH3PR12MB754810AE8D308630041F9AFEABF2A@CH3PR12MB7548.namprd12.prod.outlook.com/

Thanks,
Shameer
On Tue, Nov 04, 2025 at 02:58:44PM +0000, Shameer Kolothum wrote:
> Hmm.. we need to setup the doorbell address correctly. If we don't do the
> translation here, it will use the Guest IOVA address. Remember, we are using
> the IORT RMR identity mapping to get MSI working.

Either you use the RMR value, which is forced by the kernel into the
physical MSI through iommufd and kernel ignores anything qemu
does. So fully ignore the guest's vMSI address.

Eventually qemu should transfer the unchanged guest vMSI address
directly to the kernel, but we haven't figured that out yet.

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: 04 November 2025 15:13
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
> [...]
>
> Either you use the RMR value, which is forced by the kernel into the
> physical MSI through iommufd and kernel ignores anything qemu
> does. So fully ignore the guest's vMSI address.

Well, we are sort of trying to do the same through this patch here.
But to avoid a "translation" completely it will involve some changes to
Qemu pci subsystem. I think this is the least intrusive path I can think
of now. And this is a one time setup mostly.

Thanks,
Shameer
On Tue, Nov 04, 2025 at 03:20:59PM +0000, Shameer Kolothum wrote:
> Well, we are sort of trying to do the same through this patch here.
> But to avoid a "translation" completely it will involve some changes to
> Qemu pci subsystem. I think this is the least intrusive path I can think
> of now. And this is a one time setup mostly.

Should be explained in the commit message that the translation is
pointless. I'm not sure about this, any translation seems risky
because it could fail. The guest can use any IOVA for MSI and none may
fail.

Jason
On Tue, Nov 04, 2025 at 11:35:35AM -0400, Jason Gunthorpe wrote:
> Should be explained in the commit message that the translation is
> pointless. I'm not sure about this, any translation seems risky
> because it could fail. The guest can use any IOVA for MSI and none may
> fail.

In the current design of KVM in QEMU, it does a generic translation
from gIOVA->gPA for the doorbell location to inject IRQ, whether VM
has an accelerated IOMMU or an emulated IOMMU.

In the accelerated case, this translation is pointless for the SMMU HW
underlying. But the IRQ injection routine still stands. We could have
invented something like get_msi_physical_address, but the vPCI device
is programmed with gIOVA for MSI. So it makes sense for VMM to follow
that gIOVA?

Even if the gIOVA is a wrong address, I think VMM shouldn't correct
that, since a real HW wouldn't.

Thanks
Nicolin
On Tue, Nov 04, 2025 at 09:11:55AM -0800, Nicolin Chen wrote:
> In the current design of KVM in QEMU, it does a generic translation
> from gIOVA->gPA for the doorbell location to inject IRQ, whether VM
> has an accelerated IOMMU or an emulated IOMMU.

And what happens if the translation fails because there is no mapping?
It should be ignored for this case and not ignored for others.

Jason
On 11/4/25 6:41 PM, Jason Gunthorpe wrote:
> On Tue, Nov 04, 2025 at 09:11:55AM -0800, Nicolin Chen wrote:
>> [...]
>>> Should be explained in the commit message that the translation is
>>> pointless. I'm not sure about this, any translation seems risky
>>> because it could fail. The guest can use any IOVA for MSI and none may
>>> fail.

in general the translation is not pointless (I mean when RMR are not
applied). In case a vhost device (virtio-net) for instance is protected by
SMMU, vhost triggers irqfds upon which a gsi is injected in the vgic. The
latter does the irq_routing mapping and this gsi is associated to an MSI
address/data. If the MSI address is wrong, ie. not corresponding to the
vITS gpa doorbell, kernel kvm/vgic/vgic-its.c vgic_its_trigger_msi will
fail to inject the MSI in the guest since
vgic_msi_to_its/__vgic_doorbell_to_its will fail to find the ITS instance
to inject in.

Thanks
Eric
On Tue, Nov 04, 2025 at 01:41:52PM -0400, Jason Gunthorpe wrote:
> And what happens if the translation fails because there is no mapping?
> It should be ignored for this case and not ignored for others.

It errors out and does no injection. IOW, yea, "ignored".

Nicolin
On Tue, Nov 04, 2025 at 09:57:53AM -0800, Nicolin Chen wrote:
> It errors out and does no injection. IOW, yea, "ignored".

"does no injection" does not sound like ignored to me..

Jason
On Tue, Nov 04, 2025 at 02:09:28PM -0400, Jason Gunthorpe wrote:
> > It errors out and does no injection. IOW, yea, "ignored".
>
> "does no injection" does not sound like ignored to me..

Sorry. I think I've missed your point.

The hardware path is programmed with a RMR-ed sw_msi in the host
via VFIO's PCI IRQ, ignoring the gIOVA and vITS in the guest VM,
even if the vPCI is programmed with a wrong gIOVA that could not
be translated.

KVM would always get the IRQ from HW, since the HW is programmed
correctly. But if gIOVA->vITS is not mapped, i.e. gIOVA is given
incorrectly, it can't inject the IRQ. (Perhaps vSMMU in this case
should F_TRANSLATION to the device.)

What was the meaning of "ignore" in your remarks?

Thanks
Nicolin
On Tue, Nov 04, 2025 at 10:44:27AM -0800, Nicolin Chen wrote:
> The hardware path is programmed with a RMR-ed sw_msi in the host
> via VFIO's PCI IRQ, ignoring the gIOVA and vITS in the guest VM,
> even if the vPCI is programmed with a wrong gIOVA that could not
> be translated.

Yes

> KVM would always get the IRQ from HW, since the HW is programmed
> correctly. But if gIOVA->vITS is not mapped, i.e. gIOVA is given
> incorrectly, it can't inject the IRQ.

But this is a software interrupt, and I think it should still just
ignore vMSI's address and assume it is mapped to a legal ITS
page. There is just no way to validate it.

Even SW MSI shouldn't fail because the vMSI has some weird IOVA in it
that isn't mapped in the S2. That's wrong and is something the guest
is permitted to do.

Jason
On Tue, Nov 04, 2025 at 02:56:51PM -0400, Jason Gunthorpe wrote:
> But this is a software interrupt, and I think it should still just
> ignore vMSI's address and assume it is mapped to a legal ITS
> page. There is just no way to validate it.
>
> Even SW MSI shouldn't fail because the vMSI has some weird IOVA in it
> that isn't mapped in the S2. That's wrong and is something the guest
> is permitted to do.

Hmm, that feels like a self-correction? But in a baremetal case,
if HW is programmed with a weird IOVA, interrupt would not work,
right?

Thanks
Nicolin
On Tue, Nov 04, 2025 at 11:31:50AM -0800, Nicolin Chen wrote:
> Hmm, that feels like a self-correction? But in a baremetal case,
> if HW is programmed with a weird IOVA, interrupt would not work,
> right?

Right, but qemu has no way to duplicate that behavior unless it walks
the full s1 and s2 page tables, which we have said it isn't going to
do.

So it should probably just ignore this check and assume the IOVA is
set properly, exactly the same as if it was HW injected using the RMR.

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: 04 November 2025 19:35
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
> Right, but qemu has no way to duplicate that behavior unless it walks
> the full s1 and s2 page tables, which we have said it isn't going to
> do.
>
> So it should probably just ignore this check and assume the IOVA is
> set properly, exactly the same as if it was HW injected using the RMR.

TBH, I am a bit lost here. Anyway, this is my understanding:

If we ignore and don't return the correct doorbell (gPA) here,
Qemu will end up invoking KVM_SET_GSI_ROUTING with wrong doorbell
which sets up the in-kernel vgic irq routing information. And when HW
raises the IRQ, KVM can't inject it properly.

Thanks,
Shameer
On Tue, Nov 04, 2025 at 07:46:46PM +0000, Shameer Kolothum wrote:
> If we ignore this and don't return the correct doorbell (gPA) here,
> QEMU will end up invoking KVM_SET_GSI_ROUTING with the wrong doorbell,
> which sets up the in-kernel vGIC IRQ routing information. And when HW
> raises the IRQ, KVM can't inject it properly.

That cannot be true. Again, there is no way for qemu to put something
meaningful into the 'struct kvm_irq_routing_msi' address_lo/hi. It
cannot walk the page tables, so it just ends up with some random,
meaningless guest IOVA.

Qemu MUST ignore the vMSI's address information.

So either the kernel ignores address_lo/hi, OR qemu should match the
vPCI device to its single vGIC and put in the kernel-expected
address_lo/hi always.

It should never, ever use the value from the guest once nesting is
enabled, and it should never be trying to translate the vMSI through
some S2, or any other, address space. Translation is OK for
non-nesting only.

Jason
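[For reference, the UAPI structure Jason refers to, as recalled from
<linux/kvm.h> (field comments added here, so double-check against the
kernel headers): for a GICv3 ITS, address_lo/hi have to name the
guest-physical GITS_TRANSLATER doorbell page and devid the device ID,
which is why an untranslated gIOVA in these fields is meaningless to
the in-kernel vGIC.]

/* From <linux/kvm.h>, consumed via KVM_SET_GSI_ROUTING */
struct kvm_irq_routing_msi {
    __u32 address_lo;   /* low 32 bits of the doorbell guest PA */
    __u32 address_hi;   /* high 32 bits of the doorbell guest PA */
    __u32 data;         /* MSI payload; the ITS event ID for GICv3 */
    union {
        __u32 pad;
        __u32 devid;    /* device ID, valid with KVM_MSI_VALID_DEVID */
    };
};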
On Tue, Nov 04, 2025 at 03:35:21PM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 04, 2025 at 11:31:50AM -0800, Nicolin Chen wrote:
> > On Tue, Nov 04, 2025 at 02:56:51PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Nov 04, 2025 at 10:44:27AM -0800, Nicolin Chen wrote:
> > > > KVM would always get the IRQ from HW, since the HW is programmed
> > > > correctly. But if gIOVA->vITS is not mapped, i.e. gIOVA is given
> > > > incorrectly, it can't inject the IRQ.
> > >
> > > But this is a software interrupt, and I think it should still just
> > > ignore vMSI's address and assume it is mapped to a legal ITS
> > > page. There is just no way to validate it.
> > >
> > > Even SW MSI shouldn't fail because the vMSI has some weird IOVA in it
> > > that isn't mapped in the S2. That's wrong and is something the guest
> > > is permitted to do.
> >
> > Hmm, that feels like a self-correction? But in a baremetal case,
> > if HW is programmed with a weird IOVA, interrupt would not work,
> > right?
>
> Right, but qemu has no way to duplicate that behavior unless it walks
> the full s1 and s2 page tables, which we have said it isn't going to
> do.

I think it could.

The stage-1 page table is in the guest RAM. And vSMMU has already
implemented the logic to walk through a guest page table. What KVM
has already been doing today is to ask vSMMU to translate that.

What we haven't implemented today is, if gIOVA is a weird one that
isn't translatable, vSMMU should trigger an F_TRANSLATION event as
the real HW does.

> So it should probably just ignore this check and assume the IOVA is
> set properly, exactly the same as if it was HW injected using the RMR.

Hmm, I am not sure about that, especially considering our plan to
support the true 2-stage mapping: gIOVA->vITS->pITS :-/

Thanks
Nicolin
On Tue, Nov 04, 2025 at 11:43:07AM -0800, Nicolin Chen wrote:
> > Right, but qemu has no way to duplicate that behavior unless it walks
> > the full s1 and s2 page tables, which we have said it isn't going to
> > do.
>
> I think it could.
>
> The stage-1 page table is in the guest RAM. And vSMMU has already
> implemented the logic to walk through a guest page table. What KVM
> has already been doing today is to ask vSMMU to translate that.

No, we can't. The existing vsmmu code could do it because it mediated
the invalidation path. As soon as you have something like vcmdq the
hypervisor cannot walk the page tables.

> > So it should probably just ignore this check and assume the IOVA is
> > set properly, exactly the same as if it was HW injected using the RMR.
>
> Hmm, I am not sure about that, especially considering our plan to
> support the true 2-stage mapping: gIOVA->vITS->pITS :-/

In true mode the HW path will work perfectly and the SW path will
remain deficient in not checking for invalid configuration.

I don't see another sensible choice.

Jason
On Tue, Nov 04, 2025 at 03:45:52PM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 04, 2025 at 11:43:07AM -0800, Nicolin Chen wrote:
> > > Right, but qemu has no way to duplicate that behavior unless it walks
> > > the full s1 and s2 page tables, which we have said it isn't going to
> > > do.
> >
> > I think it could.
> >
> > The stage-1 page table is in the guest RAM. And vSMMU has already
> > implemented the logic to walk through a guest page table. What KVM
> > has already been doing today is to ask vSMMU to translate that.
>
> No, we can't. The existing vsmmu code could do it because it mediated
> the invalidation path. As soon as you have something like vcmdq the
> hypervisor cannot walk the page tables.

Hmm? It does walk through the page table (not invalidation path):
https://github.com/qemu/qemu/blob/master/hw/arm/smmu-common.c#L444

And VCMDQ can work with that. We've tested it.

Nicolin
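[The link above points at the common vSMMU page-table walk. The way
QEMU reaches it is through the generic IOMMU memory-region translate
hook: translating the doorbell through the vIOMMU address space, as in
the earlier sketch, lands in a callback of the shape below, which is
where the guest stage-1 walk and any F_TRANSLATION-style event for an
unmapped gIOVA would happen. This is a skeleton of the generic hook
only, not the SMMUv3 code itself, and the function name is made up.]

/* Skeleton of an IOMMUMemoryRegionClass translate hook (vIOMMU side). */
static IOMMUTLBEntry example_vsmmu_translate(IOMMUMemoryRegion *iommu_mr,
                                             hwaddr iova,
                                             IOMMUAccessFlags flag,
                                             int iommu_idx)
{
    IOMMUTLBEntry entry = {
        .target_as = &address_space_memory,   /* results are guest PAs */
        .iova = iova & ~0xfffULL,
        .translated_addr = 0,
        .addr_mask = 0xfff,
        .perm = IOMMU_NONE,
    };

    /*
     * A real vSMMU looks up the STE/CD for the stream here, walks the
     * guest stage-1 tables in guest RAM to fill translated_addr/perm,
     * and reports an event (e.g. F_TRANSLATION) if the IOVA is unmapped.
     */
    return entry;
}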