On ARM, devices behind an IOMMU have their MSI doorbell addresses
translated by the IOMMU. In nested mode, this translation happens in
two stages (gIOVA → gPA → ITS page).
In accelerated SMMUv3 mode, both stages are handled by hardware, so
get_address_space() returns the system address space so that VFIO
can set up the stage-2 mappings against it.
However, QEMU/KVM also calls this callback when resolving
MSI doorbells:
kvm_irqchip_add_msi_route()
kvm_arch_fixup_msi_route()
pci_device_iommu_address_space()
get_address_space()
A VFIO device in a guest with an SMMUv3 is programmed with a gIOVA for
the MSI doorbell. This gIOVA can't be used to set up the MSI doorbell
directly; it first needs to be translated to the vITS gPA, and doing
that doorbell translation requires the IOMMU address space.
Add an optional get_msi_address_space() callback and use it in this
path to return the correct address space for such cases.
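For example, a vIOMMU that lets hardware handle both stages could wire
this up roughly as below. This is only an illustrative sketch, not part
of this patch; the MyVIOMMU* names and the my_viommu_find_as() helper
are placeholders:

/* Illustrative only: names below are hypothetical */
static AddressSpace *my_viommu_get_msi_address_space(PCIBus *bus,
                                                     void *opaque,
                                                     int devfn)
{
    MyVIOMMUState *s = opaque;

    /* gIOVA -> vITS gPA resolution needs the per-device IOMMU AS */
    return my_viommu_find_as(s, bus, devfn);
}

static const PCIIOMMUOps my_viommu_ops = {
    /* DMA keeps using the system AS so VFIO maps stage-2 */
    .get_address_space     = my_viommu_get_address_space,
    /* MSI doorbell resolution goes through the IOMMU AS */
    .get_msi_address_space = my_viommu_get_msi_address_space,
};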
Cc: Michael S. Tsirkin <mst@redhat.com>
Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
hw/pci/pci.c | 18 ++++++++++++++++++
include/hw/pci/pci.h | 16 ++++++++++++++++
target/arm/kvm.c | 2 +-
3 files changed, 35 insertions(+), 1 deletion(-)
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index fa9cf5dab2..1edd711247 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2982,6 +2982,24 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
return &address_space_memory;
}
+AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
+{
+ PCIBus *bus;
+ PCIBus *iommu_bus;
+ int devfn;
+
+ pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
+ if (iommu_bus) {
+ if (iommu_bus->iommu_ops->get_msi_address_space) {
+ return iommu_bus->iommu_ops->get_msi_address_space(bus,
+ iommu_bus->iommu_opaque, devfn);
+ }
+ return iommu_bus->iommu_ops->get_address_space(bus,
+ iommu_bus->iommu_opaque, devfn);
+ }
+ return &address_space_memory;
+}
+
int pci_iommu_init_iotlb_notifier(PCIDevice *dev, IOMMUNotifier *n,
IOMMUNotify fn, void *opaque)
{
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index dfeba8c9bd..b731443c67 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -664,6 +664,21 @@ typedef struct PCIIOMMUOps {
uint32_t pasid, bool priv_req, bool exec_req,
hwaddr addr, bool lpig, uint16_t prgi, bool is_read,
bool is_write);
+ /**
+ * @get_msi_address_space: get the address space for MSI doorbell address
+ * for devices
+ *
+ * Optional callback which returns a pointer to an #AddressSpace. This is
+ * required if the MSI doorbell is also translated through the vIOMMU (e.g. ARM).
+ *
+ * @bus: the #PCIBus being accessed.
+ *
+ * @opaque: the data passed to pci_setup_iommu().
+ *
+ * @devfn: device and function number
+ */
+ AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
+ int devfn);
} PCIIOMMUOps;
bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
@@ -672,6 +687,7 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
Error **errp);
void pci_device_unset_iommu_device(PCIDevice *dev);
+AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
/**
* pci_device_get_viommu_flags: get vIOMMU flags.
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 0d57081e69..0df41128d0 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1611,7 +1611,7 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
uint64_t address, uint32_t data, PCIDevice *dev)
{
- AddressSpace *as = pci_device_iommu_address_space(dev);
+ AddressSpace *as = pci_device_iommu_msi_address_space(dev);
hwaddr xlat, len, doorbell_gpa;
MemoryRegionSection mrs;
MemoryRegion *mr;
--
2.43.0
Hi Shameer, Nicolin,
On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> On ARM, devices behind an IOMMU have their MSI doorbell addresses
> translated by the IOMMU. In nested mode, this translation happens in
> two stages (gIOVA → gPA → ITS page).
>
> In accelerated SMMUv3 mode, both stages are handled by hardware, so
> get_address_space() returns the system address space so that VFIO
> can set up the stage-2 mappings against it.
Sorry but I still don't follow the above. Can you explain (most probably
again) why it is a requirement to return the system AS so that VFIO
can set up stage-2 mappings for the system address space? I am sorry for
insisting (at the risk of being stubborn or dumb) but I fail to
understand the requirement. As far as I remember, the way I integrated it
at the old times did not require that change:
https://lore.kernel.org/all/20210411120912.15770-1-eric.auger@redhat.com/
I used a vfio_prereg_listener to force the S2 mapping.
What has changed that forces us now to do this gymnastics?
>
> However, QEMU/KVM also calls this callback when resolving
> MSI doorbells:
>
> kvm_irqchip_add_msi_route()
> kvm_arch_fixup_msi_route()
> pci_device_iommu_address_space()
> get_address_space()
>
> A VFIO device in a guest with an SMMUv3 is programmed with a gIOVA for
> the MSI doorbell. This gIOVA can't be used to set up the MSI doorbell
> directly; it first needs to be translated to the vITS gPA, and doing
> that doorbell translation requires the IOMMU address space.
>
> Add an optional get_msi_address_space() callback and use it in this
> path to return the correct address space for such cases.
>
> Cc: Michael S. Tsirkin <mst@redhat.com>
> Suggested-by: Nicolin Chen <nicolinc@nvidia.com>
> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
> hw/pci/pci.c | 18 ++++++++++++++++++
> include/hw/pci/pci.h | 16 ++++++++++++++++
> target/arm/kvm.c | 2 +-
> 3 files changed, 35 insertions(+), 1 deletion(-)
>
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index fa9cf5dab2..1edd711247 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2982,6 +2982,24 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> return &address_space_memory;
> }
>
> +AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
> +{
> + PCIBus *bus;
> + PCIBus *iommu_bus;
> + int devfn;
> +
> + pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
> + if (iommu_bus) {
> + if (iommu_bus->iommu_ops->get_msi_address_space) {
> + return iommu_bus->iommu_ops->get_msi_address_space(bus,
> + iommu_bus->iommu_opaque, devfn);
> + }
> + return iommu_bus->iommu_ops->get_address_space(bus,
> + iommu_bus->iommu_opaque, devfn);
> + }
> + return &address_space_memory;
> +}
> +
> int pci_iommu_init_iotlb_notifier(PCIDevice *dev, IOMMUNotifier *n,
> IOMMUNotify fn, void *opaque)
> {
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index dfeba8c9bd..b731443c67 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -664,6 +664,21 @@ typedef struct PCIIOMMUOps {
> uint32_t pasid, bool priv_req, bool exec_req,
> hwaddr addr, bool lpig, uint16_t prgi, bool is_read,
> bool is_write);
> + /**
> + * @get_msi_address_space: get the address space for MSI doorbell address
> + * for devices
> + *
> + * Optional callback which returns a pointer to an #AddressSpace. This is
> + * required if the MSI doorbell is also translated through the vIOMMU (e.g. ARM).
> + *
> + * @bus: the #PCIBus being accessed.
> + *
> + * @opaque: the data passed to pci_setup_iommu().
> + *
> + * @devfn: device and function number
> + */
> + AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
> + int devfn);
> } PCIIOMMUOps;
>
> bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
> @@ -672,6 +687,7 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev);
> bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
> Error **errp);
> void pci_device_unset_iommu_device(PCIDevice *dev);
> +AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
>
> /**
> * pci_device_get_viommu_flags: get vIOMMU flags.
> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index 0d57081e69..0df41128d0 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -1611,7 +1611,7 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
> int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
> uint64_t address, uint32_t data, PCIDevice *dev)
> {
> - AddressSpace *as = pci_device_iommu_address_space(dev);
> + AddressSpace *as = pci_device_iommu_msi_address_space(dev);
> hwaddr xlat, len, doorbell_gpa;
> MemoryRegionSection mrs;
> MemoryRegion *mr;
Eric
Hi Eric,

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 04 November 2025 14:12
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
[...]
> Sorry but I still don't follow the above. Can you explain (most probably
> again) why it is a requirement to return the system AS so that VFIO
> can set up stage-2 mappings for the system address space? I am sorry for
> insisting (at the risk of being stubborn or dumb) but I fail to
> understand the requirement. As far as I remember, the way I integrated it
> at the old times did not require that change:
> https://lore.kernel.org/all/20210411120912.15770-1-eric.auger@redhat.com/
> I used a vfio_prereg_listener to force the S2 mapping.

Yes I remember that.

> What has changed that forces us now to do this gymnastics?

This approach achieves the same outcome, but through a different
mechanism. Returning the system address space here ensures that VFIO
sets up the Stage-2 mappings for devices behind the accelerated SMMUv3.

I think this makes sense because, in the accelerated case, the device is
no longer managed by QEMU's SMMUv3 model. The guest owns the Stage-1
context, and the host (VFIO) is responsible for establishing the Stage-2
mappings accordingly.

Do you see any issues with this approach?

Thanks,
Shameer
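To make that mechanism concrete, the accelerated path essentially does
the following for the device's regular DMA address space. This is a
minimal sketch of the behaviour described above, not the actual series
code (which, as discussed below, returns an alias of the system AS
rather than the bare pointer), and the function name is invented:

static AddressSpace *accel_get_address_space(PCIBus *bus, void *opaque,
                                             int devfn)
{
    /*
     * Hand VFIO the guest-PA view: its MemoryListener then sees guest
     * RAM directly and installs the stage-2 mappings, while the guest
     * keeps ownership of stage-1.
     */
    return &address_space_memory;
}

With the emulated (non-accelerated) SMMUv3 the same callback returns the
per-device IOMMU address space instead, and VFIO then has to rely on the
vIOMMU notifier path rather than mapping all of guest RAM.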
On 11/4/25 3:37 PM, Shameer Kolothum wrote:
> Hi Eric,
[...]
>> What has changed that forces us now to do this gymnastics?
> This approach achieves the same outcome, but through a different
> mechanism. Returning the system address space here ensures that VFIO
> sets up the Stage-2 mappings for devices behind the accelerated SMMUv3.
>
> I think this makes sense because, in the accelerated case, the device is
> no longer managed by QEMU's SMMUv3 model. The guest owns the Stage-1
> context, and the host (VFIO) is responsible for establishing the Stage-2
> mappings accordingly.

On the other hand, as we discussed on v4, by returning the system AS you
pretend there is no translation in place, which is not true. Now we use
an alias for it but it has not really removed its usage. Also it forces
us to hack around the MSI mapping and introduce new PCIIOMMUOps. Have
you assessed the feasibility of using vfio_prereg_listener to force the
S2 mapping? Is it simply not relevant anymore or could it be used also
with the iommufd backend integration?

Eric
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 04 November 2025 14:44
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
[...]
> On the other hand, as we discussed on v4, by returning the system AS you
> pretend there is no translation in place, which is not true. Now we use
> an alias for it but it has not really removed its usage. Also it forces
> us to hack around the MSI mapping and introduce new PCIIOMMUOps. Have
> you assessed the feasibility of using vfio_prereg_listener to force the
> S2 mapping? Is it simply not relevant anymore or could it be used also
> with the iommufd backend integration?

IIUC, the prereg_listener mechanism just enables us to set up the S2
mappings. For MSI, in your version, I see that smmu_find_add_as() always
returns the IOMMU AS. How is that supposed to work if the guest has an
S1 bypass mode STE for the device?

Thanks,
Shameer
Hi Shameer,

On 11/4/25 4:14 PM, Shameer Kolothum wrote:
[...]
> IIUC, the prereg_listener mechanism just enables us to set up the S2
> mappings. For MSI, in your version, I see that smmu_find_add_as() always
> returns the IOMMU AS. How is that supposed to work if the guest has an
> S1 bypass mode STE for the device?

In kvm_arch_fixup_msi_route(), as we have as != &address_space_memory in
my case, we proceed with the actual translation for the doorbell gIOVA
using address_space_translate(). I guess if the S1 is in bypass mode
you get the flat translation, no?

Eric
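For context, the "flat translation" mentioned above is roughly what the
emulated SMMUv3's translate callback hands back for an S1-bypass STE. A
simplified sketch, not the exact smmuv3.c code:

IOMMUTLBEntry entry = {
    .target_as = &address_space_memory,
    .iova = addr & ~(hwaddr)0xfff,
    .translated_addr = addr & ~(hwaddr)0xfff, /* bypass: identity map */
    .addr_mask = 0xfff,
    .perm = IOMMU_RW,
};

address_space_translate() on the IOMMU AS therefore resolves the gIOVA to
the same gPA, and memory_region_find() then lands on the vITS doorbell
region.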
Hi Eric,

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 05 November 2025 08:57
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
[...]
>> IIUC, the prereg_listener mechanism just enables us to set up the S2
>> mappings. For MSI, in your version, I see that smmu_find_add_as() always
>> returns the IOMMU AS. How is that supposed to work if the guest has an
>> S1 bypass mode STE for the device?
> In kvm_arch_fixup_msi_route(), as we have as != &address_space_memory in
> my case, we proceed with the actual translation for the doorbell gIOVA
> using address_space_translate(). I guess if the S1 is in bypass mode
> you get the flat translation, no?

Yes, I noted that and replied as well.

Again, coming back to kvm_arch_fixup_msi_route(), I see that this was
introduced as part of your "ARM SMMUv3 Emulation Support" here:
https://lore.kernel.org/qemu-devel/1523518688-26674-12-git-send-email-eric.auger@redhat.com/

The VFIO support was not there at that time. I am trying to understand
why we need this MSI translation for vfio-pci in this accelerated case.
My understanding was that this is to set up the KVM MSI routings via the
KVM_SET_GSI_ROUTING ioctl. Is that right?

Thanks,
Shameer
On 11/5/25 12:41 PM, Shameer Kolothum wrote:
[...]
> The VFIO support was not there at that time. I am trying to understand
> why we need this MSI translation for vfio-pci in this accelerated case.
> My understanding was that this is to set up the KVM MSI routings via the
> KVM_SET_GSI_ROUTING ioctl. Is that right?

Yes, that's correct. This was first needed for vhost integration, and
obviously this is also needed for VFIO. It allows the vhost irqfd to
trigger a GSI that will be routed by KVM to the actual guest doorbell.
On top of that it registers the guest PCI BDF for GICv2m or GICv3 MSI
translation setup.

if the guest doorbell address is wrong because not properly translated,
vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
vgic_its_inject_msi

Eric
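For context, the doorbell address discussed above ends up in the KVM MSI
routing entry roughly like this; a simplified sketch of what QEMU's
kvm_irqchip_add_msi_route() builds, not the exact code:

struct kvm_irq_routing_entry kroute = {
    .gsi = virq,
    .type = KVM_IRQ_ROUTING_MSI,
    .u.msi.address_lo = (uint32_t)msg.address,
    .u.msi.address_hi = msg.address >> 32,
    .u.msi.data = le32_to_cpu(msg.data),
};
/* On ARM this rewrites address_lo/hi with the doorbell gPA */
kvm_arch_fixup_msi_route(&kroute, msg.address, msg.data, dev);

The route is then installed with KVM_SET_GSI_ROUTING, and it is that gPA
which vgic_msi_to_its() in the kernel uses to pick the ITS to inject into.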
On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> if the guest doorbell address is wrong because not properly translated,
> vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> vgic_its_inject_msi

Which has been exactly my point to Nicolin. There is no way to
"properly translate" the vMSI address in a HW accelerated SMMU
emulation.

The vMSI address must only be used for some future non-RMR HW only
path.

To keep this flow working qemu must ignore the IOVA from the guest and
always replace it with its own idea of what the correct ITS address is
for KVM to work. It means we don't correctly emulate guest
misconfiguration of the MSI address.

Thus it should never be "translated" in this configuration, that's a
broken idea when working with the HW accelerated vSMMU.

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: 05 November 2025 18:11
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
[...]
> To keep this flow working qemu must ignore the IOVA from the guest and
> always replace it with its own idea of what the correct ITS address is
> for KVM to work. It means we don't correctly emulate guest
> misconfiguration of the MSI address.
>
> Thus it should never be "translated" in this configuration, that's a
> broken idea when working with the HW accelerated vSMMU.

Ah.. I get it now. You are not questioning the flow here but the
"translate" part. Agree, it is not safe to use smmuv3_translate() in an
HW accelerated case. We need somehow to hook into this path and provide
a correct ITS address for KVM. Hmm.... need to see how to do that in the
least invasive way.

Thanks,
Shameer
On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> > if the guest doorbell address is wrong because not properly translated,
> > vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> > See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> > vgic_its_inject_msi
>
> Which has been exactly my point to Nicolin. There is no way to
> "properly translate" the vMSI address in a HW accelerated SMMU
> emulation.

Hmm, I still can't connect the dots here. QEMU knows where the
guest CD table is to get the stage-1 translation table to walk
through. We could choose to not let it walk through. Yet, why?

Asking this to know what we should justify for the patch in a
different direction.

> The vMSI address must only be used for some future non-RMR HW only
> path.
>
> To keep this flow working qemu must ignore the IOVA from the guest and
> always replace it with its own idea of what the correct ITS address is
> for KVM to work. It means we don't correctly emulate guest
> misconfiguration of the MSI address.

That is something alternative in my mind, to simplify things, especially
as we are having a discussion, on the other side, for selecting a correct
(QEMU) address space depending on whether the vIOMMU needs a stage-1
translation or not. This MSI translate thing makes the whole narrative
more complicated indeed.

We could use a different PCI op to forward the vITS physical address to
the KVM layer, bypassing the translation pathway.

Thanks
Nicolin
On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
> Hmm, I still can't connect the dots here. QEMU knows where the
> guest CD table is to get the stage-1 translation table to walk
> through. We could choose to not let it walk through. Yet, why?

You cannot walk any tables in guest memory without fully trapping all
invalidation on all command queues. Like real HW qemu needs to fence
its walks with any concurrent invalidate & sync to ensure it doesn't
walk into a UAF situation.

Since we can't trap or mediate vCMDQ the walking simply cannot be
done.

Thus, the general principle of the HW accelerated vSMMU is that it
NEVER walks any of these guest tables for any reason.

Thus, we cannot do anything with vMSI address beyond program it
directly into a real PCI device so it undergoes real HW translation.

Jason
On 11/5/25 7:58 PM, Jason Gunthorpe wrote:
> You cannot walk any tables in guest memory without fully trapping all
> invalidation on all command queues. Like real HW qemu needs to fence
> its walks with any concurrent invalidate & sync to ensure it doesn't
> walk into a UAF situation.

But at the moment we do trap IOTLB invalidates so logically we can still
do the translate in that config. The problem you describe will show up
with vCMDQ which is not part of this series.

> Since we can't trap or mediate vCMDQ the walking simply cannot be
> done.
>
> Thus, the general principle of the HW accelerated vSMMU is that it
> NEVER walks any of these guest tables for any reason.
>
> Thus, we cannot do anything with vMSI address beyond program it
> directly into a real PCI device so it undergoes real HW translation.

But anyway you need to provide KVM a valid info about the guest doorbell
for this latter to setup irqfd gsi routing and also program ITS
translation tables. At the moment we have a single vITS in qemu so maybe
we can cheat.

Eric
On Thu, Nov 06, 2025 at 08:42:31AM +0100, Eric Auger wrote:
> But at the moment we do trap IOTLB invalidates so logically we can still
> do the translate in that config. The problem you describe will show up
> with vCMDQ which is not part of this series.

This is why I said:

> > Thus, the general principle of the HW accelerated vSMMU is that it
> > NEVER walks any of these guest tables for any reason.

It would make no sense to add table walking then have to figure out
how to rip it out.

> But anyway you need to provide KVM a valid info about the guest doorbell
> for this latter to setup irqfd gsi routing and also program ITS
> translation tables. At the moment we have a single vITS in qemu so maybe
> we can cheat.

qemu should always know what VITS is linked to a pci device to tell
kvm whatever it needs, even if there are more than one.

Jason
On 11/6/25 3:32 PM, Jason Gunthorpe wrote:
> This is why I said:
>
> >> Thus, the general principle of the HW accelerated vSMMU is that it
> >> NEVER walks any of these guest tables for any reason.
>
> It would make no sense to add table walking then have to figure out
> how to rip it out.

understood. Though strictly speaking you are not adding it as it is
already there ;-)

> qemu should always know what VITS is linked to a pci device to tell
> kvm whatever it needs, even if there are more than one.

Yeah we can work in that direction instead. But this could be worked on
later on along with vcmdq series as well ;-)

Eric
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 06 November 2025 07:43
> To: Jason Gunthorpe <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>
> Cc: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
> ddutile@redhat.com; berrange@redhat.com; Nathan Chen
> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
> External email: Use caution opening links or attachments
>
>
> On 11/5/25 7:58 PM, Jason Gunthorpe wrote:
> > On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
> >> On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
> >>> On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> >>>> if the guest doorbell address is wrong because not properly translated,
> >>>> vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> >>>> See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> >>>> vgic_its_inject_msi
> >>> Which has been exactly my point to Nicolin. There is no way to
> >>> "properly translate" the vMSI address in a HW accelerated SMMU
> >>> emulation.
> >> Hmm, I still can't connect the dots here. QEMU knows where the
> >> guest CD table is to get the stage-1 translation table to walk
> >> through. We could choose to not let it walk through. Yet, why?
> > You cannot walk any tables in guest memory without fully trapping all
> > invalidation on all command queues. Like real HW qemu needs to fence
> > its walks with any concurrent invalidate & sync to ensure it doesn't
> > walk into a UAF situation.
> But at the moment we do trap IOTLB invalidates so logically we can still
> do the translate in that config. The problem you describe will show up
> with vCMDQ which is not part of this series.
> >
> > Since we can't trap or mediate vCMDQ the walking simply cannot be
> > done.
> >
> > Thus, the general principle of the HW accelerated vSMMU is that it
> > NEVER walks any of these guest tables for any reason.
> >
> > Thus, we cannot do anything with vMSI address beyond program it
> > directly into a real PCI device so it undergoes real HW translation.
> But anyway you need to provide KVM a valid info about the guest doorbell
> for this latter to setup irqfd gsi routing and also program ITS
> translation tables. At the moment we have a single vITS in qemu so maybe
> we can cheat.
I have tried to address the “translate” issue below. This introduces a new
get_msi_address() callback to retrieve the MSI doorbell address directly
from the vIOMMU, so we can drop the existing get_msi_address_space() logic.
Please take a look and let me know your thoughts.
Thanks,
Shameer
---
hw/arm/smmuv3-accel.c | 10 ++++++++++
hw/arm/smmuv3.c | 1 +
hw/arm/virt.c | 4 ++++
hw/pci/pci.c | 17 +++++++++++++++++
include/hw/arm/smmuv3.h | 1 +
include/hw/pci/pci.h | 15 +++++++++++++++
target/arm/kvm.c | 14 ++++++++++++--
7 files changed, 60 insertions(+), 2 deletions(-)
diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
index e6c81c4786..8b2a45a915 100644
--- a/hw/arm/smmuv3-accel.c
+++ b/hw/arm/smmuv3-accel.c
@@ -667,6 +667,15 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
}
}
+static uint64_t smmuv3_accel_get_msi_address(PCIBus *bus, void *opaque,
+ int devfn)
+{
+ SMMUState *bs = opaque;
+ SMMUv3State *s = ARM_SMMUV3(bs);
+
+ g_assert(s->msi_doorbell);
+ return s->msi_doorbell;
+}
static AddressSpace *smmuv3_accel_get_msi_as(PCIBus *bus, void *opaque,
int devfn)
{
@@ -788,6 +797,7 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
.set_iommu_device = smmuv3_accel_set_iommu_device,
.unset_iommu_device = smmuv3_accel_unset_iommu_device,
.get_msi_address_space = smmuv3_accel_get_msi_as,
+ .get_msi_address = smmuv3_accel_get_msi_address,
};
void smmuv3_accel_idr_override(SMMUv3State *s)
diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
index 43d297698b..3f2ee8bcce 100644
--- a/hw/arm/smmuv3.c
+++ b/hw/arm/smmuv3.c
@@ -2120,6 +2120,7 @@ static const Property smmuv3_properties[] = {
DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
DEFINE_PROP_BOOL("pasid", SMMUv3State, pasid, false),
+ DEFINE_PROP_UINT64("msi-doorbell", SMMUv3State, msi_doorbell, 0),
};
static void smmuv3_instance_init(Object *obj)
diff --git a/hw/arm/virt.c b/hw/arm/virt.c
index 2498e3beff..d2dcb89235 100644
--- a/hw/arm/virt.c
+++ b/hw/arm/virt.c
@@ -3097,6 +3097,8 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
create_smmuv3_dev_dtb(vms, dev, bus, errp);
if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
+ hwaddr db_start = base_memmap[VIRT_GIC_ITS].base +
+ ITS_TRANS_SIZE + GITS_TRANSLATER;
char *stage;
stage = object_property_get_str(OBJECT(dev), "stage",
&error_fatal);
@@ -3107,6 +3109,8 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
return;
}
vms->pci_preserve_config = true;
+ object_property_set_uint(OBJECT(dev), "msi-doorbell", db_start,
+ &error_abort);
}
}
}
diff --git a/hw/pci/pci.c b/hw/pci/pci.c
index 1edd711247..45e79a3c23 100644
--- a/hw/pci/pci.c
+++ b/hw/pci/pci.c
@@ -2982,6 +2982,23 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
return &address_space_memory;
}
+bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr *out_doorbell)
+{
+ PCIBus *bus;
+ PCIBus *iommu_bus;
+ int devfn;
+
+ pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
+ if (iommu_bus) {
+ if (iommu_bus->iommu_ops->get_msi_address) {
+ *out_doorbell = iommu_bus->iommu_ops->get_msi_address(bus,
+ iommu_bus->iommu_opaque, devfn);
+ return true;
+ }
+ }
+ return false;
+}
+
AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
{
PCIBus *bus;
diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
index ee0b5ed74f..f50d8c72bd 100644
--- a/include/hw/arm/smmuv3.h
+++ b/include/hw/arm/smmuv3.h
@@ -72,6 +72,7 @@ struct SMMUv3State {
bool ats;
uint8_t oas;
bool pasid;
+ uint64_t msi_doorbell;
};
typedef enum {
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index b731443c67..e1709b0bfe 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -679,6 +679,20 @@ typedef struct PCIIOMMUOps {
*/
AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
int devfn);
+ /**
+ * @get_msi_address: get the address of MSI doorbell for the device
+ * on a PCI bus.
+ *
+ * Optional callback, if implemented must return a valid MSI doorbell
+ * address.
+ *
+ * @bus: the #PCIBus being accessed.
+ *
+ * @opaque: the data passed to pci_setup_iommu().
+ *
+ * @devfn: device and function number
+ */
+ uint64_t (*get_msi_address)(PCIBus *bus, void *opaque, int devfn);
} PCIIOMMUOps;
bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
@@ -688,6 +702,7 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
Error **errp);
void pci_device_unset_iommu_device(PCIDevice *dev);
AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
+bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr *out_doorbell);
/**
* pci_device_get_viommu_flags: get vIOMMU flags.
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 0df41128d0..8d4d2be0bc 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -1611,35 +1611,45 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
uint64_t address, uint32_t data, PCIDevice *dev)
{
- AddressSpace *as = pci_device_iommu_msi_address_space(dev);
+ AddressSpace *as;
hwaddr xlat, len, doorbell_gpa;
MemoryRegionSection mrs;
MemoryRegion *mr;
+ /* Check if there is a direct msi address available */
+ if (pci_device_iommu_msi_direct_address(dev, &doorbell_gpa)) {
+ goto set_doorbell;
+ }
+
+ as = pci_device_iommu_msi_address_space(dev);
if (as == &address_space_memory) {
return 0;
}
/* MSI doorbell address is translated by an IOMMU */
- RCU_READ_LOCK_GUARD();
+ rcu_read_lock();
mr = address_space_translate(as, address, &xlat, &len, true,
MEMTXATTRS_UNSPECIFIED);
if (!mr) {
+ rcu_read_unlock();
return 1;
}
mrs = memory_region_find(mr, xlat, 1);
if (!mrs.mr) {
+ rcu_read_unlock();
return 1;
}
doorbell_gpa = mrs.offset_within_address_space;
memory_region_unref(mrs.mr);
+ rcu_read_unlock();
+set_doorbell:
route->u.msi.address_lo = doorbell_gpa;
route->u.msi.address_hi = doorbell_gpa >> 32;
--
Hi Shameer,
On 11/6/25 12:48 PM, Shameer Kolothum wrote:
>> -----Original Message-----
>> From: Eric Auger <eric.auger@redhat.com>
>> Sent: 06 November 2025 07:43
>> To: Jason Gunthorpe <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>
>> Cc: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-
>> arm@nongnu.org; qemu-devel@nongnu.org; peter.maydell@linaro.org;
>> ddutile@redhat.com; berrange@redhat.com; Nathan Chen
>> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> Krishnakant Jaju <kjaju@nvidia.com>
>> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
>> get_msi_address_space() callback
>>
>> External email: Use caution opening links or attachments
>>
>>
>> On 11/5/25 7:58 PM, Jason Gunthorpe wrote:
>>> On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
>>>> On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
>>>>> On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
>>>>>> if the guest doorbell address is wrong because not properly translated,
>>>>>> vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
>>>>>> See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
>>>>>> vgic_its_inject_msi
>>>>> Which has been exactly my point to Nicolin. There is no way to
>>>>> "properly translate" the vMSI address in a HW accelerated SMMU
>>>>> emulation.
>>>> Hmm, I still can't connect the dots here. QEMU knows where the
>>>> guest CD table is to get the stage-1 translation table to walk
>>>> through. We could choose to not let it walk through. Yet, why?
>>> You cannot walk any tables in guest memory without fully trapping all
>>> invalidation on all command queues. Like real HW qemu needs to fence
>>> its walks with any concurrent invalidate & sync to ensure it doesn't
>>> walk into a UAF situation.
>> But at the moment we do trap IOTLB invalidates so logically we can still
>> do the translate in that config. The problem you describe will show up
>> with vCMDQ which is not part of this series.
>>> Since we can't trap or mediate vCMDQ the walking simply cannot be
>>> done.
>>>
>>> Thus, the general principle of the HW accelerated vSMMU is that it
>>> NEVER walks any of these guest tables for any reason.
>>>
>>> Thus, we cannot do anything with vMSI address beyond program it
>>> directly into a real PCI device so it undergoes real HW translation.
>> But anyway you need to provide KVM a valid info about the guest doorbell
>> for this latter to setup irqfd gsi routing and also program ITS
>> translation tables. At the moment we have a single vITS in qemu so maybe
>> we can cheat.
> I have tried to address the “translate” issue below. This introduces a new
> get_msi_address() callback to retrieve the MSI doorbell address directly
> from the vIOMMU, so we can drop the existing get_msi_address_space() logic.
> Please take a look and let me know your thoughts.
>
> Thanks,
> Shameer
>
> ---
> hw/arm/smmuv3-accel.c | 10 ++++++++++
> hw/arm/smmuv3.c | 1 +
> hw/arm/virt.c | 4 ++++
> hw/pci/pci.c | 17 +++++++++++++++++
> include/hw/arm/smmuv3.h | 1 +
> include/hw/pci/pci.h | 15 +++++++++++++++
> target/arm/kvm.c | 14 ++++++++++++--
> 7 files changed, 60 insertions(+), 2 deletions(-)
>
> diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> index e6c81c4786..8b2a45a915 100644
> --- a/hw/arm/smmuv3-accel.c
> +++ b/hw/arm/smmuv3-accel.c
> @@ -667,6 +667,15 @@ static void smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
> }
> }
>
> +static uint64_t smmuv3_accel_get_msi_address(PCIBus *bus, void *opaque,
> + int devfn)
> +{
> + SMMUState *bs = opaque;
> + SMMUv3State *s = ARM_SMMUV3(bs);
> +
> + g_assert(s->msi_doorbell);
> + return s->msi_doorbell;
> +}
> static AddressSpace *smmuv3_accel_get_msi_as(PCIBus *bus, void *opaque,
> int devfn)
> {
> @@ -788,6 +797,7 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
> .set_iommu_device = smmuv3_accel_set_iommu_device,
> .unset_iommu_device = smmuv3_accel_unset_iommu_device,
> .get_msi_address_space = smmuv3_accel_get_msi_as,
to be removed then
> + .get_msi_address = smmuv3_accel_get_msi_address,
> };
>
> void smmuv3_accel_idr_override(SMMUv3State *s)
> diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> index 43d297698b..3f2ee8bcce 100644
> --- a/hw/arm/smmuv3.c
> +++ b/hw/arm/smmuv3.c
> @@ -2120,6 +2120,7 @@ static const Property smmuv3_properties[] = {
> DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
> DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
> DEFINE_PROP_BOOL("pasid", SMMUv3State, pasid, false),
> + DEFINE_PROP_UINT64("msi-doorbell", SMMUv3State, msi_doorbell, 0),
> };
>
> static void smmuv3_instance_init(Object *obj)
> diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> index 2498e3beff..d2dcb89235 100644
> --- a/hw/arm/virt.c
> +++ b/hw/arm/virt.c
> @@ -3097,6 +3097,8 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
>
> create_smmuv3_dev_dtb(vms, dev, bus, errp);
> if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
> + hwaddr db_start = base_memmap[VIRT_GIC_ITS].base +
> + ITS_TRANS_SIZE + GITS_TRANSLATER;
there are still use cases where you can target the GICv2M doorbell, so at
least you would need to add some logic to switch between both
> char *stage;
> stage = object_property_get_str(OBJECT(dev), "stage",
> &error_fatal);
> @@ -3107,6 +3109,8 @@ static void virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
> return;
> }
> vms->pci_preserve_config = true;
> + object_property_set_uint(OBJECT(dev), "msi-doorbell", db_start,
> + &error_abort);
> }
> }
> }
> diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> index 1edd711247..45e79a3c23 100644
> --- a/hw/pci/pci.c
> +++ b/hw/pci/pci.c
> @@ -2982,6 +2982,23 @@ AddressSpace *pci_device_iommu_address_space(PCIDevice *dev)
> return &address_space_memory;
> }
>
> +bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr *out_doorbell)
> +{
> + PCIBus *bus;
> + PCIBus *iommu_bus;
> + int devfn;
> +
> + pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
> + if (iommu_bus) {
> + if (iommu_bus->iommu_ops->get_msi_address) {
> + *out_doorbell = iommu_bus->iommu_ops->get_msi_address(bus,
> + iommu_bus->iommu_opaque, devfn);
> + return true;
> + }
> + }
> + return false;
> +}
> +
> AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
> {
> PCIBus *bus;
> diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> index ee0b5ed74f..f50d8c72bd 100644
> --- a/include/hw/arm/smmuv3.h
> +++ b/include/hw/arm/smmuv3.h
> @@ -72,6 +72,7 @@ struct SMMUv3State {
> bool ats;
> uint8_t oas;
> bool pasid;
> + uint64_t msi_doorbell;
> };
>
> typedef enum {
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index b731443c67..e1709b0bfe 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -679,6 +679,20 @@ typedef struct PCIIOMMUOps {
> */
> AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
> int devfn);
> + /**
> + * @get_msi_address: get the address of MSI doorbell for the device
(gpa) address
> + * on a PCI bus.
> + *
> + * Optional callback, if implemented must return a valid MSI doorbell
> + * address.
> + *
> + * @bus: the #PCIBus being accessed.
> + *
> + * @opaque: the data passed to pci_setup_iommu().
> + *
> + * @devfn: device and function number
> + */
> + uint64_t (*get_msi_address)(PCIBus *bus, void *opaque, int devfn);
> } PCIIOMMUOps;
>
> bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus **piommu_bus,
> @@ -688,6 +702,7 @@ bool pci_device_set_iommu_device(PCIDevice *dev, HostIOMMUDevice *hiod,
> Error **errp);
> void pci_device_unset_iommu_device(PCIDevice *dev);
> AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
> +bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr *out_doorbell);
>
> /**
> * pci_device_get_viommu_flags: get vIOMMU flags.
> diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> index 0df41128d0..8d4d2be0bc 100644
> --- a/target/arm/kvm.c
> +++ b/target/arm/kvm.c
> @@ -1611,35 +1611,45 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq, int level)
> int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
> uint64_t address, uint32_t data, PCIDevice *dev)
> {
> - AddressSpace *as = pci_device_iommu_msi_address_space(dev);
> + AddressSpace *as;
> hwaddr xlat, len, doorbell_gpa;
> MemoryRegionSection mrs;
> MemoryRegion *mr;
>
> + /* Check if there is a direct msi address available */
> + if (pci_device_iommu_msi_direct_address(dev, &doorbell_gpa)) {
> + goto set_doorbell;
> + }
> +
> + as = pci_device_iommu_msi_address_space(dev);
logically this should be after the test below (i.e. meaning we have an
IOMMU). But this means that you would use an AS which is not
address_space_memory.
This works but it is not neat either, because it totally ignores the
@address. So you have to build a solid commit msg to explain to readers
why this is needed ;-)
> if (as == &address_space_memory) {
> return 0;
> }
>
> /* MSI doorbell address is translated by an IOMMU */
>
> - RCU_READ_LOCK_GUARD();
> + rcu_read_lock();
>
> mr = address_space_translate(as, address, &xlat, &len, true,
> MEMTXATTRS_UNSPECIFIED);
>
> if (!mr) {
> + rcu_read_unlock();
> return 1;
> }
>
> mrs = memory_region_find(mr, xlat, 1);
>
> if (!mrs.mr) {
> + rcu_read_unlock();
> return 1;
> }
>
> doorbell_gpa = mrs.offset_within_address_space;
> memory_region_unref(mrs.mr);
> + rcu_read_unlock();
>
> +set_doorbell:
> route->u.msi.address_lo = doorbell_gpa;
> route->u.msi.address_hi = doorbell_gpa >> 32;
>
> --
>
>
>
>
>
>
Thanks
Eric
Hi Eric,
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 06 November 2025 17:05
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
[...]
>
> > I have tried to address the “translate” issue below. This introduces a new
> > get_msi_address() callback to retrieve the MSI doorbell address directly
> > from the vIOMMU, so we can drop the existing get_msi_address_space()
> logic.
> > Please take a look and let me know your thoughts.
> >
> > Thanks,
> > Shameer
> >
> > ---
> > hw/arm/smmuv3-accel.c | 10 ++++++++++
> > hw/arm/smmuv3.c | 1 +
> > hw/arm/virt.c | 4 ++++
> > hw/pci/pci.c | 17 +++++++++++++++++
> > include/hw/arm/smmuv3.h | 1 +
> > include/hw/pci/pci.h | 15 +++++++++++++++
> > target/arm/kvm.c | 14 ++++++++++++--
> > 7 files changed, 60 insertions(+), 2 deletions(-)
> >
> > diff --git a/hw/arm/smmuv3-accel.c b/hw/arm/smmuv3-accel.c
> > index e6c81c4786..8b2a45a915 100644
> > --- a/hw/arm/smmuv3-accel.c
> > +++ b/hw/arm/smmuv3-accel.c
> > @@ -667,6 +667,15 @@ static void
> smmuv3_accel_unset_iommu_device(PCIBus *bus, void *opaque,
> > }
> > }
> >
> > +static uint64_t smmuv3_accel_get_msi_address(PCIBus *bus, void
> *opaque,
> > + int devfn)
> > +{
> > + SMMUState *bs = opaque;
> > + SMMUv3State *s = ARM_SMMUV3(bs);
> > +
> > + g_assert(s->msi_doorbell);
> > + return s->msi_doorbell;
> > +}
> > static AddressSpace *smmuv3_accel_get_msi_as(PCIBus *bus, void
> *opaque,
> > int devfn)
> > {
> > @@ -788,6 +797,7 @@ static const PCIIOMMUOps smmuv3_accel_ops = {
> > .set_iommu_device = smmuv3_accel_set_iommu_device,
> > .unset_iommu_device = smmuv3_accel_unset_iommu_device,
> > .get_msi_address_space = smmuv3_accel_get_msi_as,
> to be removed then
Yes, of course. Will drop that.
> > + .get_msi_address = smmuv3_accel_get_msi_address,
> > };
> >
> > void smmuv3_accel_idr_override(SMMUv3State *s)
> > diff --git a/hw/arm/smmuv3.c b/hw/arm/smmuv3.c
> > index 43d297698b..3f2ee8bcce 100644
> > --- a/hw/arm/smmuv3.c
> > +++ b/hw/arm/smmuv3.c
> > @@ -2120,6 +2120,7 @@ static const Property smmuv3_properties[] = {
> > DEFINE_PROP_BOOL("ats", SMMUv3State, ats, false),
> > DEFINE_PROP_UINT8("oas", SMMUv3State, oas, 44),
> > DEFINE_PROP_BOOL("pasid", SMMUv3State, pasid, false),
> > + DEFINE_PROP_UINT64("msi-doorbell", SMMUv3State, msi_doorbell, 0),
> > };
> >
> > static void smmuv3_instance_init(Object *obj)
> > diff --git a/hw/arm/virt.c b/hw/arm/virt.c
> > index 2498e3beff..d2dcb89235 100644
> > --- a/hw/arm/virt.c
> > +++ b/hw/arm/virt.c
> > @@ -3097,6 +3097,8 @@ static void
> virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
> >
> > create_smmuv3_dev_dtb(vms, dev, bus, errp);
> > if (object_property_get_bool(OBJECT(dev), "accel", &error_abort)) {
> > + hwaddr db_start = base_memmap[VIRT_GIC_ITS].base +
> > + ITS_TRANS_SIZE + GITS_TRANSLATER;
> there are still use cases where you could target the GICv2M doorbell, so at
> least you would need to add some logic to switch between both
But with KVM, virt doesn't support GICv2, right?
That reminds me that we should probably add a check that KVM is enabled
for the SMMUv3 accel=on case.
> > char *stage;
> > stage = object_property_get_str(OBJECT(dev), "stage",
> > &error_fatal);
> > @@ -3107,6 +3109,8 @@ static void
> virt_machine_device_plug_cb(HotplugHandler *hotplug_dev,
> > return;
> > }
> > vms->pci_preserve_config = true;
> > + object_property_set_uint(OBJECT(dev), "msi-doorbell", db_start,
> > + &error_abort);
> > }
> > }
> > }
> > diff --git a/hw/pci/pci.c b/hw/pci/pci.c
> > index 1edd711247..45e79a3c23 100644
> > --- a/hw/pci/pci.c
> > +++ b/hw/pci/pci.c
> > @@ -2982,6 +2982,23 @@ AddressSpace
> *pci_device_iommu_address_space(PCIDevice *dev)
> > return &address_space_memory;
> > }
> >
> > +bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr
> *out_doorbell)
> > +{
> > + PCIBus *bus;
> > + PCIBus *iommu_bus;
> > + int devfn;
> > +
> > + pci_device_get_iommu_bus_devfn(dev, &iommu_bus, &bus, &devfn);
> > + if (iommu_bus) {
> > + if (iommu_bus->iommu_ops->get_msi_address) {
> > + *out_doorbell = iommu_bus->iommu_ops->get_msi_address(bus,
> > + iommu_bus->iommu_opaque, devfn);
> > + return true;
> > + }
> > + }
> > + return false;
> > +}
> > +
> > AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev)
> > {
> > PCIBus *bus;
> > diff --git a/include/hw/arm/smmuv3.h b/include/hw/arm/smmuv3.h
> > index ee0b5ed74f..f50d8c72bd 100644
> > --- a/include/hw/arm/smmuv3.h
> > +++ b/include/hw/arm/smmuv3.h
> > @@ -72,6 +72,7 @@ struct SMMUv3State {
> > bool ats;
> > uint8_t oas;
> > bool pasid;
> > + uint64_t msi_doorbell;
> > };
> >
> > typedef enum {
> > diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> > index b731443c67..e1709b0bfe 100644
> > --- a/include/hw/pci/pci.h
> > +++ b/include/hw/pci/pci.h
> > @@ -679,6 +679,20 @@ typedef struct PCIIOMMUOps {
> > */
> > AddressSpace * (*get_msi_address_space)(PCIBus *bus, void *opaque,
> > int devfn);
> > + /**
> > + * @get_msi_address: get the address of MSI doorbell for the device
> (gpa) address
> > + * on a PCI bus.
> > + *
> > + * Optional callback, if implemented must return a valid MSI doorbell
> > + * address.
> > + *
> > + * @bus: the #PCIBus being accessed.
> > + *
> > + * @opaque: the data passed to pci_setup_iommu().
> > + *
> > + * @devfn: device and function number
> > + */
> > + uint64_t (*get_msi_address)(PCIBus *bus, void *opaque, int devfn);
> > } PCIIOMMUOps;
> >
> > bool pci_device_get_iommu_bus_devfn(PCIDevice *dev, PCIBus
> **piommu_bus,
> > @@ -688,6 +702,7 @@ bool pci_device_set_iommu_device(PCIDevice *dev,
> HostIOMMUDevice *hiod,
> > Error **errp);
> > void pci_device_unset_iommu_device(PCIDevice *dev);
> > AddressSpace *pci_device_iommu_msi_address_space(PCIDevice *dev);
> > +bool pci_device_iommu_msi_direct_address(PCIDevice *dev, hwaddr
> *out_doorbell);
> >
> > /**
> > * pci_device_get_viommu_flags: get vIOMMU flags.
> > diff --git a/target/arm/kvm.c b/target/arm/kvm.c
> > index 0df41128d0..8d4d2be0bc 100644
> > --- a/target/arm/kvm.c
> > +++ b/target/arm/kvm.c
> > @@ -1611,35 +1611,45 @@ int kvm_arm_set_irq(int cpu, int irqtype, int irq,
> int level)
> > int kvm_arch_fixup_msi_route(struct kvm_irq_routing_entry *route,
> > uint64_t address, uint32_t data, PCIDevice *dev)
> > {
> > - AddressSpace *as = pci_device_iommu_msi_address_space(dev);
> > + AddressSpace *as;
> > hwaddr xlat, len, doorbell_gpa;
> > MemoryRegionSection mrs;
> > MemoryRegion *mr;
> >
> > + /* Check if there is a direct msi address available */
> > + if (pci_device_iommu_msi_direct_address(dev, &doorbell_gpa)) {
> > + goto set_doorbell;
> > + }
> > +
> > + as = pci_device_iommu_msi_address_space(dev);
> Logically this should come after the test below (i.e. once we know we have
> an IOMMU). But that means you would be using an AS which is not
> address_space_memory.
Ok. I will move it then.
>
> This works, but it is not neat either because it totally ignores the
> @address. So you will have to write a solid commit msg to explain to readers
> why this is needed ;-)
Sure. I will try to do a solid one explaining why we don’t need @address for
this path😊.
Thanks,
Shameer
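
As an aside to Eric's GICv2M remark above, here is a minimal illustrative
sketch (not part of the posted series) of how the doorbell base could be
selected depending on whether the machine instantiates an ITS. vms->its,
base_memmap and the VIRT_GIC_* indices exist in hw/arm/virt.c, while
V2M_MSI_SETSPI_NS_OFFSET is an assumed name for the GICv2M doorbell
register offset:

/*
 * Illustrative sketch only, not posted code: pick the MSI doorbell base
 * depending on whether the virt machine uses an ITS or a GICv2M.
 */
#define V2M_MSI_SETSPI_NS_OFFSET 0x40    /* assumed doorbell offset */

static hwaddr virt_msi_doorbell(VirtMachineState *vms)
{
    if (vms->its) {
        return base_memmap[VIRT_GIC_ITS].base + ITS_TRANS_SIZE +
               GITS_TRANSLATER;
    }
    return base_memmap[VIRT_GIC_V2M].base + V2M_MSI_SETSPI_NS_OFFSET;
}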
On Wed, Nov 05, 2025 at 02:58:16PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 05, 2025 at 10:33:08AM -0800, Nicolin Chen wrote:
> > On Wed, Nov 05, 2025 at 02:10:49PM -0400, Jason Gunthorpe wrote:
> > > On Wed, Nov 05, 2025 at 06:25:05PM +0100, Eric Auger wrote:
> > > > if the guest doorbell address is wrong because not properly translated,
> > > > vgic_msi_to_its() will fail to identify the ITS to inject the MSI in.
> > > > See kernel kvm/vgic/vgic-its.c vgic_msi_to_its and
> > > > vgic_its_inject_msi
> > >
> > > Which has been exactly my point to Nicolin. There is no way to
> > > "properly translate" the vMSI address in a HW accelerated SMMU
> > > emulation.
> >
> > Hmm, I still can't connect the dots here. QEMU knows where the
> > guest CD table is to get the stage-1 translation table to walk
> > through. We could choose to not let it walk through. Yet, why?
>
> You cannot walk any tables in guest memory without fully trapping all
> invalidation on all command queues. Like real HW qemu needs to fence
> its walks with any concurrent invalidate & sync to ensure it doesn't
> walk into a UAF situation.
>
> Since we can't trap or mediate vCMDQ the walking simply cannot be
> done.
>
> Thus, the general principle of the HW accelerated vSMMU is that it
> NEVER walks any of these guest tables for any reason.
>
> Thus, we cannot do anything with vMSI address beyond program it
> directly into a real PCI device so it undergoes real HW translation.

It's clear to me now. Thanks for the elaboration!

Nicolin
On 11/4/25 4:14 PM, Shameer Kolothum wrote:
> [...]
>> On the other hand, as we discussed on v4 by returning system as you
>> pretend there is no translation in place which is not true. Now we use
>> an alias for it but it has not really removed its usage. Also it forces
>> use to hack around the MSI mapping and introduce new PCIIOMMUOps.
>> Have you assessed the feasability of using vfio_prereg_listener to force
>> the S2 mapping. Is it simply not relevant anymore or could it be used
>> also with the iommufd be integration? Eric
> IIUC, the prereg_listener mechanism just enables us to setup the s2
> mappings. For MSI, In your version, I see that smmu_find_add_as()
> always returns IOMMU as. How is that supposed to work if the Guest
> has s1 bypass mode STE for the device?

I need to delve into it again as I forgot the details. Will come back to
you ...

Eric
> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 04 November 2025 16:02
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
> [...]
> > IIUC, the prereg_listener mechanism just enables us to setup the s2
> > mappings. For MSI, In your version, I see that smmu_find_add_as()
> > always returns IOMMU as. How is that supposed to work if the Guest has
> > s1 bypass mode STE for the device?
>
> I need to delve into it again as I forgot the details. Will come back to you ...

I think the BYPASS case will work anyway as in smmuv3_translate() fn we are
checking the ste config (SMMU_TRANS_BYPASS) and it will just return the
same address back. So we can do the same here in get_msi_address_space()
and return IOMMU as always. And that completely avoids
&address_space_memory from SMMUv3-accel if that’s the concern.

Thanks,
Shameer
On Tue, Nov 04, 2025 at 05:01:57PM +0100, Eric Auger wrote:
> >>>> On 10/31/25 11:49 AM, Shameer Kolothum wrote:
> >>>>> On ARM, devices behind an IOMMU have their MSI doorbell addresses
> >>>>> translated by the IOMMU. In nested mode, this translation happens in
> >>>>> two stages (gIOVA → gPA → ITS page).
> >>>>>
> >>>>> In accelerated SMMUv3 mode, both stages are handled by hardware, so
> >>>>> get_address_space() returns the system address space so that VFIO
> >>>>> can setup stage-2 mappings for system address space.
> >>>> Sorry but I still don't catch the above. Can you explain (most probably
> >>>> again) why this is a requirement to return the system as so that VFIO
> >>>> can setup stage-2 mappings for system address space. I am sorry for
> >>>> insisting (at the risk of being stubborn or dumb) but I fail to
> >>>> understand the requirement. As far as I remember the way I integrated it
> >>>> at the old times did not require that change:
> >>>> https://lore.kernel.org/all/20210411120912.15770-1-
> >>>> eric.auger@redhat.com/
> >>>> I used a vfio_prereg_listener to force the S2 mapping.
> >>> Yes I remember that.
> >>>
> >>>> What has changed that forces us now to have this gym
> >>> This approach achieves the same outcome, but through a
> >>> different mechanism. Returning the system address space
> >>> here ensures that VFIO sets up the Stage-2 mappings for
> >>> devices behind the accelerated SMMUv3.
> >>>
> >>> I think, this makes sense because, in the accelerated case, the
> >>> device is no longer managed by QEMU’s SMMUv3 model. The
> >> On the other hand, as we discussed on v4 by returning system as you
> >> pretend there is no translation in place which is not true. Now we use
> >> an alias for it but it has not really removed its usage. Also it forces
> >> use to hack around the MSI mapping and introduce new PCIIOMMUOps.
> >> Have
> >> you assessed the feasability of using vfio_prereg_listener to force the
> >> S2 mapping. Is it simply not relevant anymore or could it be used also
> >> with the iommufd be integration? Eric
> > IIUC, the prereg_listener mechanism just enables us to setup the s2
> > mappings. For MSI, In your version, I see that smmu_find_add_as()
> > always returns IOMMU as. How is that supposed to work if the Guest
> > has s1 bypass mode STE for the device?
>
> I need to delve into it again as I forgot the details. Will come back to
> you ...
We aligned with Intel previously about this system address space.
You might know these very well, yet here are the breakdowns:
1. VFIO core has a container that manages an HWPT. By default, it
allocates a stage-1 normal HWPT, unless vIOMMU requests for a
nesting parent HWPT for accelerated cases.
2. VFIO core adds a listener for that HWPT and sets up a handler
vfio_container_region_add() where it checks the memory region
whether it is iommu or not.
a. In case of !IOMMU as (i.e. system address space), it treats
the address space as a RAM region, and handles all stage-2
mappings for the core allocated nesting parent HWPT.
b. In case of IOMMU as (i.e. a translation type) it sets up
the IOTLB notifier and translation replay while bypassing
the listener for RAM region.
In an accelerated case, we need stage-2 mappings to match with the
nesting parent HWPT. So, returning system address space or an alias
of that notifies the vfio core to take the 2.a path.
If we take 2.b path by returning IOMMU as in smmu_find_add_as, the
VFIO core would no longer listen to the RAM region for us, i.e. no
stage-2 HWPT nor mappings. vIOMMU would have to allocate a nesting
parent and manage the stage-2 mappings by adding a listener in its
own code, which is largely duplicated with the core code.
-------------- so far this works for Intel and ARM--------------
3. On ARM, vPCI device is programmed with gIOVA, so KVM has to
follow what the vPCI is told to inject vIRQs. This requires
a translation at the nested stage-1 address space. Note that
vSMMU in this case doesn't manage translation as it doesn't
need to. But there is no other sane way for KVM to know the
vITS page corresponding to the given gIOVA. So, we invented
the get_msi_address_space op.
(3) makes sense because there is a complication in the MSI that
does a 2-stage translation on ARM and KVM must follow the stage-1
input address, leaving us no choice to have two address spaces.
Thanks
Nicolin
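
To make the 2.a/2.b split above concrete, here is a minimal sketch
(hypothetical names, not the posted code) of a get_address_space() callback
choosing between the two paths; returning the system address space steers
VFIO onto the 2.a path (stage-2 mappings for the nesting-parent HWPT), while
returning an IOMMU address space steers it onto the 2.b path (IOTLB
notifiers and replay):

static AddressSpace *sketch_get_address_space(PCIBus *bus, void *opaque,
                                              int devfn)
{
    SketchIOMMUState *s = opaque;          /* hypothetical vIOMMU state */

    if (s->accel) {
        /* HW-accelerated nesting: let VFIO map guest RAM at stage 2 (2.a) */
        return &address_space_memory;
    }
    /* Emulated translation: per-device IOMMU address space (2.b) */
    return sketch_find_add_as(s, bus, devfn);   /* hypothetical helper */
}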
Hi Nicolin,

On 11/4/25 6:47 PM, Nicolin Chen wrote:
> [...]
> 1. VFIO core has a container that manages an HWPT. By default, it
> allocates a stage-1 normal HWPT, unless vIOMMU requests for a

You may want to specify that this stage-1 normal HWPT is used to map GPA to
HPA (so it eventually implements stage 2).

> nesting parent HWPT for accelerated cases.
> [...]
> b. In case of IOMMU as (i.e. a translation type) it sets up
> the IOTLB notifier and translation replay while bypassing
> the listener for RAM region.

yes, S1+S2 are combined through vfio_iommu_map_notify()

> [...]
> If we take 2.b path by returning IOMMU as in smmu_find_add_as, the
> VFIO core would no longer listen to the RAM region for us, i.e. no
> stage-2 HWPT nor mappings. vIOMMU would have to allocate a nesting

except if you change the VFIO common.c as I did in the past to force the S2
mapping in the nested config. See
https://lore.kernel.org/all/20210411120912.15770-16-eric.auger@redhat.com/
and vfio_prereg_listener()

Again I do not say this is the right way to do it, but using the system
address space is not the "only" implementation choice I think, and it needs
to be properly justified, especially as it has at least 2 side effects:
- somehow abusing the semantic of the returned address space, pretending
  there is no IOMMU translation in place, and
- also impacting the way MSIs are handled (introduction of a new
  PCIIOMMUOps).

This kind of explanation you wrote is absolutely needed in the commit
msg for reviewers to understand the design choice I think.

Eric

> parent and manage the stage-2 mappings by adding a listener in its
> own code, which is largely duplicated with the core code.
> [...]
Hi Eric,

On Wed, Nov 05, 2025 at 08:47:56AM +0100, Eric Auger wrote:
> > 1. VFIO core has a container that manages an HWPT. By default, it
> > allocates a stage-1 normal HWPT, unless vIOMMU requests for a
> You may want to specify that this stage-1 normal HWPT is used to map GPA to
> HPA (so it eventually implements stage 2).

Functional-wise, that would work. But not as clean as creating an S2
nesting-parent HWPT from the beginning, right?

> yes, S1+S2 are combined through vfio_iommu_map_notify()

But that map/unmap notifier is useless in the accelerated mode: we don't
need that emulated-mode translation code (MSI is likely to bypass
translation as well), and we don't need the emulated IOTLB either, since
there is no page table walk-through.

Also, S1 and S2 are separated following the iommufd design. In this regard,
letting the core manage the S2 hwpt and mappings while the vIOMMU handles
the S1 hwpt allocation/attach/invalidation can look much cleaner.

> except if you change the VFIO common.c as I did in the past to force the S2
> mapping in the nested config. See
> https://lore.kernel.org/all/20210411120912.15770-16-eric.auger@redhat.com/
> and vfio_prereg_listener()

Yea, I remember that. But that's somewhat duplicated IMHO. The VFIO core
already registers a listener on guest RAM for the system address space.
Having another set of vfio_prereg_listener does not feel optimal.

> Again I do not say this is the right way to do it, but using the system
> address space is not the "only" implementation choice I think

Oh, neither do I mean that's the "only" way. Sorry I did not make this
clear. I had studied your vfio_prereg_listener approach and Intel's approach
using the system address space, and concluded this "cleaner" way works for
both architectures.

> and it needs to be properly justified, especially as it has at least
> 2 side effects:
> - somehow abusing the semantic of the returned address space, pretending
>   there is no IOMMU translation in place, and

Perhaps we shall say "there is no emulated translation" :)

> - also impacting the way MSIs are handled (introduction of a new
>   PCIIOMMUOps).

That is a solid point. Yet I think it's less confusing now per Jason's
remarks -- we will bypass the translation pathway for MSI in accelerated
mode.

> This kind of explanation you wrote is absolutely needed in the commit
> msg for reviewers to understand the design choice I think.

Sure. My bad that I didn't explain it well in the first place.

Thanks
Nicolin
On Tue, Nov 04, 2025 at 03:11:55PM +0100, Eric Auger wrote:
> > However, QEMU/KVM also calls this callback when resolving
> > MSI doorbells:
> >
> > kvm_irqchip_add_msi_route()
> >  kvm_arch_fixup_msi_route()
> >   pci_device_iommu_address_space()
> >    get_address_space()
> >
> > VFIO device in the guest with a SMMUv3 is programmed with a gIOVA for
> > MSI doorbell. This gIOVA can't be used to setup the MSI doorbell
> > directly. This needs to be translated to vITS gPA. In order to do the
> > doorbell transalation it needs IOMMU address space.

Why does qemu do anything with the msi address? It is opaque and qemu
cannot determine anything meaningful from it. I expect it to ignore it?

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: 04 November 2025 14:21
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
> [...]
>
> Why does qemu do anything with the msi address? It is opaque and qemu
> cannot determine anything meaningful from it. I expect it to ignore it?

I am afraid not. Guest MSI table write gets trapped and it then configures the
doorbell (this is where this patch comes in handy) and sets up the KVM
routing etc.

Thanks,
Shameer
On Tue, Nov 04, 2025 at 02:42:57PM +0000, Shameer Kolothum wrote:
> I am afraid not. Guest MSI table write gets trapped and it then configures the
> doorbell (this is where this patch comes in handy) and sets up the KVM
> routing etc.

Sure it is trapped, but nothing should be looking at the MSI address
from the guest, it is meaningless and wrong information. Just ignore
it.

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: 04 November 2025 14:52
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
> [...]
>
> Sure it is trapped, but nothing should be looking at the MSI address
> from the guest, it is meaningless and wrong information. Just ignore
> it.

Hmm.. we need to setup the doorbell address correctly. If we don't do the
translation here, it will use the Guest IOVA address. Remember, we are using
the IORT RMR identity mapping to get MSI working. See this discussion here,
https://lore.kernel.org/qemu-devel/CH3PR12MB754810AE8D308630041F9AFEABF2A@CH3PR12MB7548.namprd12.prod.outlook.com/

Thanks,
Shameer
On Tue, Nov 04, 2025 at 02:58:44PM +0000, Shameer Kolothum wrote:
> Hmm.. we need to setup the doorbell address correctly. If we don't do the
> translation here, it will use the Guest IOVA address. Remember, we are using
> the IORT RMR identity mapping to get MSI working.

Either you use the RMR value, which is forced by the kernel into the
physical MSI through iommufd and kernel ignores anything qemu
does. So fully ignore the guest's vMSI address.

Eventually qemu should transfer the unchanged guest vMSI address
directly to the kernel, but we haven't figured that out yet.

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: 04 November 2025 15:13
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
> [...]
>
> Either you use the RMR value, which is forced by the kernel into the
> physical MSI through iommufd and kernel ignores anything qemu
> does. So fully ignore the guest's vMSI address.

Well, we are sort of trying to do the same through this patch here.
But to avoid a "translation" completely it will involve some changes to
Qemu pci subsystem. I think this is the least intrusive path I can think
of now. And this is a one time setup mostly.

Thanks,
Shameer
On Tue, Nov 04, 2025 at 03:20:59PM +0000, Shameer Kolothum wrote:
> Well, we are sort of trying to do the same through this patch here.
> But to avoid a "translation" completely it will involve some changes to
> Qemu pci subsystem. I think this is the least intrusive path I can think
> of now. And this is a one time setup mostly.

Should be explained in the commit message that the translation is
pointless. I'm not sure about this, any translation seems risky
because it could fail. The guest can use any IOVA for MSI and none may
fail.

Jason
On Tue, Nov 04, 2025 at 11:35:35AM -0400, Jason Gunthorpe wrote:
> Should be explained in the commit message that the translation is
> pointless. I'm not sure about this, any translation seems risky
> because it could fail. The guest can use any IOVA for MSI and none may
> fail.

In the current design of KVM in QEMU, it does a generic translation
from gIOVA->gPA for the doorbell location to inject IRQ, whether VM
has an accelerated IOMMU or an emulated IOMMU.

In the accelerated case, this translation is pointless for the SMMU HW
underlying. But the IRQ injection routine still stands. We could have
invented something like get_msi_physical_address, but the vPCI device
is programmed with gIOVA for MSI. So it makes sense for VMM to follow
that gIOVA?

Even if the gIOVA is a wrong address, I think VMM shouldn't correct
that, since a real HW wouldn't.

Thanks
Nicolin
On Tue, Nov 04, 2025 at 09:11:55AM -0800, Nicolin Chen wrote:
> In the current design of KVM in QEMU, it does a generic translation
> from gIOVA->gPA for the doorbell location to inject IRQ, whether VM
> has an accelerated IOMMU or an emulated IOMMU.

And what happens if the translation fails because there is no mapping?
It should be ignored for this case and not ignored for others.

Jason
On 11/4/25 6:41 PM, Jason Gunthorpe wrote:
> On Tue, Nov 04, 2025 at 09:11:55AM -0800, Nicolin Chen wrote:
>> [...]
>>> Should be explained in the commit message that the translation is
>>> pointless. I'm not sure about this, any translation seems risky
>>> because it could fail. The guest can use any IOVA for MSI and none may
>>> fail.

in general the translation is not pointless (I mean when RMR are not
applied). In case a vhost device (virtio-net) for instance is protected by
SMMU, vhost triggers irqfds upon which a gsi is injected in the vgic. The
latter does the irq_routing mapping and this gsi is associated to an MSI
address/data. If the MSI address is wrong, ie. not corresponding to the
vITS gpa doorbell, kernel kvm/vgic/vgic-its.c vgic_its_trigger_msi will
fail to inject the MSI in the guest since
vgic_msi_to_its/__vgic_doorbell_to_its will fail to find the ITS instance
to inject in.

Thanks
Eric
On Tue, Nov 04, 2025 at 01:41:52PM -0400, Jason Gunthorpe wrote:
> And what happens if the translation fails because there is no mapping?
> It should be ignored for this case and not ignored for others.

It errors out and does no injection. IOW, yea, "ignored".

Nicolin
On Tue, Nov 04, 2025 at 09:57:53AM -0800, Nicolin Chen wrote:
> It errors out and does no injection. IOW, yea, "ignored".

"does no injection" does not sound like ignored to me..

Jason
On Tue, Nov 04, 2025 at 02:09:28PM -0400, Jason Gunthorpe wrote:
> > It errors out and does no injection. IOW, yea, "ignored".
>
> "does no injection" does not sound like ignored to me..

Sorry. I think I've missed your point.

The hardware path is programmed with a RMR-ed sw_msi in the host
via VFIO's PCI IRQ, ignoring the gIOVA and vITS in the guest VM,
even if the vPCI is programmed with a wrong gIOVA that could not
be translated.

KVM would always get the IRQ from HW, since the HW is programmed
correctly. But if gIOVA->vITS is not mapped, i.e. gIOVA is given
incorrectly, it can't inject the IRQ. (Perhaps vSMMU in this case
should F_TRANSLATION to the device.)

What was the meaning of "ignore" in your remarks?

Thanks
Nicolin
On Tue, Nov 04, 2025 at 10:44:27AM -0800, Nicolin Chen wrote:
> The hardware path is programmed with a RMR-ed sw_msi in the host
> via VFIO's PCI IRQ, ignoring the gIOVA and vITS in the guest VM,
> even if the vPCI is programmed with a wrong gIOVA that could not
> be translated.

Yes

> KVM would always get the IRQ from HW, since the HW is programmed
> correctly. But if gIOVA->vITS is not mapped, i.e. gIOVA is given
> incorrectly, it can't inject the IRQ.

But this is a software interrupt, and I think it should still just
ignore vMSI's address and assume it is mapped to a legal ITS
page. There is just no way to validate it.

Even SW MSI shouldn't fail because the vMSI has some weird IOVA in it
that isn't mapped in the S2. That's wrong and is something the guest
is permitted to do.

Jason
On Tue, Nov 04, 2025 at 02:56:51PM -0400, Jason Gunthorpe wrote:
> But this is a software interrupt, and I think it should still just
> ignore vMSI's address and assume it is mapped to a legal ITS
> page. There is just no way to validate it.
>
> Even SW MSI shouldn't fail because the vMSI has some weird IOVA in it
> that isn't mapped in the S2. That's wrong and is something the guest
> is permitted to do.

Hmm, that feels like a self-correction? But in a baremetal case,
if HW is programmed with a weird IOVA, interrupt would not work,
right?

Thanks
Nicolin
On Tue, Nov 04, 2025 at 11:31:50AM -0800, Nicolin Chen wrote:
> Hmm, that feels like a self-correction? But in a baremetal case,
> if HW is programmed with a weird IOVA, interrupt would not work,
> right?

Right, but qemu has no way to duplicate that behavior unless it walks
the full s1 and s2 page tables, which we have said it isn't going to
do.

So it should probably just ignore this check and assume the IOVA is
set properly, exactly the same as if it was HW injected using the RMR.

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: 04 November 2025 19:35
> Subject: Re: [PATCH v5 15/32] hw/pci/pci: Introduce optional
> get_msi_address_space() callback
>
> Right, but qemu has no way to duplicate that behavior unless it walks
> the full s1 and s2 page tables, which we have said it isn't going to
> do.
>
> So it should probably just ignore this check and assume the IOVA is
> set properly, exactly the same as if it was HW injected using the RMR.

TBH, I am a bit lost here. Anyway, this is my understanding:

If we ignore and don't return the correct doorbell (gPA) here,
Qemu will end up invoking KVM_SET_GSI_ROUTING with wrong doorbell
which sets up the in-kernel vgic irq routing information. And when HW
raises the IRQ, KVM can't inject it properly.

Thanks,
Shameer
On Tue, Nov 04, 2025 at 07:46:46PM +0000, Shameer Kolothum wrote:
> If we ignore this and don't return the correct doorbell (gPA) here,
> QEMU will end up invoking KVM_SET_GSI_ROUTING with the wrong doorbell,
> which sets up the in-kernel vGIC IRQ routing information. And when HW
> raises the IRQ, KVM can't inject it properly.

That cannot be true. Again, there is no way for qemu to put something
meaningful into the 'struct kvm_irq_routing_msi' address_lo/hi. It
cannot walk the page tables, so it just ends up with some random,
meaningless guest IOVA.

Qemu MUST ignore the vMSI's address information.

So either the kernel ignores address_lo/hi, OR qemu should match the
vPCI device to its single vGIC and put in the kernel-expected
address_lo/hi always.

It should never, ever use the value from the guest once nesting is
enabled, and it should never be trying to translate the vMSI through
some S2, or any other, address space. Translation is OK for
non-nesting only.

Jason
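[For reference, the UAPI structure Jason refers to, as recalled from
<linux/kvm.h> (field comments added here, so double-check against the
kernel headers): for a GICv3 ITS, address_lo/hi have to name the
guest-physical GITS_TRANSLATER doorbell page and devid the device ID,
which is why an untranslated gIOVA in these fields is meaningless to
the in-kernel vGIC.]

/* From <linux/kvm.h>, consumed via KVM_SET_GSI_ROUTING */
struct kvm_irq_routing_msi {
    __u32 address_lo;   /* low 32 bits of the doorbell guest PA */
    __u32 address_hi;   /* high 32 bits of the doorbell guest PA */
    __u32 data;         /* MSI payload; the ITS event ID for GICv3 */
    union {
        __u32 pad;
        __u32 devid;    /* device ID, valid with KVM_MSI_VALID_DEVID */
    };
};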
On Tue, Nov 04, 2025 at 03:35:21PM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 04, 2025 at 11:31:50AM -0800, Nicolin Chen wrote:
> > On Tue, Nov 04, 2025 at 02:56:51PM -0400, Jason Gunthorpe wrote:
> > > On Tue, Nov 04, 2025 at 10:44:27AM -0800, Nicolin Chen wrote:
> > > > KVM would always get the IRQ from HW, since the HW is programmed
> > > > correctly. But if gIOVA->vITS is not mapped, i.e. gIOVA is given
> > > > incorrectly, it can't inject the IRQ.
> > >
> > > But this is a software interrupt, and I think it should still just
> > > ignore vMSI's address and assume it is mapped to a legal ITS
> > > page. There is just no way to validate it.
> > >
> > > Even SW MSI shouldn't fail because the vMSI has some weird IOVA in it
> > > that isn't mapped in the S2. That's wrong and is something the guest
> > > is permitted to do.
> >
> > Hmm, that feels like a self-correction? But in a baremetal case,
> > if HW is programmed with a weird IOVA, interrupt would not work,
> > right?
>
> Right, but qemu has no way to duplicate that behavior unless it walks
> the full s1 and s2 page tables, which we have said it isn't going to
> do.

I think it could.

The stage-1 page table is in the guest RAM. And vSMMU has already
implemented the logic to walk through a guest page table. What KVM
has already been doing today is to ask vSMMU to translate that.

What we haven't implemented today is, if gIOVA is a weird one that
isn't translatable, vSMMU should trigger an F_TRANSLATION event as
the real HW does.

> So it should probably just ignore this check and assume the IOVA is
> set properly, exactly the same as if it was HW injected using the RMR.

Hmm, I am not sure about that, especially considering our plan to
support the true 2-stage mapping: gIOVA->vITS->pITS :-/

Thanks
Nicolin
On Tue, Nov 04, 2025 at 11:43:07AM -0800, Nicolin Chen wrote:
> > Right, but qemu has no way to duplicate that behavior unless it walks
> > the full s1 and s2 page tables, which we have said it isn't going to
> > do.
>
> I think it could.
>
> The stage-1 page table is in the guest RAM. And vSMMU has already
> implemented the logic to walk through a guest page table. What KVM
> has already been doing today is to ask vSMMU to translate that.

No, we can't. The existing vsmmu code could do it because it mediated
the invalidation path. As soon as you have something like vcmdq the
hypervisor cannot walk the page tables.

> > So it should probably just ignore this check and assume the IOVA is
> > set properly, exactly the same as if it was HW injected using the RMR.
>
> Hmm, I am not sure about that, especially considering our plan to
> support the true 2-stage mapping: gIOVA->vITS->pITS :-/

In true mode the HW path will work perfectly and the SW path will
remain deficient in not checking for invalid configuration.

I don't see another sensible choice.

Jason
On Tue, Nov 04, 2025 at 03:45:52PM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 04, 2025 at 11:43:07AM -0800, Nicolin Chen wrote:
> > > Right, but qemu has no way to duplicate that behavior unless it walks
> > > the full s1 and s2 page tables, which we have said it isn't going to
> > > do.
> >
> > I think it could.
> >
> > The stage-1 page table is in the guest RAM. And vSMMU has already
> > implemented the logic to walk through a guest page table. What KVM
> > has already been doing today is to ask vSMMU to translate that.
>
> No, we can't. The existing vsmmu code could do it because it mediated
> the invalidation path. As soon as you have something like vcmdq the
> hypervisor cannot walk the page tables.

Hmm? It does walk through the page table (not invalidation path):
https://github.com/qemu/qemu/blob/master/hw/arm/smmu-common.c#L444

And VCMDQ can work with that. We've tested it.

Nicolin
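[The link above points at the common vSMMU page-table walk. The way
QEMU reaches it is through the generic IOMMU memory-region translate
hook: translating the doorbell through the vIOMMU address space, as in
the earlier sketch, lands in a callback of the shape below, which is
where the guest stage-1 walk and any F_TRANSLATION-style event for an
unmapped gIOVA would happen. This is a skeleton of the generic hook
only, not the SMMUv3 code itself, and the function name is made up.]

/* Skeleton of an IOMMUMemoryRegionClass translate hook (vIOMMU side). */
static IOMMUTLBEntry example_vsmmu_translate(IOMMUMemoryRegion *iommu_mr,
                                             hwaddr iova,
                                             IOMMUAccessFlags flag,
                                             int iommu_idx)
{
    IOMMUTLBEntry entry = {
        .target_as = &address_space_memory,   /* results are guest PAs */
        .iova = iova & ~0xfffULL,
        .translated_addr = 0,
        .addr_mask = 0xfff,
        .perm = IOMMU_NONE,
    };

    /*
     * A real vSMMU looks up the STE/CD for the stream here, walks the
     * guest stage-1 tables in guest RAM to fill translated_addr/perm,
     * and reports an event (e.g. F_TRANSLATION) if the IOVA is unmapped.
     */
    return entry;
}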