[v6] hw/arm/virt: Add support for user-creatable accelerated SMMUv3

[PATCH v6 32/33] vfio: Synthesize vPASID capability to VM

Posted by Shameer Kolothum 2 months, 3 weeks ago

From: Yi Liu <yi.l.liu@intel.com>

If the user wants to expose the PASID capability in the vIOMMU, VFIO also
needs to report the PASID cap for the device, provided the underlying
hardware supports it.

As a start, this places the vPASID cap in the last 8 bytes of the vconfig
space, in the hope that it does not conflict with any existing cap or hidden
registers. For devices that have hidden registers, the user should figure
out a proper offset for the vPASID cap. This may require an option for the
user to configure it; we leave that as a future extension. The mechanism
for finding a proper offset is discussed further here:

https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/

Since we add a check to ensure the vIOMMU supports PASID, only devices
under those vIOMMUs can synthesize the vPASID capability. This gives
users control over which devices expose vPASID.
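
As a rough illustration of the placement described above, the sketch below
computes the default offset and marks those bytes as emulated. It is a
standalone model, not the QEMU code: the two constants are assumed to match
the values QEMU and the Linux PCI headers use (0x1000 and 8), and
vpasid_default_offset() and mark_fully_emulated() are hypothetical helper
names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed values: 4 KiB PCIe config space, 8-byte PASID capability. */
#define PCIE_CONFIG_SPACE_SIZE   0x1000
#define PCI_EXT_CAP_PASID_SIZEOF 0x08

/* Hypothetical helper: offset of a capability placed in the last 8 bytes. */
static uint16_t vpasid_default_offset(void)
{
    return PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF;
}

/*
 * Hypothetical helper: set every bit in [off, off + len) of a per-byte
 * "emulated" mask, so accesses to that range are served from the emulated
 * copy rather than from the physical device.
 */
static void mark_fully_emulated(uint8_t *emulated_bits, uint16_t off,
                                uint16_t len)
{
    assert((uint32_t)off + len <= PCIE_CONFIG_SPACE_SIZE);
    memset(emulated_bits + off, 0xff, len);
}
```

With these assumptions the capability lands at offset 0xff8, and only the
bytes 0xff8 through 0xfff are flagged as emulated.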

Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/vfio/pci.c      | 38 ++++++++++++++++++++++++++++++++++++++
 include/hw/iommu.h |  1 +
 2 files changed, 39 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 8b8bc5a421..e11e39d667 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -24,6 +24,7 @@
 #include <sys/ioctl.h>
 
 #include "hw/hw.h"
+#include "hw/iommu.h"
 #include "hw/pci/msi.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pci_bridge.h"
@@ -2500,7 +2501,12 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
 
 static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
 {
+    HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
+    HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
     PCIDevice *pdev = PCI_DEVICE(vdev);
+    uint64_t max_pasid_log2 = 0;
+    bool pasid_cap_added = false;
+    uint64_t hw_caps;
     uint32_t header;
     uint16_t cap_id, next, size;
     uint8_t cap_ver;
@@ -2578,12 +2584,44 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
                 pcie_add_capability(pdev, cap_id, cap_ver, next, size);
             }
             break;
+        /*
+         * VFIO kernel does not expose the PASID CAP today. We may synthesize
+         * one later through IOMMUFD APIs. If VFIO ever starts exposing it,
+         * record its presence here so we do not create a duplicate CAP.
+         */
+        case PCI_EXT_CAP_ID_PASID:
+             pasid_cap_added = true;
+             /* fallthrough */
         default:
             pcie_add_capability(pdev, cap_id, cap_ver, next, size);
         }
 
     }
 
+#ifdef CONFIG_IOMMUFD
+    /* Try to retrieve PASID CAP through IOMMUFD APIs */
+    if (!pasid_cap_added && hiodc && hiodc->get_cap) {
+        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_GENERIC_HW, &hw_caps, NULL);
+        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2,
+                       &max_pasid_log2, NULL);
+    }
+
+    /*
+     * If supported, adds the PASID capability in the end of the PCIe config
+     * space. TODO: Add option for enabling pasid at a safe offset.
+     */
+    if (max_pasid_log2 && (pci_device_get_viommu_flags(pdev) &
+                           VIOMMU_FLAG_PASID_SUPPORTED)) {
+        bool exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC);
+        bool priv_mod = (hw_caps & IOMMU_HW_CAP_PCI_PASID_PRIV);
+
+        pcie_pasid_init(pdev, PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF,
+                        max_pasid_log2, exec_perm, priv_mod);
+        /* PASID capability is fully emulated by QEMU */
+        memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff, 8);
+    }
+#endif
+
     /* Cleanup chain head ID if necessary */
     if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
         pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
diff --git a/include/hw/iommu.h b/include/hw/iommu.h
index 9b8bb94fc2..9635770bee 100644
--- a/include/hw/iommu.h
+++ b/include/hw/iommu.h
@@ -20,6 +20,7 @@
 enum viommu_flags {
     /* vIOMMU needs nesting parent HWPT to create nested HWPT */
     VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
+    VIOMMU_FLAG_PASID_SUPPORTED = BIT_ULL(1),
 };
 
 #endif /* HW_IOMMU_H */
-- 
2.43.0

Re: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM

Posted by Cédric Le Goater 1 month, 3 weeks ago

On 11/20/25 14:22, Shameer Kolothum wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> If user wants to expose PASID capability in vIOMMU, then VFIO would also
> need to report the PASID cap for this device if the underlying hardware
> supports it as well.
> 
> As a start, this chooses to put the vPASID cap in the last 8 bytes of the
> vconfig space. This is a choice in the good hope of no conflict with any
> existing cap or hidden registers. For the devices that has hidden registers,
> user should figure out a proper offset for the vPASID cap. This may require
> an option for user to config it. Here we leave it as a future extension.
> There are more discussions on the mechanism of finding the proper offset.
> 
> https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
> 
> Since we add a check to ensure the vIOMMU supports PASID, only devices
> under those vIOMMUs can synthesize the vPASID capability. This gives
> users control over which devices expose vPASID.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>   hw/vfio/pci.c      | 38 ++++++++++++++++++++++++++++++++++++++
>   include/hw/iommu.h |  1 +
>   2 files changed, 39 insertions(+)


I just noticed another problem with this change. It relies on the
availability of the HostIOMMUDevice which doesn't exist with VFIO
mdev devices, such as vGPU. QEMU simply coredumps :/

We will have to check/protect QEMU in some ways. I need to take
a closer look because mdev handling seems to be spread across
the code and may need to be improved first.
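
To make the failure mode concrete, here is a minimal standalone sketch of
the kind of guard that seems to be needed. It does not use the QEMU types:
HostDev, HostDevOps, demo_get_pasid_info() and can_synthesize_pasid() are
all hypothetical stand-ins, and the assumption is simply that a device
without a host IOMMU object shows up as a NULL pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the host IOMMU device object and its class. */
typedef struct HostDev HostDev;

typedef struct HostDevOps {
    /* Optional callback: may be NULL if the backend has no PASID info. */
    bool (*get_pasid_info)(HostDev *dev, uint8_t *max_pasid_log2);
} HostDevOps;

struct HostDev {
    const HostDevOps *ops;
    uint8_t max_pasid_log2;   /* 0 means "no PASID support" */
};

/* Demo backend: reports PASID info only when a width is advertised. */
static bool demo_get_pasid_info(HostDev *dev, uint8_t *max_pasid_log2)
{
    if (!dev->max_pasid_log2) {
        return false;
    }
    *max_pasid_log2 = dev->max_pasid_log2;
    return true;
}

static const HostDevOps demo_ops = {
    .get_pasid_info = demo_get_pasid_info,
};

/*
 * Check every link in the chain before dereferencing it: the device
 * object itself (absent for mdev-like devices in this model), its ops
 * table, and the optional callback.
 */
static bool can_synthesize_pasid(HostDev *dev, bool viommu_pasid,
                                 uint8_t *max_pasid_log2)
{
    if (!dev || !dev->ops || !dev->ops->get_pasid_info) {
        return false;
    }
    if (!viommu_pasid) {
        return false;
    }
    return dev->ops->get_pasid_info(dev, max_pasid_log2);
}
```

The point of the sketch is only the ordering: nothing is looked up through
the device pointer until it is known to be non-NULL.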

Thanks,

C.

> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 8b8bc5a421..e11e39d667 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -24,6 +24,7 @@
>   #include <sys/ioctl.h>
>   
>   #include "hw/hw.h"
> +#include "hw/iommu.h"
>   #include "hw/pci/msi.h"
>   #include "hw/pci/msix.h"
>   #include "hw/pci/pci_bridge.h"
> @@ -2500,7 +2501,12 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
>   
>   static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>   {
> +    HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
> +    HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
>       PCIDevice *pdev = PCI_DEVICE(vdev);
> +    uint64_t max_pasid_log2 = 0;
> +    bool pasid_cap_added = false;
> +    uint64_t hw_caps;
>       uint32_t header;
>       uint16_t cap_id, next, size;
>       uint8_t cap_ver;
> @@ -2578,12 +2584,44 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>                   pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>               }
>               break;
> +        /*
> +         * VFIO kernel does not expose the PASID CAP today. We may synthesize
> +         * one later through IOMMUFD APIs. If VFIO ever starts exposing it,
> +         * record its presence here so we do not create a duplicate CAP.
> +         */
> +        case PCI_EXT_CAP_ID_PASID:
> +             pasid_cap_added = true;
> +             /* fallthrough */
>           default:
>               pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>           }
>   
>       }
>   
> +#ifdef CONFIG_IOMMUFD
> +    /* Try to retrieve PASID CAP through IOMMUFD APIs */
> +    if (!pasid_cap_added && hiodc && hiodc->get_cap) {
> +        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_GENERIC_HW, &hw_caps, NULL);
> +        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2,
> +                       &max_pasid_log2, NULL);
> +    }
> +
> +    /*
> +     * If supported, adds the PASID capability in the end of the PCIe config
> +     * space. TODO: Add option for enabling pasid at a safe offset.
> +     */
> +    if (max_pasid_log2 && (pci_device_get_viommu_flags(pdev) &
> +                           VIOMMU_FLAG_PASID_SUPPORTED)) {
> +        bool exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC);
> +        bool priv_mod = (hw_caps & IOMMU_HW_CAP_PCI_PASID_PRIV);
> +
> +        pcie_pasid_init(pdev, PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF,
> +                        max_pasid_log2, exec_perm, priv_mod);
> +        /* PASID capability is fully emulated by QEMU */
> +        memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff, 8);
> +    }
> +#endif
> +
>       /* Cleanup chain head ID if necessary */
>       if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
>           pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> index 9b8bb94fc2..9635770bee 100644
> --- a/include/hw/iommu.h
> +++ b/include/hw/iommu.h
> @@ -20,6 +20,7 @@
>   enum viommu_flags {
>       /* vIOMMU needs nesting parent HWPT to create nested HWPT */
>       VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
> +    VIOMMU_FLAG_PASID_SUPPORTED = BIT_ULL(1),
>   };
>   
>   #endif /* HW_IOMMU_H */

RE: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM

Posted by Shameer Kolothum 1 month, 1 week ago

Hi Cédric,

> -----Original Message-----
> From: Cédric Le Goater <clg@redhat.com>
> Sent: 15 December 2025 10:55
> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-arm@nongnu.org;
> qemu-devel@nongnu.org
> Cc: eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>;
> ddutile@redhat.com; berrange@redhat.com; Nathan Chen
> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
> Krishnakant Jaju <kjaju@nvidia.com>
> Subject: Re: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM
> 
> 
> On 11/20/25 14:22, Shameer Kolothum wrote:
> > From: Yi Liu <yi.l.liu@intel.com>
> >
> > If user wants to expose PASID capability in vIOMMU, then VFIO would also
> > need to report the PASID cap for this device if the underlying hardware
> > supports it as well.
> >
> > As a start, this chooses to put the vPASID cap in the last 8 bytes of the
> > vconfig space. This is a choice in the good hope of no conflict with any
> > existing cap or hidden registers. For the devices that has hidden registers,
> > user should figure out a proper offset for the vPASID cap. This may require
> > an option for user to config it. Here we leave it as a future extension.
> > There are more discussions on the mechanism of finding the proper offset.
> >
> > https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
> >
> > Since we add a check to ensure the vIOMMU supports PASID, only devices
> > under those vIOMMUs can synthesize the vPASID capability. This gives
> > users control over which devices expose vPASID.
> >
> > Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> > Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> > Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> > ---
> >   hw/vfio/pci.c      | 38 ++++++++++++++++++++++++++++++++++++++
> >   include/hw/iommu.h |  1 +
> >   2 files changed, 39 insertions(+)
> 
> 
> I just noticed another problem with this change. It relies on the
> availability of the HostIOMMUDevice which doesn't exist with VFIO
> mdev devices, such as vGPU. QEMU simply coredumps :/
> 
> We will have to check/protect QEMU in some ways. I need to take
> a closer look because mdev handling seems to be spread across
> the code and may need to be improved first.

I did attempt a rework of this patch and the previous one (patch #31) to
address the above issue and to avoid the #ifdef CONFIG_IOMMUFD in
vfio. Please find it below:

Patch #1:
This adds get_pasid_info to HostIOMMUDeviceClass. One thing I am not
sure about is whether to use #ifdef CONFIG_LINUX below. Please take a look
and let me know whether this is the right direction.


From e1305b0d44b2002778059decc3d6b220414b0589 Mon Sep 17 00:00:00 2001
From: Shameer Kolothum <skolothumtho@nvidia.com>
Date: Fri, 2 Jan 2026 14:50:58 +0000
Subject: [PATCH 1/2] backends/iommufd: Add get_pasid_info

TODO:

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 backends/iommufd.c                 | 17 +++++++++++++++++
 include/system/host_iommu_device.h | 19 +++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/backends/iommufd.c b/backends/iommufd.c
index 2c9ce1a03a..7beff372ba 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -634,11 +634,28 @@ static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
     }
 }

+static bool hiod_iommufd_get_pasid_info(HostIOMMUDevice *hiod,
+                                        HostIOMMUDevicePasidInfo *pasid_info)
+{
+    HostIOMMUDeviceCaps *caps = &hiod->caps;
+
+    if (!caps->max_pasid_log2) {
+        return false;
+    }
+
+    g_assert(pasid_info);
+    pasid_info->exec_perm = (caps->hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC);
+    pasid_info->priv_mod = (caps->hw_caps & IOMMU_HW_CAP_PCI_PASID_PRIV);
+    pasid_info->max_pasid_log2 = caps->max_pasid_log2;
+    return true;
+}
+
 static void hiod_iommufd_class_init(ObjectClass *oc, const void *data)
 {
     HostIOMMUDeviceClass *hioc = HOST_IOMMU_DEVICE_CLASS(oc);

     hioc->get_cap = hiod_iommufd_get_cap;
+    hioc->get_pasid_info = hiod_iommufd_get_pasid_info;
 };

 static const TypeInfo types[] = {
diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
index bfb2b60478..6e62f643fe 100644
--- a/include/system/host_iommu_device.h
+++ b/include/system/host_iommu_device.h
@@ -22,6 +22,13 @@ typedef union VendorCaps {
     struct iommu_hw_info_arm_smmuv3 smmuv3;
 } VendorCaps;

+
+typedef struct HostIOMMUDevicePasidInfo {
+    bool exec_perm;
+    bool priv_mod;
+    uint64_t max_pasid_log2;
+} HostIOMMUDevicePasidInfo;
+
 /**
  * struct HostIOMMUDeviceCaps - Define host IOMMU device capabilities.
  *
@@ -116,6 +123,18 @@ struct HostIOMMUDeviceClass {
      * @hiod: handle to the host IOMMU device
      */
     uint64_t (*get_page_size_mask)(HostIOMMUDevice *hiod);
+#ifdef CONFIG_LINUX
+    /**
+     * @get_pasid_info: Return the PASID information associated with the Host
+     * IOMMU device.
+     *
+     * @pasid_info: If success, returns the PASID related information.
+     *
+     * Returns: true on success, false on failure.
+     */
+    bool (*get_pasid_info)(HostIOMMUDevice *hiod,
+                           HostIOMMUDevicePasidInfo *pasid_info);
+#endif
 };

 /*
--

Patch #2:
This adds a check for mdev to avoid the coredump mentioned above.
Please let me know if you have a nicer/broader solution to address this
for mdev devices.

From bbb54b349fccd61d8dab6b95be11c42510db3f95 Mon Sep 17 00:00:00 2001
From: Shameer Kolothum <skolothumtho@nvidia.com>
Date: Fri, 2 Jan 2026 14:52:26 +0000
Subject: [PATCH 2/2] hw/vfio/pci: Add pasid cap synthesize

TODO:

Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
---
 hw/vfio/pci.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 8b8bc5a421..5f1a93cfc8 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -24,6 +24,7 @@
 #include <sys/ioctl.h>

 #include "hw/hw.h"
+#include "hw/iommu.h"
 #include "hw/pci/msi.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pci_bridge.h"
@@ -2498,9 +2499,41 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
     return 0;
 }

+/*
+ * Try to retrieve PASID CAP through IOMMUFD APIs. If available, adds the
+ * PASID capability in the end of the PCIe config space.
+ * TODO: Add support for enabling pasid at a safe offset.
+ */
+static void vfio_pci_synthesize_pasid_cap(VFIOPCIDevice *vdev)
+{
+    HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
+    PCIDevice *pdev = PCI_DEVICE(vdev);
+    HostIOMMUDeviceClass *hiodc;
+    HostIOMMUDevicePasidInfo pasid_info;
+
+    if (vdev->vbasedev.mdev) {
+        return;
+    }
+
+    hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
+    if (!hiodc->get_pasid_info ||
+        !(pci_device_get_viommu_flags(pdev) & VIOMMU_FLAG_PASID_SUPPORTED)) {
+        return;
+    }
+
+    if (hiodc->get_pasid_info(hiod, &pasid_info)) {
+        pcie_pasid_init(pdev, PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF,
+                        pasid_info.max_pasid_log2, pasid_info.exec_perm,
+                        pasid_info.priv_mod);
+        /* PASID capability is fully emulated by QEMU */
+        memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff, 8);
+    }
+}
+
 static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
 {
     PCIDevice *pdev = PCI_DEVICE(vdev);
+    bool pasid_cap_added = false;
     uint32_t header;
     uint16_t cap_id, next, size;
     uint8_t cap_ver;
@@ -2578,12 +2611,24 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
                 pcie_add_capability(pdev, cap_id, cap_ver, next, size);
             }
             break;
+        /*
+         * VFIO kernel does not expose the PASID CAP today. We may synthesize
+         * one later through IOMMUFD APIs. If VFIO ever starts exposing it,
+         * record its presence here so we do not create a duplicate CAP.
+         */
+        case PCI_EXT_CAP_ID_PASID:
+             pasid_cap_added = true;
+             /* fallthrough */
         default:
             pcie_add_capability(pdev, cap_id, cap_ver, next, size);
         }

     }

+    if (!pasid_cap_added) {
+        vfio_pci_synthesize_pasid_cap(vdev);
+    }
+
     /* Cleanup chain head ID if necessary */
     if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
         pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
--
2.43.0

Please let me know your thoughts.

Thanks,
Shameer

Re: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM

Posted by Cédric Le Goater 1 month ago

Hello Shameer,

On 1/2/26 16:35, Shameer Kolothum wrote:
> Hi Cédric,
> 
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Sent: 15 December 2025 10:55
>> To: Shameer Kolothum <skolothumtho@nvidia.com>; qemu-arm@nongnu.org;
>> qemu-devel@nongnu.org
>> Cc: eric.auger@redhat.com; peter.maydell@linaro.org; Jason Gunthorpe
>> <jgg@nvidia.com>; Nicolin Chen <nicolinc@nvidia.com>;
>> ddutile@redhat.com; berrange@redhat.com; Nathan Chen
>> <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; yi.l.liu@intel.com;
>> Krishnakant Jaju <kjaju@nvidia.com>
>> Subject: Re: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM
>>
>>
>> On 11/20/25 14:22, Shameer Kolothum wrote:
>>> From: Yi Liu <yi.l.liu@intel.com>
>>>
>>> If user wants to expose PASID capability in vIOMMU, then VFIO would also
>>> need to report the PASID cap for this device if the underlying hardware
>>> supports it as well.
>>>
>>> As a start, this chooses to put the vPASID cap in the last 8 bytes of the
>>> vconfig space. This is a choice in the good hope of no conflict with any
>>> existing cap or hidden registers. For the devices that has hidden registers,
>>> user should figure out a proper offset for the vPASID cap. This may require
>>> an option for user to config it. Here we leave it as a future extension.
>>> There are more discussions on the mechanism of finding the proper offset.
>>>
>>> https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
>>>
>>> Since we add a check to ensure the vIOMMU supports PASID, only devices
>>> under those vIOMMUs can synthesize the vPASID capability. This gives
>>> users control over which devices expose vPASID.
>>>
>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
>>> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>>> ---
>>>    hw/vfio/pci.c      | 38 ++++++++++++++++++++++++++++++++++++++
>>>    include/hw/iommu.h |  1 +
>>>    2 files changed, 39 insertions(+)
>>
>>
>> I just noticed another problem with this change. It relies on the
>> availability of the HostIOMMUDevice which doesn't exist with VFIO
>> mdev devices, such as vGPU. QEMU simply coredumps :/
>>
>> We will have to check/protect QEMU in some ways. I need to take
>> a closer look because mdev handling seems to be spread across
>> the code and may need to be improved first.
> 
> I did attempt a rework on this patch and the previous one(patch #31) to
> address the above issue and to avoid the #ifdef CONFIG_IOMMUFD in
> vfio. Please find below:
> 
> Patch #1:
> This adds get_pasid_info to  HostIOMMUDeviceClass. One thing I am not
> sure, below is to use #ifdef CONFIG_LINUX or not. Please take a look
> and let me know if this is the right direction or not.

I don't think CONFIG_LINUX is needed there because the declarations
are not specific to Linux. A simple way to try a Windows build is with:

   --cross-prefix=x86_64-w64-mingw32-

You might need to add:

   --disable-sdl

and the targets should be aarch64-softmmu,ppc64-softmmu,x86_64-softmmu,s390x-softmmu

You should resend a v7, in whole or in parts, as you wish.

Thanks,

C.

> 
> 
>  From e1305b0d44b2002778059decc3d6b220414b0589 Mon Sep 17 00:00:00 2001
> From: Shameer Kolothum <skolothumtho@nvidia.com>
> Date: Fri, 2 Jan 2026 14:50:58 +0000
> Subject: [PATCH 1/2] backends/iommufd: Add get_pasid_info
> 
> TODO:
> 
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>   backends/iommufd.c                 | 17 +++++++++++++++++
>   include/system/host_iommu_device.h | 19 +++++++++++++++++++
>   2 files changed, 36 insertions(+)
> 
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 2c9ce1a03a..7beff372ba 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -634,11 +634,28 @@ static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
>       }
>   }
> 
> +static bool hiod_iommufd_get_pasid_info(HostIOMMUDevice *hiod,
> +                                        HostIOMMUDevicePasidInfo *pasid_info)
> +{
> +    HostIOMMUDeviceCaps *caps = &hiod->caps;
> +
> +    if (!caps->max_pasid_log2) {
> +        return false;
> +    }
> +
> +    g_assert(pasid_info);
> +    pasid_info->exec_perm = (caps->hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC);
> +    pasid_info->priv_mod = (caps->hw_caps & IOMMU_HW_CAP_PCI_PASID_PRIV);
> +    pasid_info->max_pasid_log2 = caps->max_pasid_log2;
> +    return true;
> +}
> +
>   static void hiod_iommufd_class_init(ObjectClass *oc, const void *data)
>   {
>       HostIOMMUDeviceClass *hioc = HOST_IOMMU_DEVICE_CLASS(oc);
> 
>       hioc->get_cap = hiod_iommufd_get_cap;
> +    hioc->get_pasid_info = hiod_iommufd_get_pasid_info;
>   };
> 
>   static const TypeInfo types[] = {
> diff --git a/include/system/host_iommu_device.h b/include/system/host_iommu_device.h
> index bfb2b60478..6e62f643fe 100644
> --- a/include/system/host_iommu_device.h
> +++ b/include/system/host_iommu_device.h
> @@ -22,6 +22,13 @@ typedef union VendorCaps {
>       struct iommu_hw_info_arm_smmuv3 smmuv3;
>   } VendorCaps;
> 
> +
> +typedef struct HostIOMMUDevicePasidInfo {
> +    bool exec_perm;
> +    bool priv_mod;
> +    uint64_t max_pasid_log2;
> +} HostIOMMUDevicePasidInfo;
> +
>   /**
>    * struct HostIOMMUDeviceCaps - Define host IOMMU device capabilities.
>    *
> @@ -116,6 +123,18 @@ struct HostIOMMUDeviceClass {
>        * @hiod: handle to the host IOMMU device
>        */
>       uint64_t (*get_page_size_mask)(HostIOMMUDevice *hiod);
> +#ifdef CONFIG_LINUX
> +    /**
> +     * @get_pasid_info: Return the PASID information associated with the Host
> +     * IOMMU device.
> +     *
> +     * @pasid_info: If success, returns the PASID related information.
> +     *
> +     * Returns: true on success, false on failure.
> +     */
> +    bool (*get_pasid_info)(HostIOMMUDevice *hiod,
> +                           HostIOMMUDevicePasidInfo *pasid_info);
> +#endif
>   };
> 
>   /*
> --
> 
> Patch #2:
> This adds a check for mdev to avoid the coredump mentioned above.
> Please let me know If you have a nicer/broader solution to address this
> for mdev devices.
> 
>  From bbb54b349fccd61d8dab6b95be11c42510db3f95 Mon Sep 17 00:00:00 2001
> From: Shameer Kolothum <skolothumtho@nvidia.com>
> Date: Fri, 2 Jan 2026 14:52:26 +0000
> Subject: [PATCH 2/2] hw/vfio/pci: Add pasid cap synthesize
> 
> TODO:
> 
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>   hw/vfio/pci.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 45 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 8b8bc5a421..5f1a93cfc8 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -24,6 +24,7 @@
>   #include <sys/ioctl.h>
> 
>   #include "hw/hw.h"
> +#include "hw/iommu.h"
>   #include "hw/pci/msi.h"
>   #include "hw/pci/msix.h"
>   #include "hw/pci/pci_bridge.h"
> @@ -2498,9 +2499,41 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
>       return 0;
>   }
> 
> +/*
> + * Try to retrieve PASID CAP through IOMMUFD APIs. If available, adds the
> + * PASID capability in the end of the PCIe config space.
> + * TODO: Add support for enabling pasid at a safe offset.
> + */
> +static void vfio_pci_synthesize_pasid_cap(VFIOPCIDevice *vdev)
> +{
> +    HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
> +    PCIDevice *pdev = PCI_DEVICE(vdev);
> +    HostIOMMUDeviceClass *hiodc;
> +    HostIOMMUDevicePasidInfo pasid_info;
> +
> +    if (vdev->vbasedev.mdev) {
> +        return;
> +    }
> +
> +    hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
> +    if (!hiodc->get_pasid_info ||
> +        !(pci_device_get_viommu_flags(pdev) & VIOMMU_FLAG_PASID_SUPPORTED)) {
> +        return;
> +    }
> +
> +    if (hiodc->get_pasid_info(hiod, &pasid_info)) {
> +        pcie_pasid_init(pdev, PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF,
> +                        pasid_info.max_pasid_log2, pasid_info.exec_perm,
> +                        pasid_info.priv_mod);
> +        /* PASID capability is fully emulated by QEMU */
> +        memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff, 8);
> +    }
> +}
> +
>   static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>   {
>       PCIDevice *pdev = PCI_DEVICE(vdev);
> +    bool pasid_cap_added = false;
>       uint32_t header;
>       uint16_t cap_id, next, size;
>       uint8_t cap_ver;
> @@ -2578,12 +2611,24 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>                   pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>               }
>               break;
> +        /*
> +         * VFIO kernel does not expose the PASID CAP today. We may synthesize
> +         * one later through IOMMUFD APIs. If VFIO ever starts exposing it,
> +         * record its presence here so we do not create a duplicate CAP.
> +         */
> +        case PCI_EXT_CAP_ID_PASID:
> +             pasid_cap_added = true;
> +             /* fallthrough */
>           default:
>               pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>           }
> 
>       }
> 
> +    if (!pasid_cap_added) {
> +        vfio_pci_synthesize_pasid_cap(vdev);
> +    }
> +
>       /* Cleanup chain head ID if necessary */
>       if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
>           pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> --
> 2.43.0
> 
> Please let me know your thoughts.
> 
> Thanks,
> Shameer

Re: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM

Posted by Eric Auger 2 months ago

Hi Shameer,
On 11/20/25 2:22 PM, Shameer Kolothum wrote:
> From: Yi Liu <yi.l.liu@intel.com>
>
> If user wants to expose PASID capability in vIOMMU, then VFIO would also
> need to report the PASID cap for this device if the underlying hardware
> supports it as well.
>
> As a start, this chooses to put the vPASID cap in the last 8 bytes of the
> vconfig space. This is a choice in the good hope of no conflict with any
> existing cap or hidden registers. For the devices that has hidden registers,
> user should figure out a proper offset for the vPASID cap. This may require
> an option for user to config it. Here we leave it as a future extension.
> There are more discussions on the mechanism of finding the proper offset.
>
> https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
>
> Since we add a check to ensure the vIOMMU supports PASID, only devices
> under those vIOMMUs can synthesize the vPASID capability. This gives
> users control over which devices expose vPASID.
>
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> ---
>  hw/vfio/pci.c      | 38 ++++++++++++++++++++++++++++++++++++++
>  include/hw/iommu.h |  1 +
>  2 files changed, 39 insertions(+)
>
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 8b8bc5a421..e11e39d667 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -24,6 +24,7 @@
>  #include <sys/ioctl.h>
>  
>  #include "hw/hw.h"
> +#include "hw/iommu.h"
>  #include "hw/pci/msi.h"
>  #include "hw/pci/msix.h"
>  #include "hw/pci/pci_bridge.h"
> @@ -2500,7 +2501,12 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
>  
>  static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>  {
> +    HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
> +    HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
>      PCIDevice *pdev = PCI_DEVICE(vdev);
> +    uint64_t max_pasid_log2 = 0;
> +    bool pasid_cap_added = false;
> +    uint64_t hw_caps;
>      uint32_t header;
>      uint16_t cap_id, next, size;
>      uint8_t cap_ver;
> @@ -2578,12 +2584,44 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>                  pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>              }
>              break;
> +        /*
> +         * VFIO kernel does not expose the PASID CAP today. We may synthesize
> +         * one later through IOMMUFD APIs. If VFIO ever starts exposing it,
> +         * record its presence here so we do not create a duplicate CAP.
> +         */
> +        case PCI_EXT_CAP_ID_PASID:
> +             pasid_cap_added = true;
> +             /* fallthrough */
>          default:
>              pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>          }
>  
>      }
>  
> +#ifdef CONFIG_IOMMUFD
> +    /* Try to retrieve PASID CAP through IOMMUFD APIs */
> +    if (!pasid_cap_added && hiodc && hiodc->get_cap) {
> +        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_GENERIC_HW, &hw_caps, NULL);
> +        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2,
> +                       &max_pasid_log2, NULL);
> +    }
> +
> +    /*
> +     * If supported, adds the PASID capability in the end of the PCIe config
> +     * space. TODO: Add option for enabling pasid at a safe offset.
> +     */
> +    if (max_pasid_log2 && (pci_device_get_viommu_flags(pdev) &
> +                           VIOMMU_FLAG_PASID_SUPPORTED)) {
> +        bool exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC);
> +        bool priv_mod = (hw_caps & IOMMU_HW_CAP_PCI_PASID_PRIV);
> +
> +        pcie_pasid_init(pdev, PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF,
> +                        max_pasid_log2, exec_perm, priv_mod);
> +        /* PASID capability is fully emulated by QEMU */
> +        memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff, 8);
> +    }
> +#endif
> +
>      /* Cleanup chain head ID if necessary */
>      if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
>          pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> index 9b8bb94fc2..9635770bee 100644
> --- a/include/hw/iommu.h
> +++ b/include/hw/iommu.h
> @@ -20,6 +20,7 @@
>  enum viommu_flags {
>      /* vIOMMU needs nesting parent HWPT to create nested HWPT */
>      VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
> +    VIOMMU_FLAG_PASID_SUPPORTED = BIT_ULL(1),
>  };
>  
>  #endif /* HW_IOMMU_H */
Apart from the offset being chosen arbitrarily so that this is the
last cap in the vconfig space, the code looks good to me.
So
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Just wondering whether we could add some generic PCIe code that
parses the extended cap linked list to check that the offset range is not
used by another cap before allowing insertion at a given offset.
This wouldn't prevent a subsequent addition from failing, but at least we
would know if there is a collision. This could be added later on, though.
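
A minimal sketch of such a check, assuming a raw 4 KiB image of the config
space rather than QEMU's PCIDevice. The names cfg_read32(),
cfg_write_ext_hdr() and ext_cap_range_has_header() are hypothetical and not
existing QEMU helpers. It only detects capability headers that start inside
the candidate range; it cannot tell whether the body of an earlier capability
reaches into it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CFG_SIZE      0x1000u  /* PCIe config space is 4 KiB */
#define EXT_CAP_START 0x100u   /* first extended capability header */

/* Read a little-endian 32-bit value from the config space image. */
static uint32_t cfg_read32(const uint8_t *cfg, uint16_t off)
{
    return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/* Hypothetical helper: write an extended capability header (ID, version, next). */
static void cfg_write_ext_hdr(uint8_t *cfg, uint16_t off, uint16_t id,
                              uint8_t ver, uint16_t next)
{
    uint32_t hdr = (uint32_t)id | ((uint32_t)(ver & 0xf) << 16) |
                   ((uint32_t)(next & 0xffc) << 20);

    cfg[off] = hdr & 0xff;
    cfg[off + 1] = (hdr >> 8) & 0xff;
    cfg[off + 2] = (hdr >> 16) & 0xff;
    cfg[off + 3] = (hdr >> 24) & 0xff;
}

/*
 * Return true if any extended capability header in the linked list starts
 * inside [off, off + len).
 */
static bool ext_cap_range_has_header(const uint8_t *cfg, uint16_t off,
                                     uint16_t len)
{
    uint16_t pos = EXT_CAP_START;
    int budget = (CFG_SIZE - EXT_CAP_START) / 4;  /* guard against loops */

    assert(off >= EXT_CAP_START && (uint32_t)off + len <= CFG_SIZE);

    while (pos >= EXT_CAP_START && pos <= CFG_SIZE - 4 && budget-- > 0) {
        uint32_t hdr = cfg_read32(cfg, pos);

        if (hdr == 0 || hdr == 0xffffffffu) {
            break;  /* empty list or absent device */
        }
        if (pos >= off && pos < off + len) {
            return true;
        }
        pos = (hdr >> 20) & 0xffc;  /* next pointer: bits 31:20, dword aligned */
    }
    return false;
}
```

Calling ext_cap_range_has_header(cfg, 0xff8, 8) before inserting PASID would
at least flag a device whose own list already places a capability header in
the last 8 bytes.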

Thanks

Eric

Re: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM

Posted by Yi Liu 2 months ago

On 2025/12/9 17:51, Eric Auger wrote:
> Hi Shameer,
> On 11/20/25 2:22 PM, Shameer Kolothum wrote:
>> From: Yi Liu <yi.l.liu@intel.com>
>>
>> If user wants to expose PASID capability in vIOMMU, then VFIO would also
>> need to report the PASID cap for this device if the underlying hardware
>> supports it as well.
>>
>> As a start, this chooses to put the vPASID cap in the last 8 bytes of the
>> vconfig space. This is a choice in the good hope of no conflict with any
>> existing cap or hidden registers. For the devices that has hidden registers,
>> user should figure out a proper offset for the vPASID cap. This may require
>> an option for user to config it. Here we leave it as a future extension.
>> There are more discussions on the mechanism of finding the proper offset.
>>
>> https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
>>
>> Since we add a check to ensure the vIOMMU supports PASID, only devices
>> under those vIOMMUs can synthesize the vPASID capability. This gives
>> users control over which devices expose vPASID.
>>
>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
>> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>> ---
>>   hw/vfio/pci.c      | 38 ++++++++++++++++++++++++++++++++++++++
>>   include/hw/iommu.h |  1 +
>>   2 files changed, 39 insertions(+)
>>
>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>> index 8b8bc5a421..e11e39d667 100644
>> --- a/hw/vfio/pci.c
>> +++ b/hw/vfio/pci.c
>> @@ -24,6 +24,7 @@
>>   #include <sys/ioctl.h>
>>   
>>   #include "hw/hw.h"
>> +#include "hw/iommu.h"
>>   #include "hw/pci/msi.h"
>>   #include "hw/pci/msix.h"
>>   #include "hw/pci/pci_bridge.h"
>> @@ -2500,7 +2501,12 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
>>   
>>   static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>>   {
>> +    HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
>> +    HostIOMMUDeviceClass *hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
>>       PCIDevice *pdev = PCI_DEVICE(vdev);
>> +    uint64_t max_pasid_log2 = 0;
>> +    bool pasid_cap_added = false;
>> +    uint64_t hw_caps;
>>       uint32_t header;
>>       uint16_t cap_id, next, size;
>>       uint8_t cap_ver;
>> @@ -2578,12 +2584,44 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>>                   pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>>               }
>>               break;
>> +        /*
>> +         * VFIO kernel does not expose the PASID CAP today. We may synthesize
>> +         * one later through IOMMUFD APIs. If VFIO ever starts exposing it,
>> +         * record its presence here so we do not create a duplicate CAP.
>> +         */
>> +        case PCI_EXT_CAP_ID_PASID:
>> +             pasid_cap_added = true;
>> +             /* fallthrough */
>>           default:
>>               pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>>           }
>>   
>>       }
>>   
>> +#ifdef CONFIG_IOMMUFD
>> +    /* Try to retrieve PASID CAP through IOMMUFD APIs */
>> +    if (!pasid_cap_added && hiodc && hiodc->get_cap) {
>> +        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_GENERIC_HW, &hw_caps, NULL);
>> +        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2,
>> +                       &max_pasid_log2, NULL);
>> +    }
>> +
>> +    /*
>> +     * If supported, adds the PASID capability in the end of the PCIe config
>> +     * space. TODO: Add option for enabling pasid at a safe offset.
>> +     */
>> +    if (max_pasid_log2 && (pci_device_get_viommu_flags(pdev) &
>> +                           VIOMMU_FLAG_PASID_SUPPORTED)) {
>> +        bool exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC);
>> +        bool priv_mod = (hw_caps & IOMMU_HW_CAP_PCI_PASID_PRIV);
>> +
>> +        pcie_pasid_init(pdev, PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF,
>> +                        max_pasid_log2, exec_perm, priv_mod);
>> +        /* PASID capability is fully emulated by QEMU */
>> +        memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff, 8);
>> +    }
>> +#endif
>> +
>>       /* Cleanup chain head ID if necessary */
>>       if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
>>           pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
>> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>> index 9b8bb94fc2..9635770bee 100644
>> --- a/include/hw/iommu.h
>> +++ b/include/hw/iommu.h
>> @@ -20,6 +20,7 @@
>>   enum viommu_flags {
>>       /* vIOMMU needs nesting parent HWPT to create nested HWPT */
>>       VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
>> +    VIOMMU_FLAG_PASID_SUPPORTED = BIT_ULL(1),
>>   };
>>   
>>   #endif /* HW_IOMMU_H */
> Besides the fact the offset is arbitrarily chosen so that this is the
> last cap of the vconfig space, the code looks good to me.
> So
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> 
> Just wondering whether we couldn't add some generic pcie code that
> parses the extended cap linked list to check the offset range is not
> used by another cap before allowing the insertion at a given offset?
> This wouldn't prevent a subsequent addition from failing but at least we
> would know if there is some collision. This could be added later on though.
> 

You're absolutely right. My approach of using the last 8 bytes was a
shortcut to avoid implementing proper capability parsing logic
(importing pci_regs.h and maintaining a cap_id-to-cap_size mapping
table), and it simplified PASID capability detection to examining
the last 8 bytes with a simple dump :(. However, this approach is not
good, as we cannot guarantee that the last 8 bytes are unused on every
device.

Let's just implement the logic to walk the linked list of ext_caps to
find an appropriate offset for our use case.
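
As a rough illustration of the cap_id-to-cap_size mapping mentioned above,
here is a sketch with a handful of fixed-size capabilities. The IDs and sizes
follow the *_SIZEOF definitions in Linux's pci_regs.h as far as I recall them;
the table and the ext_cap_fixed_size() helper are hypothetical, and a real
table would need many more entries plus special handling for variable-size
and vendor-specific capabilities.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ext_cap_size {
    uint16_t id;    /* extended capability ID */
    uint16_t size;  /* total size in bytes, header included */
};

/* Illustrative subset only; unknown IDs are reported as size 0. */
static const struct ext_cap_size ext_cap_sizes[] = {
    { 0x0003, 12 },  /* Device Serial Number */
    { 0x000e,  8 },  /* ARI */
    { 0x000f,  8 },  /* ATS */
    { 0x0013, 16 },  /* PRI */
    { 0x001b,  8 },  /* PASID */
};

/* Return the fixed size for a known capability ID, or 0 if unknown. */
static uint16_t ext_cap_fixed_size(uint16_t id)
{
    size_t n = sizeof(ext_cap_sizes) / sizeof(ext_cap_sizes[0]);

    for (size_t i = 0; i < n; i++) {
        if (ext_cap_sizes[i].id == id) {
            return ext_cap_sizes[i].size;
        }
    }
    return 0;
}
```

A caller would have to treat a return value of 0 as "size unknown" and fall
back to a conservative estimate for that capability.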

@Shameer, apologies for the delayed response.

Regards,
Yi Liu

RE: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM

Posted by Shameer Kolothum 1 month ago

Hi Eric/ Yi,

[Cc: Alex]

> -----Original Message-----
> From: Yi Liu <yi.l.liu@intel.com>
> Sent: 09 December 2025 11:17
> To: eric.auger@redhat.com; Shameer Kolothum
> <skolothumtho@nvidia.com>; qemu-arm@nongnu.org; qemu-
> devel@nongnu.org
> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; Krishnakant Jaju
> <kjaju@nvidia.com>
> Subject: Re: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM
> 
> On 2025/12/9 17:51, Eric Auger wrote:
> > Hi Shameer,
> > On 11/20/25 2:22 PM, Shameer Kolothum wrote:
> >> From: Yi Liu <yi.l.liu@intel.com>
> >>
> >> If user wants to expose PASID capability in vIOMMU, then VFIO would also
> >> need to report the PASID cap for this device if the underlying hardware
> >> supports it as well.
> >>
> >> As a start, this chooses to put the vPASID cap in the last 8 bytes of the
> >> vconfig space. This is a choice in the good hope of no conflict with any
> >> existing cap or hidden registers. For the devices that has hidden registers,
> >> user should figure out a proper offset for the vPASID cap. This may require
> >> an option for user to config it. Here we leave it as a future extension.
> >> There are more discussions on the mechanism of finding the proper offset.
> >>
> >>
> >> https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
> >>
> >> Since we add a check to ensure the vIOMMU supports PASID, only devices
> >> under those vIOMMUs can synthesize the vPASID capability. This gives
> >> users control over which devices expose vPASID.
> >>
> >> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> >> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> >> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> >> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
> >> ---
> >>   hw/vfio/pci.c      | 38 ++++++++++++++++++++++++++++++++++++++
> >>   include/hw/iommu.h |  1 +
> >>   2 files changed, 39 insertions(+)
> >>
> >> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >> index 8b8bc5a421..e11e39d667 100644
> >> --- a/hw/vfio/pci.c
> >> +++ b/hw/vfio/pci.c
> >> @@ -24,6 +24,7 @@
> >>   #include <sys/ioctl.h>
> >>
> >>   #include "hw/hw.h"
> >> +#include "hw/iommu.h"
> >>   #include "hw/pci/msi.h"
> >>   #include "hw/pci/msix.h"
> >>   #include "hw/pci/pci_bridge.h"
> >> @@ -2500,7 +2501,12 @@ static int
> vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
> >>
> >>   static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
> >>   {
> >> +    HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
> >> +    HostIOMMUDeviceClass *hiodc =
> HOST_IOMMU_DEVICE_GET_CLASS(hiod);
> >>       PCIDevice *pdev = PCI_DEVICE(vdev);
> >> +    uint64_t max_pasid_log2 = 0;
> >> +    bool pasid_cap_added = false;
> >> +    uint64_t hw_caps;
> >>       uint32_t header;
> >>       uint16_t cap_id, next, size;
> >>       uint8_t cap_ver;
> >> @@ -2578,12 +2584,44 @@ static void vfio_add_ext_cap(VFIOPCIDevice
> *vdev)
> >>                   pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> >>               }
> >>               break;
> >> +        /*
> >> +         * VFIO kernel does not expose the PASID CAP today. We may
> synthesize
> >> +         * one later through IOMMUFD APIs. If VFIO ever starts exposing it,
> >> +         * record its presence here so we do not create a duplicate CAP.
> >> +         */
> >> +        case PCI_EXT_CAP_ID_PASID:
> >> +             pasid_cap_added = true;
> >> +             /* fallthrough */
> >>           default:
> >>               pcie_add_capability(pdev, cap_id, cap_ver, next, size);
> >>           }
> >>
> >>       }
> >>
> >> +#ifdef CONFIG_IOMMUFD
> >> +    /* Try to retrieve PASID CAP through IOMMUFD APIs */
> >> +    if (!pasid_cap_added && hiodc && hiodc->get_cap) {
> >> +        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_GENERIC_HW,
> &hw_caps, NULL);
> >> +        hiodc->get_cap(hiod,
> HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2,
> >> +                       &max_pasid_log2, NULL);
> >> +    }
> >> +
> >> +    /*
> >> +     * If supported, adds the PASID capability in the end of the PCIe config
> >> +     * space. TODO: Add option for enabling pasid at a safe offset.
> >> +     */
> >> +    if (max_pasid_log2 && (pci_device_get_viommu_flags(pdev) &
> >> +                           VIOMMU_FLAG_PASID_SUPPORTED)) {
> >> +        bool exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC);
> >> +        bool priv_mod = (hw_caps & IOMMU_HW_CAP_PCI_PASID_PRIV);
> >> +
> >> +        pcie_pasid_init(pdev, PCIE_CONFIG_SPACE_SIZE -
> PCI_EXT_CAP_PASID_SIZEOF,
> >> +                        max_pasid_log2, exec_perm, priv_mod);
> >> +        /* PASID capability is fully emulated by QEMU */
> >> +        memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff, 8);
> >> +    }
> >> +#endif
> >> +
> >>       /* Cleanup chain head ID if necessary */
> >>       if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
> >>           pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> >> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
> >> index 9b8bb94fc2..9635770bee 100644
> >> --- a/include/hw/iommu.h
> >> +++ b/include/hw/iommu.h
> >> @@ -20,6 +20,7 @@
> >>   enum viommu_flags {
> >>       /* vIOMMU needs nesting parent HWPT to create nested HWPT */
> >>       VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
> >> +    VIOMMU_FLAG_PASID_SUPPORTED = BIT_ULL(1),
> >>   };
> >>
> >>   #endif /* HW_IOMMU_H */
> > Besides the fact the offset is arbitrarily chosen so that this is the
> > last cap of the vconfig space, the code looks good to me.
> > So
> > Reviewed-by: Eric Auger <eric.auger@redhat.com>
> >
> > Just wondering whether we couldn't add some generic pcie code that
> > parses the extended cap linked list to check the offset range is not
> > used by another cap before allowing the insertion at a given offset?
> > This wouldn't prevent a subsequent addition from failing but at least we
> > would know if there is some collision. This could be added later on though.
> >
> 
> You're absolutely right. My approach of using the last 8 bytes was a
> shortcut to avoid implementing proper capability parsing logic
> (importing pci_regs.h and maintaining a cap_id-to-cap_size mapping
> table), and it simplified PASID capability detection by only examining
> the last 8bytes by a simple dump :(. However, this approach is not
> good as we cannot guarantee that the last 8bytes are unused by any
> device.
> 
> Let's just implement the logic to walk the linked list of ext_caps to
> find an appropriate offset for our use case.

I had a go at this. Based on my understanding, even if we walk the PCIe
extended capability linked list, we still can't easily determine the size
occupied by the last capability: the extended capability header does not
encode a length. It only provides the "next" pointer, and for the last entry
next == 0.

Given that, I tried the following approach:

- locate the last extended capability (using the existing helper),
- reserve a fixed window (512 bytes) for that final capability,
- and synthesize PASID at the end of PCIe config space (0xff8) only if it
  looks like there is enough room.
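
The arithmetic behind the three steps above, as a standalone sketch. The
constants mirror the values used in this thread (4 KiB config space, 8-byte
PASID capability, 512-byte window); the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CFG_SIZE        0x1000u  /* PCIe config space size */
#define PASID_CAP_SIZE  8u       /* size of the PASID extended capability */
#define LAST_CAP_WINDOW 512u     /* space assumed for the last existing cap */

/* Offset of a PASID capability placed in the last 8 bytes: 0xff8. */
static uint16_t pasid_tail_offset(void)
{
    return CFG_SIZE - PASID_CAP_SIZE;
}

/*
 * True if the last capability, assumed to occupy at most LAST_CAP_WINDOW
 * bytes from its header, ends at or before the PASID slot.
 */
static bool pasid_fits_at_tail(uint16_t last_cap_off)
{
    return last_cap_off + LAST_CAP_WINDOW <= pasid_tail_offset();
}
```

With these values the last existing capability may start no higher than
0xff8 - 0x200 = 0xdf8 for the insertion to go ahead.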

Maybe I am missing something here. Please let me know if there is a
better way to address this.

Thanks,
Shameer

diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
index b302de6419..89178f2b7e 100644
--- a/hw/pci/pcie.c
+++ b/hw/pci/pcie.c
@@ -1005,8 +1005,8 @@ bool pcie_cap_is_arifwd_enabled(const PCIDevice *dev)
  */

 /* Passing a cap_id value > 0xffff will return 0 and put end of list in prev */
-static uint16_t pcie_find_capability_list(PCIDevice *dev, uint32_t cap_id,
-                                          uint16_t *prev_p)
+uint16_t pcie_find_capability_list(PCIDevice *dev, uint32_t cap_id,
+                                   uint16_t *prev_p)
 {
     uint16_t prev = 0;
     uint16_t next;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 8b8bc5a421..e8354e8b8d 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -24,6 +24,7 @@
 #include <sys/ioctl.h>

 #include "hw/hw.h"
+#include "hw/iommu.h"
 #include "hw/pci/msi.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pci_bridge.h"
@@ -2498,9 +2499,72 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
     return 0;
 }

+/*
+ * Try to retrieve PASID capability information via IOMMUFD APIs and,
+ * if supported, synthesize a PASID PCIe extended capability for the
+ * VFIO device.
+ *
+ * The PASID capability is placed at the end of the PCIe extended
+ * configuration space. Determining the exact size of the last
+ * existing extended capability is non trivial, as PCIe extended
+ * capabilities do not generically encode their total size and
+ * vendor defined extensions are permitted.
+ *
+ * For now, reserve a fixed window (512 bytes) for the last extended
+ * capability and only insert  PASID if sufficient space remains.
+ */
+static void vfio_pci_synthesize_pasid_cap(VFIOPCIDevice *vdev)
+{
+    HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
+    HostIOMMUDeviceClass *hiodc;
+    HostIOMMUDevicePasidInfo pasid_info;
+    PCIDevice *pdev = PCI_DEVICE(vdev);
+    uint16_t last = 0;
+    uint16_t pasid_offset;
+
+    if (vdev->vbasedev.mdev) {
+        return;
+    }
+
+    hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
+    if (!hiodc || !hiodc->get_pasid_info ||
+        !hiodc->get_pasid_info(hiod, &pasid_info) ||
+        !(pci_device_get_viommu_flags(pdev) & VIOMMU_FLAG_PASID_SUPPORTED)) {
+        return;
+    }
+
+    /*
+     * Locate the last PCIe extended capability present in the device
+     * configuration space.
+     */
+    pcie_find_capability_list(pdev, 0x1ffff, &last);
+
+    /*
+     * Reserve space at the end of PCIe configuration space for PASID.
+     * If the last extended capability appears too close to the end,
+     * refuse to insert PASID.
+     */
+    if (last + 512 > PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF) {
+        warn_report("vfio: no space to synthesize PASID extended capability");
+        return;
+    }
+
+    pasid_offset = PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF;
+
+    pcie_pasid_init(pdev, pasid_offset,
+                    pasid_info.max_pasid_log2,
+                    pasid_info.exec_perm,
+                    pasid_info.priv_mod);
+
+    /* PASID capability is fully emulated by QEMU */
+    memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff,
+           PCI_EXT_CAP_PASID_SIZEOF);
+}
+
 static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
 {
     PCIDevice *pdev = PCI_DEVICE(vdev);
+    bool pasid_cap_added = false;
     uint32_t header;
     uint16_t cap_id, next, size;
     uint8_t cap_ver;
@@ -2562,6 +2626,7 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
          * accesses, exact size doesn't seem worthwhile.
          */
         size = vfio_ext_cap_max_size(config, next);
+        printf("%s: Shameer: cap_id 0x%x size 0x%x\n",__func__, cap_id, size);

         /* Use emulated next pointer to allow dropping extended caps */
         pci_long_test_and_set_mask(vdev->emulated_config_bits + next,
@@ -2578,12 +2643,24 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
                 pcie_add_capability(pdev, cap_id, cap_ver, next, size);
             }
             break;
+        /*
+         * VFIO kernel does not expose the PASID CAP today. We may synthesize
+         * one later through IOMMUFD APIs. If VFIO ever starts exposing it,
+         * record its presence here so we do not create a duplicate CAP.
+         */
+        case PCI_EXT_CAP_ID_PASID:
+             pasid_cap_added = true;
+             /* fallthrough */
         default:
             pcie_add_capability(pdev, cap_id, cap_ver, next, size);
         }

     }

+    if (!pasid_cap_added) {
+        vfio_pci_synthesize_pasid_cap(vdev);
+    }
+
     /* Cleanup chain head ID if necessary */
     if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
         pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
index 42cebcd033..1a477b05b2 100644
--- a/include/hw/pci/pcie.h
+++ b/include/hw/pci/pcie.h
@@ -130,6 +130,8 @@ bool pcie_cap_is_arifwd_enabled(const PCIDevice *dev);

 /* PCI express extended capability helper functions */
 uint16_t pcie_find_capability(PCIDevice *dev, uint16_t cap_id);
+uint16_t pcie_find_capability_list(PCIDevice *dev, uint32_t cap_id,
+                                   uint16_t *prev_p);
 void pcie_add_capability(PCIDevice *dev,
                          uint16_t cap_id, uint8_t cap_ver,
                          uint16_t offset, uint16_t size);
@@ -138,6 +140,7 @@ void pcie_sync_bridge_lnk(PCIDevice *dev);
 void pcie_acs_init(PCIDevice *dev, uint16_t offset);
 void pcie_acs_reset(PCIDevice *dev);

+uint16_t pcie_get_ext_cap_next_offset(PCIDevice *pdev, uint16_t cap_size);
 void pcie_ari_init(PCIDevice *dev, uint16_t offset);
 void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t ser_num);
 void pcie_ats_init(PCIDevice *dev, uint16_t offset, bool aligned);

Re: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM

Posted by Eric Auger 1 month ago

Hi Shameer,

On 1/5/26 5:33 PM, Shameer Kolothum wrote:
> Hi Eric/ Yi,
>
> [Cc: Alex]
>
>> -----Original Message-----
>> From: Yi Liu <yi.l.liu@intel.com>
>> Sent: 09 December 2025 11:17
>> To: eric.auger@redhat.com; Shameer Kolothum
>> <skolothumtho@nvidia.com>; qemu-arm@nongnu.org; qemu-
>> devel@nongnu.org
>> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
>> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
>> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
>> smostafa@google.com; wangzhou1@hisilicon.com;
>> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
>> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; Krishnakant Jaju
>> <kjaju@nvidia.com>
>> Subject: Re: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM
>>
>> On 2025/12/9 17:51, Eric Auger wrote:
>>> Hi Shameer,
>>> On 11/20/25 2:22 PM, Shameer Kolothum wrote:
>>>> From: Yi Liu <yi.l.liu@intel.com>
>>>>
>>>> If user wants to expose PASID capability in vIOMMU, then VFIO would also
>>>> need to report the PASID cap for this device if the underlying hardware
>>>> supports it as well.
>>>>
>>>> As a start, this chooses to put the vPASID cap in the last 8 bytes of the
>>>> vconfig space. This is a choice in the good hope of no conflict with any
>>>> existing cap or hidden registers. For the devices that has hidden registers,
>>>> user should figure out a proper offset for the vPASID cap. This may require
>>>> an option for user to config it. Here we leave it as a future extension.
>>>> There are more discussions on the mechanism of finding the proper offset.
>>>>
>>>>
>>>> https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
>>>> Since we add a check to ensure the vIOMMU supports PASID, only devices
>>>> under those vIOMMUs can synthesize the vPASID capability. This gives
>>>> users control over which devices expose vPASID.
>>>>
>>>> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
>>>> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
>>>> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>>>> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>
>>>> ---
>>>>   hw/vfio/pci.c      | 38 ++++++++++++++++++++++++++++++++++++++
>>>>   include/hw/iommu.h |  1 +
>>>>   2 files changed, 39 insertions(+)
>>>>
>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>> index 8b8bc5a421..e11e39d667 100644
>>>> --- a/hw/vfio/pci.c
>>>> +++ b/hw/vfio/pci.c
>>>> @@ -24,6 +24,7 @@
>>>>   #include <sys/ioctl.h>
>>>>
>>>>   #include "hw/hw.h"
>>>> +#include "hw/iommu.h"
>>>>   #include "hw/pci/msi.h"
>>>>   #include "hw/pci/msix.h"
>>>>   #include "hw/pci/pci_bridge.h"
>>>> @@ -2500,7 +2501,12 @@ static int
>> vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
>>>>   static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>>>>   {
>>>> +    HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
>>>> +    HostIOMMUDeviceClass *hiodc =
>> HOST_IOMMU_DEVICE_GET_CLASS(hiod);
>>>>       PCIDevice *pdev = PCI_DEVICE(vdev);
>>>> +    uint64_t max_pasid_log2 = 0;
>>>> +    bool pasid_cap_added = false;
>>>> +    uint64_t hw_caps;
>>>>       uint32_t header;
>>>>       uint16_t cap_id, next, size;
>>>>       uint8_t cap_ver;
>>>> @@ -2578,12 +2584,44 @@ static void vfio_add_ext_cap(VFIOPCIDevice
>> *vdev)
>>>>                   pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>>>>               }
>>>>               break;
>>>> +        /*
>>>> +         * VFIO kernel does not expose the PASID CAP today. We may
>> synthesize
>>>> +         * one later through IOMMUFD APIs. If VFIO ever starts exposing it,
>>>> +         * record its presence here so we do not create a duplicate CAP.
>>>> +         */
>>>> +        case PCI_EXT_CAP_ID_PASID:
>>>> +             pasid_cap_added = true;
>>>> +             /* fallthrough */
>>>>           default:
>>>>               pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>>>>           }
>>>>
>>>>       }
>>>>
>>>> +#ifdef CONFIG_IOMMUFD
>>>> +    /* Try to retrieve PASID CAP through IOMMUFD APIs */
>>>> +    if (!pasid_cap_added && hiodc && hiodc->get_cap) {
>>>> +        hiodc->get_cap(hiod, HOST_IOMMU_DEVICE_CAP_GENERIC_HW,
>> &hw_caps, NULL);
>>>> +        hiodc->get_cap(hiod,
>> HOST_IOMMU_DEVICE_CAP_MAX_PASID_LOG2,
>>>> +                       &max_pasid_log2, NULL);
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * If supported, adds the PASID capability in the end of the PCIe config
>>>> +     * space. TODO: Add option for enabling pasid at a safe offset.
>>>> +     */
>>>> +    if (max_pasid_log2 && (pci_device_get_viommu_flags(pdev) &
>>>> +                           VIOMMU_FLAG_PASID_SUPPORTED)) {
>>>> +        bool exec_perm = (hw_caps & IOMMU_HW_CAP_PCI_PASID_EXEC);
>>>> +        bool priv_mod = (hw_caps & IOMMU_HW_CAP_PCI_PASID_PRIV);
>>>> +
>>>> +        pcie_pasid_init(pdev, PCIE_CONFIG_SPACE_SIZE -
>> PCI_EXT_CAP_PASID_SIZEOF,
>>>> +                        max_pasid_log2, exec_perm, priv_mod);
>>>> +        /* PASID capability is fully emulated by QEMU */
>>>> +        memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff, 8);
>>>> +    }
>>>> +#endif
>>>> +
>>>>       /* Cleanup chain head ID if necessary */
>>>>       if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
>>>>           pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
>>>> diff --git a/include/hw/iommu.h b/include/hw/iommu.h
>>>> index 9b8bb94fc2..9635770bee 100644
>>>> --- a/include/hw/iommu.h
>>>> +++ b/include/hw/iommu.h
>>>> @@ -20,6 +20,7 @@
>>>>   enum viommu_flags {
>>>>       /* vIOMMU needs nesting parent HWPT to create nested HWPT */
>>>>       VIOMMU_FLAG_WANT_NESTING_PARENT = BIT_ULL(0),
>>>> +    VIOMMU_FLAG_PASID_SUPPORTED = BIT_ULL(1),
>>>>   };
>>>>
>>>>   #endif /* HW_IOMMU_H */
>>> Besides the fact the offset is arbitrarily chosen so that this is the
>>> last cap of the vconfig space, the code looks good to me.
>>> So
>>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>>>
>>> Just wondering whether we couldn't add some generic pcie code that
>>> parses the extended cap linked list to check the offset range is not
>>> used by another cap before allowing the insertion at a given offset?
>>> This wouldn't prevent a subsequent addition from failing but at least we
>>> would know if there is some collision. This could be added later on though.
>>>
>> You're absolutely right. My approach of using the last 8 bytes was a
>> shortcut to avoid implementing proper capability parsing logic
>> (importing pci_regs.h and maintaining a cap_id-to-cap_size mapping
>> table), and it simplified PASID capability detection by only examining
>> the last 8bytes by a simple dump :(. However, this approach is not
>> good as we cannot guarantee that the last 8bytes are unused by any
>> device.
>>
>> Let's just implement the logic to walk the linked list of ext_caps to
>> find an appropriate offset for our use case.
> I had a go at this. Based on my understanding, even if we walk the PCIe
> extended capability linked list, we still can't easily determine the size
> occupied by the last capability as the extended capability header does not
> encode a length, it only provides the "next" pointer, and for the last entry
> next == 0.
If my understanding is correct, when walking the linked list you can
enumerate the start offset of each PCIe extended capability and its size,
which is made up of the fixed header size plus a register block whose size
depends on the capability ID. After that we should be able to allocate a
slot within the holes, or at least check that adding the new capability at
the end of the 4kB is safe, no? What am I missing?
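
To make the idea concrete, here is a sketch of the hole search, assuming the
(start, size) pair of every existing capability has already been collected.
That assumption is the hard part, since sizes are not encoded in the header.
struct cap_range, range_is_free() and find_free_slot() are hypothetical names,
not QEMU APIs. The search runs from the top of the 4kB downwards, so the tail
slot is preferred when it is free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_SIZE      0x1000u  /* PCIe config space size */
#define EXT_CAP_START 0x100u   /* start of extended config space */

struct cap_range {
    uint16_t start;  /* offset of the capability header */
    uint16_t size;   /* bytes occupied, header included */
};

/* True if [off, off + len) overlaps none of the known capabilities. */
static bool range_is_free(const struct cap_range *caps, size_t n,
                          uint32_t off, uint32_t len)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t start = caps[i].start;
        uint32_t end = start + caps[i].size;

        if (off < end && start < off + len) {
            return false;
        }
    }
    return true;
}

/*
 * Return the highest dword-aligned offset in extended config space where
 * len bytes are free, or 0 if there is no such slot.
 */
static uint16_t find_free_slot(const struct cap_range *caps, size_t n,
                               uint16_t len)
{
    if (len == 0 || len > CFG_SIZE - EXT_CAP_START) {
        return 0;
    }
    for (uint32_t off = (CFG_SIZE - len) & ~3u; off >= EXT_CAP_START;
         off -= 4) {
        if (range_is_free(caps, n, off, len)) {
            return (uint16_t)off;
        }
    }
    return 0;
}
```

For an 8-byte PASID capability this returns 0xff8 whenever nothing occupies
the tail, and otherwise the next free slot below it.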

Thanks

Eric
>
> Given that, I tried the following approach,
>
> -locate the last extended capability (using the existing helper),
> -reserve a fixed window (512 bytes) for that final capability,
> -and synthesize PASID at the end of PCIe config space (0xff8) only if it
> looks like there is enough room.
>
> Maybe I am missing something here.. Please let me know if there is a
> better way to address this.
>
> Thanks,
> Shameer
>
> diff --git a/hw/pci/pcie.c b/hw/pci/pcie.c
> index b302de6419..89178f2b7e 100644
> --- a/hw/pci/pcie.c
> +++ b/hw/pci/pcie.c
> @@ -1005,8 +1005,8 @@ bool pcie_cap_is_arifwd_enabled(const PCIDevice *dev)
>   */
>
>  /* Passing a cap_id value > 0xffff will return 0 and put end of list in prev */
> -static uint16_t pcie_find_capability_list(PCIDevice *dev, uint32_t cap_id,
> -                                          uint16_t *prev_p)
> +uint16_t pcie_find_capability_list(PCIDevice *dev, uint32_t cap_id,
> +                                   uint16_t *prev_p)
>  {
>      uint16_t prev = 0;
>      uint16_t next;
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 8b8bc5a421..e8354e8b8d 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -24,6 +24,7 @@
>  #include <sys/ioctl.h>
>
>  #include "hw/hw.h"
> +#include "hw/iommu.h"
>  #include "hw/pci/msi.h"
>  #include "hw/pci/msix.h"
>  #include "hw/pci/pci_bridge.h"
> @@ -2498,9 +2499,72 @@ static int vfio_setup_rebar_ecap(VFIOPCIDevice *vdev, uint16_t pos)
>      return 0;
>  }
>
> +/*
> + * Try to retrieve PASID capability information via IOMMUFD APIs and,
> + * if supported, synthesize a PASID PCIe extended capability for the
> + * VFIO device.
> + *
> + * The PASID capability is placed at the end of the PCIe extended
> + * configuration space. Determining the exact size of the last
> + * existing extended capability is non trivial, as PCIe extended
> + * capabilities do not generically encode their total size and
> + * vendor defined extensions are permitted.
> + *
> + * For now, reserve a fixed window (512 bytes) for the last extended
> + * capability and only insert  PASID if sufficient space remains.
> + */
> +static void vfio_pci_synthesize_pasid_cap(VFIOPCIDevice *vdev)
> +{
> +    HostIOMMUDevice *hiod = vdev->vbasedev.hiod;
> +    HostIOMMUDeviceClass *hiodc;
> +    HostIOMMUDevicePasidInfo pasid_info;
> +    PCIDevice *pdev = PCI_DEVICE(vdev);
> +    uint16_t last = 0;
> +    uint16_t pasid_offset;
> +
> +    if (vdev->vbasedev.mdev) {
> +        return;
> +    }
> +
> +    hiodc = HOST_IOMMU_DEVICE_GET_CLASS(hiod);
> +    if (!hiodc || !hiodc->get_pasid_info ||
> +        !hiodc->get_pasid_info(hiod, &pasid_info) ||
> +        !(pci_device_get_viommu_flags(pdev) & VIOMMU_FLAG_PASID_SUPPORTED)) {
> +        return;
> +    }
> +
> +    /*
> +     * Locate the last PCIe extended capability present in the device
> +     * configuration space.
> +     */
> +    pcie_find_capability_list(pdev, 0x1ffff, &last);
> +
> +    /*
> +     * Reserve space at the end of PCIe configuration space for PASID.
> +     * If the last extended capability appears too close to the end,
> +     * refuse to insert PASID.
> +     */
> +    if (last + 512 > PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF) {
> +        warn_report("vfio: no space to synthesize PASID extended capability");
> +        return;
> +    }
> +
> +    pasid_offset = PCIE_CONFIG_SPACE_SIZE - PCI_EXT_CAP_PASID_SIZEOF;
> +
> +    pcie_pasid_init(pdev, pasid_offset,
> +                    pasid_info.max_pasid_log2,
> +                    pasid_info.exec_perm,
> +                    pasid_info.priv_mod);
> +
> +    /* PASID capability is fully emulated by QEMU */
> +    memset(vdev->emulated_config_bits + pdev->exp.pasid_cap, 0xff,
> +           PCI_EXT_CAP_PASID_SIZEOF);
> +}
> +
>  static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>  {
>      PCIDevice *pdev = PCI_DEVICE(vdev);
> +    bool pasid_cap_added = false;
>      uint32_t header;
>      uint16_t cap_id, next, size;
>      uint8_t cap_ver;
> @@ -2562,6 +2626,7 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>           * accesses, exact size doesn't seem worthwhile.
>           */
>          size = vfio_ext_cap_max_size(config, next);
> +        printf("%s: Shameer: cap_id 0x%x size 0x%x\n",__func__, cap_id, size);
>
>          /* Use emulated next pointer to allow dropping extended caps */
>          pci_long_test_and_set_mask(vdev->emulated_config_bits + next,
> @@ -2578,12 +2643,24 @@ static void vfio_add_ext_cap(VFIOPCIDevice *vdev)
>                  pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>              }
>              break;
> +        /*
> +         * VFIO kernel does not expose the PASID CAP today. We may synthesize
> +         * one later through IOMMUFD APIs. If VFIO ever starts exposing it,
> +         * record its presence here so we do not create a duplicate CAP.
> +         */
> +        case PCI_EXT_CAP_ID_PASID:
> +            pasid_cap_added = true;
> +            /* fallthrough */
>          default:
>              pcie_add_capability(pdev, cap_id, cap_ver, next, size);
>          }
>
>      }
>
> +    if (!pasid_cap_added) {
> +        vfio_pci_synthesize_pasid_cap(vdev);
> +    }
> +
>      /* Cleanup chain head ID if necessary */
>      if (pci_get_word(pdev->config + PCI_CONFIG_SPACE_SIZE) == 0xFFFF) {
>          pci_set_word(pdev->config + PCI_CONFIG_SPACE_SIZE, 0);
> diff --git a/include/hw/pci/pcie.h b/include/hw/pci/pcie.h
> index 42cebcd033..1a477b05b2 100644
> --- a/include/hw/pci/pcie.h
> +++ b/include/hw/pci/pcie.h
> @@ -130,6 +130,8 @@ bool pcie_cap_is_arifwd_enabled(const PCIDevice *dev);
>
>  /* PCI express extended capability helper functions */
>  uint16_t pcie_find_capability(PCIDevice *dev, uint16_t cap_id);
> +uint16_t pcie_find_capability_list(PCIDevice *dev, uint32_t cap_id,
> +                                   uint16_t *prev_p);
>  void pcie_add_capability(PCIDevice *dev,
>                           uint16_t cap_id, uint8_t cap_ver,
>                           uint16_t offset, uint16_t size);
> @@ -138,6 +140,7 @@ void pcie_sync_bridge_lnk(PCIDevice *dev);
>  void pcie_acs_init(PCIDevice *dev, uint16_t offset);
>  void pcie_acs_reset(PCIDevice *dev);
>
> +uint16_t pcie_get_ext_cap_next_offset(PCIDevice *pdev, uint16_t cap_size);
>  void pcie_ari_init(PCIDevice *dev, uint16_t offset);
>  void pcie_dev_ser_num_init(PCIDevice *dev, uint16_t offset, uint64_t ser_num);
>  void pcie_ats_init(PCIDevice *dev, uint16_t offset, bool aligned);
>
>
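
For reference, the placement rule in the quoted patch reduces to simple arithmetic. The following is a minimal standalone sketch, not the QEMU code: the constants restate what PCIE_CONFIG_SPACE_SIZE (0x1000) and PCI_EXT_CAP_PASID_SIZEOF (8) are assumed to be, and the function name is hypothetical.

```c
#include <stdint.h>

#define CFG_SPACE_SIZE   0x1000 /* assumed value of PCIE_CONFIG_SPACE_SIZE */
#define PASID_CAP_SIZE   8      /* assumed value of PCI_EXT_CAP_PASID_SIZEOF */
#define LAST_CAP_WINDOW  512    /* fixed window reserved for the last cap */

/*
 * Hypothetical helper: given the offset of the last extended capability,
 * return the offset chosen for the synthesized PASID capability, or 0
 * when the last capability is considered too close to the end.
 */
static uint16_t pasid_cap_offset(uint16_t last)
{
    uint16_t pasid_offset = CFG_SPACE_SIZE - PASID_CAP_SIZE; /* 0xff8 */

    if (last + LAST_CAP_WINDOW > pasid_offset) {
        return 0; /* refuse: reserved window would overlap the PASID slot */
    }
    return pasid_offset;
}
```

Under these assumptions, a last capability at or below offset 0xdf8 still allows placement at 0xff8, and anything above it is refused.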

RE: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM

Posted by Shameer Kolothum 1 month ago

Hi Eric,

> -----Original Message-----
> From: Eric Auger <eric.auger@redhat.com>
> Sent: 06 January 2026 10:55
> To: Shameer Kolothum <skolothumtho@nvidia.com>; Yi Liu
> <yi.l.liu@intel.com>; qemu-arm@nongnu.org; qemu-devel@nongnu.org
> Cc: peter.maydell@linaro.org; Jason Gunthorpe <jgg@nvidia.com>; Nicolin
> Chen <nicolinc@nvidia.com>; ddutile@redhat.com; berrange@redhat.com;
> Nathan Chen <nathanc@nvidia.com>; Matt Ochs <mochs@nvidia.com>;
> smostafa@google.com; wangzhou1@hisilicon.com;
> jiangkunkun@huawei.com; jonathan.cameron@huawei.com;
> zhangfei.gao@linaro.org; zhenzhong.duan@intel.com; Krishnakant Jaju
> <kjaju@nvidia.com>; alex@shazbot.org
> Subject: Re: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM
> 
> Hi Shameer,
[...]

> >>> Besides the fact the offset is arbitrarily chosen so that this is the
> >>> last cap of the vconfig space, the code looks good to me.
> >>> So
> >>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> >>>
> >>> Just wondering whether we couldn't add some generic pcie code that
> >>> parses the extended cap linked list to check the offset range is not
> >>> used by another cap before allowing the insertion at a given offset?
> >>> This wouldn't prevent a subsequent addition from failing but at least we
> >>> would know if there is some collision. This could be added later on though.
> >>>
> >> You're absolutely right. My approach of using the last 8 bytes was a
> >> shortcut to avoid implementing proper capability parsing logic
> >> (importing pci_regs.h and maintaining a cap_id-to-cap_size mapping
> >> table), and it simplified PASID capability detection by only examining
> >> the last 8 bytes by a simple dump :(. However, this approach is not
> >> good as we cannot guarantee that the last 8 bytes are unused by any
> >> device.
> >>
> >> Let's just implement the logic to walk the linked list of ext_caps to
> >> find an appropriate offset for our use case.
> > I had a go at this. Based on my understanding, even if we walk the PCIe
> > extended capability linked list, we still can't easily determine the size
> > occupied by the last capability as the extended capability header does not
> > encode a length, it only provides the "next" pointer, and for the last entry
> > next == 0.
> If my understanding is correct, when walking the linked list you can
> enumerate the start index and the variable size of each PCIe extended
> capability (a fixed-size header plus a register block whose size
> depends on the capability ID). After that we should be able to allocate a
> slot within holes, or at least check that adding the new cap at the end
> of the 4kB is safe, no? What am I missing?

I think the main issue is that we can't know whether the apparent "holes"
between extended capabilities are actually free. Depending on the vendor
implementation, those regions may be reserved or used for vendor-specific
purposes, and I am not sure (please correct me) that the PCIe spec guarantees
such gaps are available for reuse. Hence I thought of relying on the “next”
pointer as a safe bet.

Even if we look at the last CAP ID and derive a size based on the
spec-defined register layout, we still can't know whether there is
any additional vendor-specific data beyond that "size". It is still
a best guess, and I don't think we gain much by adding this additional
check.
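
To illustrate the point about the header, here is a minimal sketch of walking the extended capability list over a raw 4 KiB config space image. It is not QEMU's pcie_find_capability_list(); the helper names are hypothetical and the image is assumed to be little-endian. The header carries only an ID, a version and a "next" pointer, so the walk yields where the last capability starts, not where it ends.

```c
#include <stddef.h>
#include <stdint.h>

#define CFG_SPACE_SIZE  0x1000 /* PCIe extended config space: 4 KiB */
#define EXT_CAP_START   0x100  /* first extended capability header */

/* Read a little-endian 32-bit value from the config image. */
static uint32_t cfg_read32(const uint8_t *cfg, uint16_t off)
{
    return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
           ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/* Illustration helper: write header = ID[15:0], version[19:16], next[31:20]. */
static void cfg_write_ext_hdr(uint8_t *cfg, uint16_t off, uint16_t id,
                              uint8_t ver, uint16_t next)
{
    uint32_t h = (uint32_t)id | ((uint32_t)(ver & 0xf) << 16) |
                 ((uint32_t)(next & 0xffc) << 20);

    cfg[off] = h & 0xff;
    cfg[off + 1] = (h >> 8) & 0xff;
    cfg[off + 2] = (h >> 16) & 0xff;
    cfg[off + 3] = (h >> 24) & 0xff;
}

/*
 * Return the start offset of the last extended capability in the chain,
 * or 0 if no extended capability is present. Note that nothing in the
 * header tells us how many bytes that last capability occupies.
 */
static uint16_t ext_cap_last_offset(const uint8_t *cfg)
{
    uint16_t off = EXT_CAP_START, last = 0;
    int budget = (CFG_SPACE_SIZE - EXT_CAP_START) / 4; /* guards against loops */

    while (off >= EXT_CAP_START && off <= CFG_SPACE_SIZE - 4 && budget-- > 0) {
        uint32_t hdr = cfg_read32(cfg, off);

        if (hdr == 0 || hdr == 0xffffffff) {
            break; /* empty list or unimplemented space */
        }
        last = off;
        off = (hdr >> 20) & 0xffc; /* next == 0 terminates the walk */
    }
    return last;
}
```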

Perhaps we could inform the user that we are placing
the PASID at the last offset and the onus is on the user to make sure
it is safe to do so.

Thoughts?

Thanks,
Shameer

Re: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM

Posted by Eric Auger 1 month ago


On 1/6/26 12:38 PM, Shameer Kolothum wrote:
> Hi Eric,
>
>> Hi Shameer,
> [...]
>
>>>>> Besides the fact the offset is arbitrarily chosen so that this is the
>>>>> last cap of the vconfig space, the code looks good to me.
>>>>> So
>>>>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
>>>>>
>>>>> Just wondering whether we couldn't add some generic pcie code that
>>>>> parses the extended cap linked list to check the offset range is not
>>>>> used by another cap before allowing the insertion at a given offset?
>>>>> This wouldn't prevent a subsequent addition from failing but at least we
>>>>> would know if there is some collision. This could be added later on though.
>>>>>
>>>> You're absolutely right. My approach of using the last 8 bytes was a
>>>> shortcut to avoid implementing proper capability parsing logic
>>>> (importing pci_regs.h and maintaining a cap_id-to-cap_size mapping
>>>> table), and it simplified PASID capability detection by only examining
>>>> the last 8 bytes by a simple dump :(. However, this approach is not
>>>> good as we cannot guarantee that the last 8 bytes are unused by any
>>>> device.
>>>>
>>>> Let's just implement the logic to walk the linked list of ext_caps to
>>>> find an appropriate offset for our use case.
>>> I had a go at this. Based on my understanding, even if we walk the PCIe
>>> extended capability linked list, we still can't easily determine the size
>>> occupied by the last capability as the extended capability header does not
>>> encode a length, it only provides the "next" pointer, and for the last entry
>>> next == 0.
>> If my understanding is correct, when walking the linked list you can
>> enumerate the start index and the variable size of each PCIe extended
>> capability (a fixed-size header plus a register block whose size
>> depends on the capability ID). After that we should be able to allocate a
>> slot within holes, or at least check that adding the new cap at the end
>> of the 4kB is safe, no? What am I missing?
> I think the main issue is that we can't know whether the apparent "holes"
> between extended capabilities are actually free. Depending on the vendor
> implementation, those regions may be reserved or used for vendor-specific
> purposes, and I am not sure (please correct me) that the PCIe spec guarantees
> such gaps are available for reuse. Hence I thought of relying on the “next”
> pointer as a safe bet.
>
> Even if we look at the last CAP ID and derive a size based on the
> spec-defined register layout, we still can't know whether there is
> any additional vendor-specific data beyond that "size". It is still
> a best guess, and I don't think we gain much by adding this additional
> check.

Ah OK, I see what you mean (you may have discussed that earlier in other
threads, sorry). So there may be vendor-specific private data in the
holes. In that case I guess we cannot do much :-/
>
> Perhaps we could inform the user that we are placing
> the PASID at the last offset and the onus is on the user to make sure
> it is safe to do so.
Or another solution is to let the user opt in to this hazardous
placement using an explicit x- prefixed option? Dunno

Thanks

Eric
>
> Thoughts?
>
> Thanks,
> Shameer
>

Re: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM

Posted by Alex Williamson 1 month ago

On Tue, 6 Jan 2026 14:22:57 +0100
Eric Auger <eric.auger@redhat.com> wrote:

> On 1/6/26 12:38 PM, Shameer Kolothum wrote:
> > Hi Eric,
> >  
> >> Hi Shameer,  
> > [...]
> >  
> >>>>> Besides the fact the offset is arbitrarily chosen so that this is the
> >>>>> last cap of the vconfig space, the code looks good to me.
> >>>>> So
> >>>>> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> >>>>>
> >>>>> Just wondering whether we couldn't add some generic pcie code that
> >>>>> parses the extended cap linked list to check the offset range is not
> >>>>> used by another cap before allowing the insertion at a given offset?
> >>>>> This wouldn't prevent a subsequent addition from failing but at least we
> >>>>> would know if there is some collision. This could be added later on though.
> >>>>>  
> >>>> You're absolutely right. My approach of using the last 8 bytes was a
> >>>> shortcut to avoid implementing proper capability parsing logic
> >>>> (importing pci_regs.h and maintaining a cap_id-to-cap_size mapping
> >>>> table), and it simplified PASID capability detection by only examining
> >>>> the last 8 bytes by a simple dump :(. However, this approach is not
> >>>> good as we cannot guarantee that the last 8 bytes are unused by any
> >>>> device.
> >>>>
> >>>> Let's just implement the logic to walk the linked list of ext_caps to
> >>>> find an appropriate offset for our use case.  
> >>> I had a go at this. Based on my understanding, even if we walk the PCIe
> >>> extended capability linked list, we still can't easily determine the size
> >>> occupied by the last capability as the extended capability header does not
> >>> encode a length, it only provides the "next" pointer, and for the last entry
> >>> next == 0.  
> >> If my understanding is correct, when walking the linked list you can
> >> enumerate the start index and the variable size of each PCIe extended
> >> capability (a fixed-size header plus a register block whose size
> >> depends on the capability ID). After that we should be able to allocate a
> >> slot within holes, or at least check that adding the new cap at the end
> >> of the 4kB is safe, no? What am I missing?
> > I think the main issue is that we can't know whether the apparent "holes"
> > between extended capabilities are actually free. Depending on the vendor
> > implementation, those regions may be reserved or used for vendor-specific
> > purposes, and I am not sure (please correct me) that the PCIe spec guarantees
> > such gaps are available for reuse. Hence I thought of relying on the “next”
> > pointer as a safe bet.
> >
> > Even if we look at the last CAP ID and derive a size based on the
> > spec-defined register layout, we still can't know whether there is
> > any additional vendor-specific data beyond that "size". It is still
> > a best guess, and I don't think we gain much by adding this additional
> > check.
> 
> Ah OK, I see what you mean (you may have discussed that earlier in other
> threads, sorry). So there may be vendor-specific private data in the
> holes. In that case I guess we cannot do much :-/

Also, we can only know the size of capabilities that are currently
defined, and we don't do a great job of keeping up with the latest ECNs.

Unless we have device-specific knowledge, the best we can do is hope
that a gap between capabilities is unused.  It might be a helpful
indicator to verify that the config space we intend to overlap is zero,
though we can get false positives with such a method if we overlap a
capability that kernel vfio-pci has disconnected from the capability
chain and marked read-only.
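
A minimal sketch of such a zero check, assuming a snapshot of the device config space is available as a byte buffer (the function name and the buffer are hypothetical, not existing vfio code). As noted above, an all-zero result is only a hint: a hidden read-only capability could also read back as zero.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if cfg[off .. off + len) lies inside the config image and
 * every byte in it reads as zero. A true result does not prove that the
 * range is free, only that nothing visible sits there.
 */
static bool cfg_range_is_zero(const uint8_t *cfg, size_t cfg_size,
                              size_t off, size_t len)
{
    if (off > cfg_size || len > cfg_size - off) {
        return false; /* range falls outside the config image */
    }
    for (size_t i = 0; i < len; i++) {
        if (cfg[off + i] != 0) {
            return false;
        }
    }
    return true;
}
```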

> >
> > Perhaps we could inform the user that we are placing
> > the PASID at the last offset and the onus is on the user to make sure
> > it is safe to do so.
> Or another solution is to let the user opt in to this hazardous
> placement using an explicit x- prefixed option? Dunno

Yeah, this is probably one of those cases where we expect we don't have
a foolproof solution and we should allow an override.  I think we're
defining the initial 'auto' algorithm that a -x-vpasid_cap_offset=
might use by default.  It should also take a numerical value as an
override though.  We'll need a table so that when we find a device that
requires a different value, that becomes the auto value for that
device.
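
One way to picture that precedence (explicit override, then per-device table, then the generic 'auto' result) is the sketch below. Everything in it is hypothetical: the sentinel, the table layout, and the vendor/device IDs are placeholders rather than real devices or existing QEMU code.

```c
#include <stddef.h>
#include <stdint.h>

#define VPASID_OFFSET_AUTO 0 /* hypothetical sentinel: no user override */

typedef struct {
    uint16_t vendor;
    uint16_t device;
    uint16_t offset; /* per-device 'auto' value */
} VcapQuirk;

/* Placeholder entry for illustration only, not a real device. */
static const VcapQuirk vcap_quirks[] = {
    { 0x1234, 0x5678, 0xe00 },
};

/* Resolve the offset: user value wins, then the table, then the default. */
static uint16_t vpasid_resolve_offset(uint16_t user, uint16_t vendor,
                                      uint16_t device, uint16_t auto_default)
{
    if (user != VPASID_OFFSET_AUTO) {
        return user;
    }
    for (size_t i = 0; i < sizeof(vcap_quirks) / sizeof(vcap_quirks[0]); i++) {
        if (vcap_quirks[i].vendor == vendor &&
            vcap_quirks[i].device == device) {
            return vcap_quirks[i].offset;
        }
    }
    return auto_default;
}
```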

Also consider how adding a subsequent purely virtual capability would
work.  We don't want to have each capability defining a new algorithm
and default offset.  To that end, we might be better served with a
command line override that specifies available ranges of config space
rather than an offset for a specific vcap.  I think our discussions
related to a kernel interface were headed more in this direction.
Thanks,

Alex
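
A range-based scheme along those lines could look roughly like the following first-fit sketch. It assumes the user has supplied a list of config space ranges known to be free; the type and function names are hypothetical and no such command line option exists in the text above.

```c
#include <stddef.h>
#include <stdint.h>

/* A free region of config space, half-open: [start, end). */
typedef struct {
    uint16_t start;
    uint16_t end;
} CfgRange;

/*
 * First-fit allocation of a dword-aligned slot for a virtual capability.
 * Returns the chosen offset and shrinks the range, or returns 0 if no
 * range can hold 'size' bytes. Each new vcap reuses the same allocator
 * instead of defining its own placement algorithm.
 */
static uint16_t vcap_alloc(CfgRange *ranges, size_t n, uint16_t size)
{
    if (size == 0) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        uint32_t start = (ranges[i].start + 3u) & ~3u; /* dword align */

        if (start + size <= ranges[i].end) {
            ranges[i].start = (uint16_t)(start + size);
            return (uint16_t)start;
        }
    }
    return 0;
}
```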

Re: [PATCH v6 32/33] vfio: Synthesize vPASID capability to VM

Posted by Nicolin Chen 2 months, 2 weeks ago

On Thu, Nov 20, 2025 at 01:22:12PM +0000, Shameer Kolothum wrote:
> From: Yi Liu <yi.l.liu@intel.com>
> 
> If user wants to expose PASID capability in vIOMMU, then VFIO would also
> need to report the PASID cap for this device if the underlying hardware
> supports it as well.
> 
> As a start, this chooses to put the vPASID cap in the last 8 bytes of the
> vconfig space. This is a choice in the good hope of no conflict with any
> existing cap or hidden registers. For devices that have hidden registers,
> the user should figure out a proper offset for the vPASID cap. This may require
> an option for the user to configure it. Here we leave it as a future extension.
> There are more discussions on the mechanism of finding the proper offset.
> 
> https://lore.kernel.org/kvm/BN9PR11MB5276318969A212AD0649C7BE8CBE2@BN9PR11MB5276.namprd11.prod.outlook.com/
> 
> Since we add a check to ensure the vIOMMU supports PASID, only devices
> under those vIOMMUs can synthesize the vPASID capability. This gives
> users control over which devices expose vPASID.
> 
> Signed-off-by: Yi Liu <yi.l.liu@intel.com>
> Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Shameer Kolothum <skolothumtho@nvidia.com>

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>