Hi Cédric,

>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v9 00/19] intel_iommu: Enable first stage translation for
>passthrough device
>
>Hello Zhenzhong
>
>On 12/15/25 07:50, Zhenzhong Duan wrote:
>> Hi,
>>
>> Based on Cédric's suggestions[1], the nesting series v8 is split into a
>> "base nesting series" + "ERRATA_772415_SPR17 quirk series"; this is the
>> base nesting series.
>>
>> For a passthrough device with intel_iommu.x-flts=on, we don't do shadowing
>> of the guest page table but pass the first stage page table to the host
>> side to construct a nested HWPT. There was some effort to enable this
>> feature in the past; see [2] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation (also known
>> as IOMMU nested translation) capability in the host IOMMU. As the diagram
>> below shows, the guest I/O page table pointer in GPA (guest physical
>> address) is passed to the host and used to perform the first stage address
>> translation. Along with it, modifications to present mappings in the guest
>> I/O page table should be followed by an IOTLB invalidation.
>>
>>     .-------------.  .---------------------------.
>>     |   vIOMMU    |  | Guest I/O page table      |
>>     |             |  '---------------------------'
>>     .----------------/
>>     | PASID Entry |--- PASID cache flush --+
>>     '-------------'                        |
>>     |             |                        V
>>     |             |           I/O page table pointer in GPA
>>     '-------------'
>> Guest
>> ------| Shadow |---------------------------|--------
>>       v        v                           v
>> Host
>>     .-------------.  .-----------------------------.
>>     |   pIOMMU    |  | First stage for GIOVA->GPA  |
>>     |             |  '-----------------------------'
>>     .----------------/  |
>>     | PASID Entry |     V (Nested xlate)
>>     '----------------\.--------------------------------------------.
>>     |             |   | Second stage for GPA->HPA, unmanaged domain|
>>     |             |   '--------------------------------------------'
>>     '-------------'
>>           <Intel VT-d Nested translation>
>>
>> This series reuses the VFIO device's default HWPT as the nesting parent
>> instead of creating a new one. This avoids duplicating the code of a new
>> memory listener; all existing features of the VFIO listener can be shared,
>> e.g., ram discard, dirty tracking, etc. Two limitations are: 1) no support
>> for a VFIO device under a PCI bridge together with an emulated device,
>> because the emulated device wants the IOMMU AS while the VFIO device
>> sticks to the system AS; 2) no support for kexec or reboot from
>> "intel_iommu=on,sm_on" to "intel_iommu=on,sm_off" on platforms with
>> ERRATA_772415_SPR17, because the VFIO device's default HWPT is created
>> with the NEST_PARENT flag and the kernel inhibits RO mappings when
>> switching to shadow mode.
>>
>> This series is also a prerequisite work for vSVA, i.e., sharing guest
>> application address space with passthrough devices.
>>
>> There are some interactions between VFIO and vIOMMU:
>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to the PCI
>>   subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
>>   instance to the vIOMMU at vfio device realize stage.
>> * vIOMMU registers PCIIOMMUOps get_viommu_flags to the PCI subsystem.
>>   VFIO calls it to get the vIOMMU exposed flags.
>> * vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>   to bind/unbind the device to IOMMUFD backed domains, either nested
>>   domains or not.
>>
>> See the diagram below:
>>
>>       VFIO Device                                 Intel IOMMU
>>  .-----------------.                         .-------------------.
>>  |                 |                         |                   |
>>  |       .---------|PCIIOMMUOps              |.-------------.    |
>>  |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
>>  |       | Device  |------------------------>|| Device list |    |
>>  |       .---------|(get_viommu_flags)       |.-------------.    |
>>  |                 |                         |       |           |
>>  |                 |                         |       V           |
>>  |       .---------| HostIOMMUDeviceIOMMUFD  | .-------------.   |
>>  |       | IOMMUFD |            (attach_hwpt)| | Host IOMMU  |   |
>>  |       | link    |<------------------------| |   Device    |   |
>>  |       .---------|            (detach_hwpt)| .-------------.   |
>>  |                 |                         |        |          |
>>  |                 |                         |       ...         |
>>  .-----------------.                         .-------------------.
>>
>> Below is an example to enable first stage translation for a passthrough
>> device:
>>
>>     -M q35,...
>>     -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>     -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>
>What about libvirt support ? There are patches to enable IOMMUFD
>support with device assignment but I don't see anything related
>to first stage translation. Is there a plan ?

I think IOMMUFD support in libvirt is non-trivial; good to know there is
progress. But I didn't find a match in the libvirt mailing list,
https://lists.libvirt.org/archives/search?q=iommufd
Do you have a link?

I think first stage support is trivial: it only needs to support a new
property <...x-flts=on/off>. I can ask my manager for some time to work on
it after this series is merged. Anyone who is interested is also welcome to
take it.

>
>This raises a question. Should flts support be automatically enabled
>based on the availability of an IOMMUFD backend ?

Yes, if the user doesn't force it off, like <...iommufd='off'>, and an
IOMMUFD backend is available, we can enable it automatically.

>
>>
>> Test done:
>> - VFIO devices hotplug/unplug
>> - different VFIO devices linked to different iommufds
>> - vhost net device ping test
>> - migration with QAT passthrough
>
>Did you do any experiments with active mlx5 VFs ?

No, there are only a few device drivers supporting VFIO migration and we
only have QAT. Let me know if you see issues on other devices.

Thanks
Zhenzhong
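To make the VFIO <-> vIOMMU handshake described in the cover letter above
easier to follow, here is a minimal, self-contained C sketch of that flow.
The structures, flag and signatures are simplified stand-ins named after the
callbacks mentioned above (set/unset_iommu_device, get_viommu_flags,
attach/detach_hwpt); they are illustrative only and do not match the actual
QEMU definitions in the series.

    /* Toy model of the VFIO <-> vIOMMU interaction described above.
     * All types, names and signatures are illustrative stand-ins, not the
     * actual QEMU definitions from the series. */
    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for HostIOMMUDeviceIOMMUFD: knows how to (de)attach a HWPT. */
    typedef struct HostIOMMUDevice {
        const char *name;
        bool (*attach_hwpt)(struct HostIOMMUDevice *dev, uint32_t hwpt_id);
        void (*detach_hwpt)(struct HostIOMMUDevice *dev);
    } HostIOMMUDevice;

    /* Stand-in for PCIIOMMUOps: registered by the vIOMMU, called by VFIO. */
    typedef struct PCIIOMMUOps {
        bool (*set_iommu_device)(int devfn, HostIOMMUDevice *dev);
        void (*unset_iommu_device)(int devfn);
        uint64_t (*get_viommu_flags)(void);
    } PCIIOMMUOps;

    #define VIOMMU_FLAG_FLTS (1ULL << 0)   /* hypothetical "first stage" flag */

    static HostIOMMUDevice *viommu_devices[256];   /* vIOMMU's per-devfn list */

    static bool vtd_set_iommu_device(int devfn, HostIOMMUDevice *dev)
    {
        viommu_devices[devfn] = dev;       /* vIOMMU records the host device */
        return true;
    }

    static void vtd_unset_iommu_device(int devfn)
    {
        viommu_devices[devfn] = NULL;
    }

    static uint64_t vtd_get_viommu_flags(void)
    {
        return VIOMMU_FLAG_FLTS;           /* x-flts=on: first stage exposed */
    }

    static const PCIIOMMUOps vtd_iommu_ops = {
        .set_iommu_device   = vtd_set_iommu_device,
        .unset_iommu_device = vtd_unset_iommu_device,
        .get_viommu_flags   = vtd_get_viommu_flags,
    };

    /* What the host side of a VFIO device would do, very roughly. */
    static bool host_attach_hwpt(HostIOMMUDevice *dev, uint32_t hwpt_id)
    {
        printf("%s: attached to hwpt %" PRIu32 "\n", dev->name, hwpt_id);
        return true;
    }

    static void host_detach_hwpt(HostIOMMUDevice *dev)
    {
        printf("%s: detached\n", dev->name);
    }

    int main(void)
    {
        HostIOMMUDevice vf = { "0000:3b:00.1", host_attach_hwpt, host_detach_hwpt };
        int devfn = 0x10;

        /* VFIO realize: register the host device and query the vIOMMU flags. */
        vtd_iommu_ops.set_iommu_device(devfn, &vf);
        if (vtd_iommu_ops.get_viommu_flags() & VIOMMU_FLAG_FLTS) {
            /* vIOMMU binds the device to a nested (first stage) HWPT. */
            viommu_devices[devfn]->attach_hwpt(viommu_devices[devfn], 42);
        }

        /* Teardown: unbind from the HWPT and unregister the device. */
        viommu_devices[devfn]->detach_hwpt(viommu_devices[devfn]);
        vtd_iommu_ops.unset_iommu_device(devfn);
        return 0;
    }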
Hello Zhenzhong,

On 12/16/25 04:24, Duan, Zhenzhong wrote:
> Hi Cédric,
>
>> -----Original Message-----
>> From: Cédric Le Goater <clg@redhat.com>
>> Subject: Re: [PATCH v9 00/19] intel_iommu: Enable first stage translation for
>> passthrough device
>>
>> Hello Zhenzhong
>>
>> On 12/15/25 07:50, Zhenzhong Duan wrote:
>>> Hi,

...

>>> Below is an example to enable first stage translation for a passthrough
>>> device:
>>>
>>>     -M q35,...
>>>     -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>>     -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>
>> What about libvirt support ? There are patches to enable IOMMUFD
>> support with device assignment but I don't see anything related
>> to first stage translation. Is there a plan ?
>
> I think IOMMUFD support in libvirt is non-trivial; good to know there is
> progress. But I didn't find a match in the libvirt mailing list,
> https://lists.libvirt.org/archives/search?q=iommufd
> Do you have a link?

Here :

  https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/KFYUQGMXWV64QPI245H66GKRNAYL7LGB/

There might be an update. We should ask Nathan.

> I think first stage support is trivial: it only needs to support a new
> property <...x-flts=on/off>. I can ask my manager for some time to work on
> it after this series is merged. Anyone who is interested is also welcome to
> take it.

ok. So, currently, we have no way to benefit from translation
acceleration on the host unless we directly set the 'x-flts'
property on the QEMU command line.

>> This raises a question. Should flts support be automatically enabled
>> based on the availability of an IOMMUFD backend ?
>
> Yes, if the user doesn't force it off, like <...iommufd='off'>, and an
> IOMMUFD backend is available, we can enable it automatically.

The plan is to keep VFIO IOMMU Type1 as the default host IOMMU
backend to maintain a consistent behavior. If an IOMMUFD backend
is required, it should be set explicitly. One day we might revisit
this choice and change the default. Not yet.

>>> Test done:
>>> - VFIO devices hotplug/unplug
>>> - different VFIO devices linked to different iommufds
>>> - vhost net device ping test
>>> - migration with QAT passthrough
>>
>> Did you do any experiments with active mlx5 VFs ?
>
> No, there are only a few device drivers supporting VFIO migration and we
> only have QAT. Let me know if you see issues on other devices.

Since we lack libvirt integration (of flts), the tests need
to be run manually, which is more complex for QE. IOW, it will
take more time but we should definitely evaluate other devices.

Thanks,

C.
>-----Original Message-----
>From: Cédric Le Goater <clg@redhat.com>
>Subject: Re: [PATCH v9 00/19] intel_iommu: Enable first stage translation for
>passthrough device

...

>>>> Below is an example to enable first stage translation for a passthrough
>>>> device:
>>>>
>>>>     -M q35,...
>>>>     -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>>>     -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>>
>>> What about libvirt support ? There are patches to enable IOMMUFD
>>> support with device assignment but I don't see anything related
>>> to first stage translation. Is there a plan ?
>>
>> I think IOMMUFD support in libvirt is non-trivial; good to know there is
>> progress. But I didn't find a match in the libvirt mailing list,
>> https://lists.libvirt.org/archives/search?q=iommufd
>> Do you have a link?
>
>Here :
>
>https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/KFYUQGMXWV64QPI245H66GKRNAYL7LGB/

Thanks

>
>There might be an update. We should ask Nathan.
>
>> I think first stage support is trivial: it only needs to support a new
>> property <...x-flts=on/off>. I can ask my manager for some time to work on
>> it after this series is merged. Anyone who is interested is also welcome to
>> take it.
>
>ok. So, currently, we have no way to benefit from translation
>acceleration on the host unless we directly set the 'x-flts'
>property on the QEMU command line.

Yes, thanks for the reminder. I'll try to add 'x-flts' support to libvirt to
fill the gap soon. I will take a one-week vacation starting this Friday, so I
may try it after the vacation.

>
>>> This raises a question. Should flts support be automatically enabled
>>> based on the availability of an IOMMUFD backend ?
>>
>> Yes, if the user doesn't force it off, like <...iommufd='off'>, and an
>> IOMMUFD backend is available, we can enable it automatically.
>
>The plan is to keep VFIO IOMMU Type1 as the default host IOMMU
>backend to maintain a consistent behavior. If an IOMMUFD backend
>is required, it should be set explicitly. One day we might revisit
>this choice and change the default. Not yet.

OK, maybe we need to maintain consistent behavior for intel_iommu too: if
first stage is required, it should be set explicitly; if it is not set,
default to the second stage (shadow page table).

>
>
>>>> Test done:
>>>> - VFIO devices hotplug/unplug
>>>> - different VFIO devices linked to different iommufds
>>>> - vhost net device ping test
>>>> - migration with QAT passthrough
>>>
>>> Did you do any experiments with active mlx5 VFs ?
>>
>> No, there are only a few device drivers supporting VFIO migration and we
>> only have QAT. Let me know if you see issues on other devices.
>
>Since we lack libvirt integration (of flts), the tests need
>to be run manually, which is more complex for QE. IOW, it will
>take more time but we should definitely evaluate other devices.

Oh, if you mean nesting feature tests, we did play with the different devices
we had: ixgbevf, ICE VF, DSA and QAT. For VFIO migration with nesting, we only
tested QAT.

Thanks
Zhenzhong
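For context on the libvirt gap discussed above: libvirt already exposes the
intel-iommu knobs through the <iommu> device element of the domain XML, so a
first stage switch would presumably become one more <driver> attribute there.
The flts attribute in the sketch below is purely hypothetical (no such
attribute exists in libvirt today); it is shown only to illustrate the shape
of the change being discussed, next to existing attributes such as intremap,
caching_mode and aw_bits.

    <!-- Sketch only: intremap/caching_mode/aw_bits are existing libvirt
         driver attributes; flts='on' is a hypothetical attribute showing
         what x-flts support in libvirt might look like. -->
    <devices>
      <iommu model='intel'>
        <driver intremap='on' caching_mode='on' aw_bits='48' flts='on'/>
      </iommu>
    </devices>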