Hi,
For a passthrough device with intel_iommu.x-flts=on, we don't shadow the
guest page table but pass the stage-1 page table to the host side to
construct a nested domain. There was an earlier effort to enable this
feature, see [1] for details.
The key design is to utilize the dual-stage IOMMU translation (also known as
IOMMU nested translation) capability of the host IOMMU. As the diagram below
shows, the guest I/O page table pointer, a GPA (guest physical address), is
passed to the host and used to perform the stage-1 address translation. Along
with it, modifications to present mappings in the guest I/O page table should
be followed by an IOTLB invalidation.
        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .------------------------.
        |   pIOMMU    |  | Stage1 for GIOVA->GPA  |
        |             |  '------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.--------------------------------------.
        |             |   | Stage2 for GPA->HPA, unmanaged domain|
        |             |   '--------------------------------------'
        '-------------'
For historical reasons, the naming differs across VT-d spec revisions,
where:
- Stage1 = First stage = First level = flts
- Stage2 = Second stage = Second level = slts
<Intel VT-d Nested translation>
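
The two-stage lookup in the diagram can be illustrated with a small
standalone model. This is only a sketch under stated assumptions: every
name below (ToyStage, toy_nested_xlate, etc.) is hypothetical and not part
of QEMU or this series, each stage is a flat list of page mappings rather
than a real multi-level page table, and the fact that hardware also
translates the stage-1 table walk itself through stage 2 is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((1ULL << TOY_PAGE_SHIFT) - 1)

/* Hypothetical toy mapping: one page-aligned input page -> output page. */
typedef struct ToyMapping {
    uint64_t in_page;
    uint64_t out_page;
} ToyMapping;

/* Hypothetical toy translation stage: a flat array of page mappings. */
typedef struct ToyStage {
    const ToyMapping *maps;
    size_t nr;
} ToyStage;

/* Translate one address through a single stage; false means "fault". */
static bool toy_stage_xlate(const ToyStage *st, uint64_t in, uint64_t *out)
{
    uint64_t page = in & ~TOY_PAGE_MASK;

    assert(st != NULL && out != NULL);
    for (size_t i = 0; i < st->nr; i++) {
        if (st->maps[i].in_page == page) {
            /* Keep the offset within the page, swap the page frame. */
            *out = st->maps[i].out_page | (in & TOY_PAGE_MASK);
            return true;
        }
    }
    return false;
}

/*
 * Nested translation: stage 1 (guest-owned, GIOVA->GPA) followed by
 * stage 2 (host-owned, GPA->HPA). A miss in either stage is a fault.
 */
static bool toy_nested_xlate(const ToyStage *s1, const ToyStage *s2,
                             uint64_t giova, uint64_t *hpa)
{
    uint64_t gpa;

    return toy_stage_xlate(s1, giova, &gpa) &&
           toy_stage_xlate(s2, gpa, hpa);
}
```

In this model the guest only ever edits the stage-1 array and the host only
the stage-2 array, which mirrors the ownership split that nesting relies on.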
This series reuses the VFIO device's default hwpt as the nested parent
instead of creating a new one. This avoids duplicating the code of a new
memory listener, and all existing features of the VFIO listener can be
shared, e.g., ram discard, dirty tracking, etc. Two limitations are:
1) a VFIO device under a PCI bridge together with an emulated device is not
supported, because the emulated device wants the IOMMU AS while the VFIO
device sticks to the system AS; 2) kexec or reboot from
"intel_iommu=on,sm_on" to "intel_iommu=on,sm_off" is not supported, because
the VFIO device's default hwpt is created with the NEST_PARENT flag and the
kernel inhibits RO mappings when switching to shadow mode.
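
The series handles the RO mapping restriction with a bypass_ro element in
VFIOContainerBase (see the ERRATA_772415_SPR17 workaround patches). A
minimal sketch of the idea, assuming a listener that simply skips read-only
sections when that flag is set; the types and helpers here (ToySection,
ToyContainer, toy_section_is_mapped, toy_count_mapped) are hypothetical
stand-ins and not the actual VFIO listener code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a guest memory section seen by a listener. */
typedef struct ToySection {
    uint64_t gpa;
    uint64_t size;
    bool readonly;
} ToySection;

/* Hypothetical stand-in for the container; only the flag matters here. */
typedef struct ToyContainer {
    bool bypass_ro;
} ToyContainer;

/* A section is mapped unless it is read-only and bypass_ro is set. */
static bool toy_section_is_mapped(const ToyContainer *c, const ToySection *s)
{
    assert(c != NULL && s != NULL);
    return !(c->bypass_ro && s->readonly);
}

/* Count how many sections would end up mapped into the default hwpt. */
static size_t toy_count_mapped(const ToyContainer *c,
                               const ToySection *secs, size_t nr)
{
    size_t mapped = 0;

    for (size_t i = 0; i < nr; i++) {
        if (toy_section_is_mapped(c, &secs[i])) {
            mapped++;
        }
    }
    return mapped;
}
```

Whether the real listener filters at exactly this point is an assumption of
the sketch; the patches themselves are the reference.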
This series is also a prerequisite for vSVA, i.e. sharing guest
application address space with passthrough devices.
There are some interactions between VFIO and vIOMMU:
* vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device with the PCI
subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
instance with the vIOMMU at the VFIO device realize stage.
* vIOMMU registers PCIIOMMUOps get_viommu_cap with the PCI subsystem.
VFIO calls it to get the capabilities exposed by the vIOMMU.
* vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
to bind/unbind the device to IOMMUFD backed domains, either a nested
domain or not.
See below diagram:
        VFIO Device                                 Intel IOMMU
    .-----------------.                         .-------------------.
    |                 |                         |                   |
    |       .---------|PCIIOMMUOps              |.-------------.    |
    |       | IOMMUFD |(set/unset_iommu_device) || Host IOMMU  |    |
    |       | Device  |------------------------>|| Device list |    |
    |       .---------|(get_viommu_cap)         |.-------------.    |
    |                 |                         |       |           |
    |                 |                         |       V           |
    |       .---------|  HostIOMMUDeviceIOMMUFD |  .-------------.  |
    |       | IOMMUFD |            (attach_hwpt)|  | Host IOMMU  |  |
    |       | link    |<------------------------|  |   Device    |  |
    |       .---------|            (detach_hwpt)|  .-------------.  |
    |                 |                         |       |           |
    |                 |                         |       ...         |
    .-----------------.                         .-------------------.
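
The callback flow above can be sketched as a small standalone model. All
names here are hypothetical (ToyIOMMUOps, ToyVIOMMU, toy_vfio_realize,
TOY_VIOMMU_CAP_HW_NESTED); the signatures are simplified for illustration
and are not the real PCIIOMMUOps prototypes, which also take bus and error
arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_VIOMMU_CAP_HW_NESTED (1ULL << 0)  /* hypothetical bit value */
#define TOY_MAX_DEVFN 256

/* Hypothetical stand-in for HostIOMMUDevice. */
typedef struct ToyHostIOMMUDevice {
    uint32_t devid;
} ToyHostIOMMUDevice;

/* Simplified callback table the vIOMMU hands to the PCI layer. */
typedef struct ToyIOMMUOps {
    bool (*set_iommu_device)(void *opaque, int devfn,
                             ToyHostIOMMUDevice *hiod);
    void (*unset_iommu_device)(void *opaque, int devfn);
    uint64_t (*get_viommu_cap)(void *opaque);
} ToyIOMMUOps;

/* Toy vIOMMU state: the flts knob and the host IOMMU device list. */
typedef struct ToyVIOMMU {
    bool flts;
    ToyHostIOMMUDevice *hiods[TOY_MAX_DEVFN];
} ToyVIOMMU;

static bool toy_set_iommu_device(void *opaque, int devfn,
                                 ToyHostIOMMUDevice *hiod)
{
    ToyVIOMMU *s = opaque;

    assert(devfn >= 0 && devfn < TOY_MAX_DEVFN);
    if (s->hiods[devfn]) {
        return false;              /* already registered */
    }
    s->hiods[devfn] = hiod;
    return true;
}

static void toy_unset_iommu_device(void *opaque, int devfn)
{
    ToyVIOMMU *s = opaque;

    assert(devfn >= 0 && devfn < TOY_MAX_DEVFN);
    s->hiods[devfn] = NULL;
}

static uint64_t toy_get_viommu_cap(void *opaque)
{
    ToyVIOMMU *s = opaque;

    /* Report a purely emulated capability derived from the flts knob. */
    return s->flts ? TOY_VIOMMU_CAP_HW_NESTED : 0;
}

static const ToyIOMMUOps toy_viommu_ops = {
    .set_iommu_device = toy_set_iommu_device,
    .unset_iommu_device = toy_unset_iommu_device,
    .get_viommu_cap = toy_get_viommu_cap,
};

/* What the VFIO side might do at realize time in this model. */
static bool toy_vfio_realize(const ToyIOMMUOps *ops, void *opaque, int devfn,
                             ToyHostIOMMUDevice *hiod, bool *want_nest_parent)
{
    /* Query the cap first: it decides how the default hwpt is created. */
    *want_nest_parent =
        (ops->get_viommu_cap(opaque) & TOY_VIOMMU_CAP_HW_NESTED) != 0;
    return ops->set_iommu_device(opaque, devfn, hiod);
}
```

The attach_hwpt/detach_hwpt direction (vIOMMU calling back into the host
device) is omitted here to keep the sketch short.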
Below is an example to enable stage-1 translation for a passthrough device:

    -M q35,...
    -device intel-iommu,x-scalable-mode=on,x-flts=on...
    -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...

Test done:
- VFIO devices hotplug/unplug
- different VFIO devices linked to different iommufds
- vhost net device ping test
PATCH1-6:   Preparatory work
PATCH7-8:   Compatibility check between vIOMMU and host IOMMU
PATCH9-17:  Implement stage-1 page table for passthrough device
PATCH18-19: Workaround for ERRATA_772415_SPR17
PATCH20:    Enable stage-1 translation for passthrough device
QEMU code can be found at [2].
Fault reporting isn't supported in this series; we presume the guest kernel
always constructs a correct stage-1 page table for the passthrough device.
For emulated devices, the emulation code already provides stage-1 fault
injection.
TODO:
- Fault report to guest when HW stage1 faults
[1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
[2] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting.v4
Thanks
Zhenzhong
Changelog:
v4:
- s/VIOMMU_CAP_STAGE1/VIOMMU_CAP_HW_NESTED (Eric, Nicolin, Donald, Shameer)
- clarify get_viommu_cap() return pure emulated caps and explain reason in commit log (Eric)
- retrieve the ce only if vtd_as->pasid in vtd_as_to_iommu_pasid_locked (Eric)
- refine doc comment and commit log in patch10-11 (Eric)
v3:
- define enum type for VIOMMU_CAP_* (Eric)
- drop inline flag in the patch which uses the helper (Eric)
- use extract64 in new introduced MACRO (Eric)
- polish comments and fix typo error (Eric)
- split workaround patch for ERRATA_772415_SPR17 to two patches (Eric)
- optimize bind/unbind error path processing
v2:
- introduce get_viommu_cap() to get STAGE1 flag to create nested parent hwpt (Liuyi)
- reuse VFIO's default hwpt as parent hwpt of nested translation (Nicolin, Liuyi)
- abandon support of VFIO device under pcie-to-pci bridge to simplify design (Liuyi)
- bypass RO mapping in VFIO's default hwpt if ERRATA_772415_SPR17 (Liuyi)
- drop vtd_dev_to_context_entry optimization (Liuyi)
v1:
- simplify vendor specific checking in vtd_check_hiod (Cedric, Nicolin)
- rebase to master
rfcv3:
- s/hwpt_id/id in iommufd_backend_invalidate_cache()'s parameter (Shameer)
- hide vtd vendor specific caps in a wrapper union (Eric, Nicolin)
- simplify return value check of get_cap() (Eric)
- drop realize_late (Cedric, Eric)
- split patch13:intel_iommu: Add PASID cache management infrastructure (Eric)
- s/vtd_pasid_cache_reset/vtd_pasid_cache_reset_locked (Eric)
- s/vtd_pe_get_domain_id/vtd_pe_get_did (Eric)
- refine comments (Eric, Donald)
rfcv2:
- Drop VTDPASIDAddressSpace and use VTDAddressSpace (Eric, Liuyi)
- Move HWPT uAPI patches ahead(patch1-8) so arm nesting could easily rebase
- add two cleanup patches(patch9-10)
- VFIO passes iommufd/devid/hwpt_id to vIOMMU instead of iommufd/devid/ioas_id
- add vtd_as_[from|to]_iommu_pasid() helper to translate between vtd_as and
iommu pasid, this is important for dropping VTDPASIDAddressSpace
Yi Liu (3):
intel_iommu: Replay pasid bindings after context cache invalidation
intel_iommu: Propagate PASID-based iotlb invalidation to host
intel_iommu: Replay all pasid bindings when either SRTP or TE bit is
changed
Zhenzhong Duan (17):
intel_iommu: Rename vtd_ce_get_rid2pasid_entry to
vtd_ce_get_pasid_entry
hw/pci: Introduce pci_device_get_viommu_cap()
intel_iommu: Implement get_viommu_cap() callback
vfio/iommufd: Force creating nested parent domain
hw/pci: Export pci_device_get_iommu_bus_devfn() and return bool
intel_iommu: Introduce a new structure VTDHostIOMMUDevice
intel_iommu: Check for compatibility with IOMMUFD backed device when
x-flts=on
intel_iommu: Fail passthrough device under PCI bridge if x-flts=on
intel_iommu: Introduce two helpers vtd_as_from/to_iommu_pasid_locked
intel_iommu: Handle PASID entry removal and update
intel_iommu: Handle PASID entry addition
intel_iommu: Introduce a new pasid cache invalidation type FORCE_RESET
intel_iommu: Stick to system MR for IOMMUFD backed host device when
x-fls=on
intel_iommu: Bind/unbind guest page table to host
vfio: Add a new element bypass_ro in VFIOContainerBase
Workaround for ERRATA_772415_SPR17
intel_iommu: Enable host device when x-flts=on in scalable mode
MAINTAINERS | 1 +
hw/i386/intel_iommu_internal.h | 68 +-
include/hw/i386/intel_iommu.h | 9 +-
include/hw/iommu.h | 17 +
include/hw/pci/pci.h | 27 +
include/hw/vfio/vfio-container-base.h | 1 +
hw/i386/intel_iommu.c | 941 +++++++++++++++++++++++++-
hw/pci/pci.c | 23 +-
hw/vfio/iommufd.c | 22 +-
hw/vfio/listener.c | 13 +-
hw/i386/trace-events | 8 +
11 files changed, 1088 insertions(+), 42 deletions(-)
create mode 100644 include/hw/iommu.h
base-commit: 92c05be4dfb59a71033d4c57dac944b29f7dabf0
--
2.47.1

Kindly ping, any more comments?

Thanks
Zhenzhong

On Thu, Aug 21, 2025 at 07:19:43AM +0000, Duan, Zhenzhong wrote:
> Kindly ping, any more comments?
>
> Thanks
> Zhenzhong

I think there's been enough comments to spin v5.

--
MST

On 2025/8/21 15:19, Duan, Zhenzhong wrote:
> Kindly ping, any more comments?

Do you have enough comments for a new version. I plan to have a look
either this version or a new version next week. :)

Regards,
Yi Liu

On 8/21/25 10:50 AM, Yi Liu wrote:
> On 2025/8/21 15:19, Duan, Zhenzhong wrote:
>> Kindly ping, any more comments?
>
> Do you have enough comments for a new version. I plan to have a look
> either this version or a new version next week. :)

same for me ;-)

Eric
Hi Eric, Yi,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>
>
>On 8/21/25 10:50 AM, Yi Liu wrote:
>> On 2025/8/21 15:19, Duan, Zhenzhong wrote:
>>> Kindly ping, any more comments?
>>
>> Do you have enough comments for a new version. I plan to have a look
>> either this version or a new version next week. :)
>same for me ;-)

I'll send v5 per Michael's suggestion by end of this week.

Thanks
Zhenzhong
>-----Original Message-----
>From: Liu, Yi L <yi.l.liu@intel.com>
>Subject: Re: [PATCH v4 00/20] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>On 2025/8/21 15:19, Duan, Zhenzhong wrote:
>> Kindly ping, any more comments?
>
>Do you have enough comments for a new version. I plan to have a look
>either this version or a new version next week. :)

That's appreciated, thanks Yi. I think not, there are only a few comments
from Cedric for VFIO related patches.

Zhenzhong