Hi Eric,

>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: Re: [PATCH rfcv2 00/20] intel_iommu: Enable stage-1 translation for
>passthrough device
>
>
>Hi Zhenzhong
>
>On 2/19/25 9:22 AM, Zhenzhong Duan wrote:
>> Hi,
>>
>> Per Jason Wang's suggestion, iommufd nesting series[1] is split into
>> "Enable stage-1 translation for emulated device" series and
>> "Enable stage-1 translation for passthrough device" series.
>>
>> This series is the 2nd part, focusing on passthrough devices. We don't do
>> shadowing of the guest page table for passthrough devices but pass the
>> stage-1 page table to the host side to construct a nested domain. There
>> was some effort to enable this feature in the old days, see [2] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation
>> (also known as IOMMU nested translation) capability in the host IOMMU.
>> As the below diagram shows, the guest I/O page table pointer in GPA
>> (guest physical address) is passed to host and be used to perform
>s/be/is
>> the stage-1 address translation. Along with it, modifications to
>> present mappings in the guest I/O page table should be followed
>> with an IOTLB invalidation.
>>
>>     .-------------.  .---------------------------.
>>     |   vIOMMU    |  | Guest I/O page table      |
>>     |             |  '---------------------------'
>>     .----------------/
>>     | PASID Entry |--- PASID cache flush --+
>>     '-------------'                        |
>>     |             |                        V
>>     |             |           I/O page table pointer in GPA
>>     '-------------'
>> Guest
>> ------| Shadow |---------------------------|--------
>>       v        v                           v
>> Host
>>     .-------------.  .------------------------.
>>     |   pIOMMU    |  | FS for GIOVA->GPA      |
>>     |             |  '------------------------'
>>     .----------------/  |
>>     | PASID Entry |     V (Nested xlate)
>>     '----------------\.----------------------------------.
>>     |             |   | SS for GPA->HPA, unmanaged domain|
>>     |             |   '----------------------------------'
>>     '-------------'
>> Where:
>>  - FS = First stage page tables
>>  - SS = Second stage page tables
>> <Intel VT-d Nested translation>
>>
>> There are some interactions between VFIO and vIOMMU:
>> * vIOMMU registers PCIIOMMUOps [set|unset]_iommu_device to the PCI
>>   subsystem. VFIO calls them to register/unregister a HostIOMMUDevice
>>   instance to the vIOMMU at vfio device realize stage.
>> * vIOMMU calls the HostIOMMUDeviceIOMMUFD interface [at|de]tach_hwpt
>>   to bind/unbind a device to IOMMUFD backed domains, either nested
>>   domains or not.
>>
>> See below diagram:
>>
>>       VFIO Device                               Intel IOMMU
>>  .-----------------.                       .-------------------.
>>  |                 |                       |                   |
>>  |   .---------|PCIIOMMUOps               | .-------------.    |
>>  |   | IOMMUFD |(set_iommu_device)        | | Host IOMMU  |    |
>>  |   | Device  |---------------------------->| Device list |    |
>>  |   .---------|(unset_iommu_device)      | .-------------.    |
>>  |       |     |                          |        |           |
>>  |       |     |                          |        V           |
>>  |   .---------| HostIOMMUDeviceIOMMUFD   | .-------------.    |
>>  |   | IOMMUFD |        (attach_hwpt)     | | Host IOMMU  |    |
>>  |   |  link   |<-----------------------------|   Device    |    |
>>  |   .---------|        (detach_hwpt)     | .-------------.    |
>>  |       |     |                          |        |           |
>>  |       |     |                          |       ...          |
>>  .-----------------.                       .-------------------.
>>
>> Based on Yi's suggestion, this design is optimal in sharing ioas/hwpt
>> whenever possible and creating new ones on demand; it also supports
>> multiple iommufd objects and ERRATA_772415.
>>
>> E.g., a stage-2 page table could be shared by different devices if
>> there is no conflict and the devices link to the same iommufd object,
>> i.e. devices under the same host IOMMU can share the same stage-2 page
>> table. If there is a conflict, i.e. one device is under non cache
>> coherency mode while the others are not, it requires a separate
>> stage-2 page table in non-CC mode.
>>
>> The SPR platform has ERRATA_772415, which requires no readonly mappings
>> in the stage-2 page table. This series supports creating a VTDIOASContainer
>> with no readonly mappings. In the rare case that some IOMMUs on a
>> multiple-IOMMU host have ERRATA_772415 and others don't, this design
>> can still survive.
>>
>> See below example diagram for a full view:
>>
>>                       IntelIOMMUState
>>                              |
>>                              V
>>     .------------------.    .------------------.    .-------------------.
>>     | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>>     | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
>>     .------------------.    .------------------.    .-------------------.
>>              |                       |                        |
>>              |                       .-->...                  |
>>              V                                                V
>>     .-------------------.    .-------------------.       .---------------.
>>     |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->... | VTDS2Hwpt(CC) |-->...
>>     .-------------------.    .-------------------.       .---------------.
>>         |           |                 |                         |
>>         |           |                 |                         |
>>   .-----------. .-----------.   .------------.           .------------.
>>   | IOMMUFD   | | IOMMUFD   |   | IOMMUFD    |           | IOMMUFD    |
>>   | Device(CC)| | Device(CC)|   | Device     |           | Device(CC) |
>>   | (iommufd0)| | (iommufd0)|   | (non-CC)   |           | (errata)   |
>>   |           | |           |   | (iommufd0) |           | (iommufd0) |
>>   .-----------. .-----------.   .------------.           .------------.
>>
>> This series is also a prerequisite work for vSVA, i.e. sharing a
>> guest application address space with passthrough devices.
>>
>> To enable stage-1 translation, one only needs to add
>> "x-scalable-mode=on,x-flts=on",
>> i.e. -device intel-iommu,x-scalable-mode=on,x-flts=on...
>>
>> A passthrough device should use the iommufd backend to work with
>> stage-1 translation,
>> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>
>> If the host doesn't support nested translation, qemu will fail with an
>> unsupported report.
>
>you're not mentioning lack of error reporting from HW S1 faults to
>guests. Are there other deps missing?

Good question, this will be in a future series. The plan is:
1) vtd nesting
2) pasid support
3) PRQ support (this includes S1 fault passing)

So to play with this series, we have to presume the guest kernel always
constructs correct S1 page tables for passthrough devices; for emulated
devices, the emulation code already provides S1 fault injection.

Thanks
Zhenzhong
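For reference, the two command-line fragments quoted above can be combined into one invocation. This is only an illustrative sketch: the machine type, memory size, disk image, and the host BDF 0000:81:00.0 are placeholders, not values from the series; only the intel-iommu, iommufd, and vfio-pci options come from the cover letter.

```shell
# Hypothetical full invocation; host=0000:81:00.0, guest.img etc. are
# placeholders. The stage-1 (x-flts) and iommufd options are as quoted above.
qemu-system-x86_64 \
    -M q35,accel=kvm \
    -m 4G \
    -device intel-iommu,x-scalable-mode=on,x-flts=on \
    -object iommufd,id=iommufd0 \
    -device vfio-pci,host=0000:81:00.0,iommufd=iommufd0 \
    -drive file=guest.img,format=qcow2
```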