RE: [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation

Duan, Zhenzhong posted 23 patches 10 months, 1 week ago
Posted by Duan, Zhenzhong 10 months, 1 week ago

>-----Original Message-----
>From: Jason Wang <jasowang@redhat.com>
>Subject: Re: [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation
>
>On Mon, Jan 15, 2024 at 6:39 PM Zhenzhong Duan
><zhenzhong.duan@intel.com> wrote:
>>
>> Hi,
>>
>> This series enables stage-1 translation support in intel iommu, which
>> we call "modern" mode. In this mode, we don't shadow the guest page
>> table for passthrough devices but pass the stage-1 page table to the
>> host side to construct a nested domain; we also support emulated
>> devices by translating the stage-1 page table. There was an earlier
>> effort to enable this feature, see [1] for details.
>>
>> The key design is to utilize the dual-stage IOMMU translation
>> (also known as IOMMU nested translation) capability in host IOMMU.
>> As the below diagram shows, guest I/O page table pointer in GPA
>> (guest physical address) is passed to host and used to perform
>> the stage-1 address translation. Along with it, modifications to
>> present mappings in the guest I/O page table should be followed
>> with an IOTLB invalidation.
>>
>>         .-------------.  .---------------------------.
>>         |   vIOMMU    |  | Guest I/O page table      |
>>         |             |  '---------------------------'
>>         .----------------/
>>         | PASID Entry |--- PASID cache flush --+
>>         '-------------'                        |
>>         |             |                        V
>>         |             |           I/O page table pointer in GPA
>>         '-------------'
>>     Guest
>>     ------| Shadow |---------------------------|--------
>>           v        v                           v
>>     Host
>>         .-------------.  .------------------------.
>>         |   pIOMMU    |  |  FS for GIOVA->GPA     |
>>         |             |  '------------------------'
>>         .----------------/  |
>>         | PASID Entry |     V (Nested xlate)
>>         '----------------\.----------------------------------.
>>         |             |   | SS for GPA->HPA, unmanaged domain|
>>         |             |   '----------------------------------'
>>         '-------------'
>> Where:
>>  - FS = First stage page tables
>>  - SS = Second stage page tables
>> <Intel VT-d Nested translation>
>>
>> There are some interactions between VFIO and vIOMMU.
>> * vIOMMU registers PCIIOMMUOps to PCI subsystem which VFIO can
>>   use to register/unregister IOMMUDevice object.
>> * VFIO registers an IOMMUFDDevice object at vfio device realize
>>   stage to vIOMMU, this is implemented as a prerequisite series[2].
>> * vIOMMU calls IOMMUFDDevice interface callback IOMMUFDDeviceOps
>>   to bind/unbind device to IOMMUFD backed domains, either nested
>>   domain or not.
>>
>> See below diagram:
>>
>>         VFIO Device                                 Intel IOMMU
>>     .-----------------.                         .-------------------.
>>     |                 |                         |                   |
>>     |       .---------|PCIIOMMUOps              |.-------------.    |
>>     |       | IOMMUFD |(set_iommu_device)       || IOMMUFD     |    |
>>     |       | Device  |------------------------>|| Device list |    |
>>     |       .---------|(unset_iommu_device)     |.-------------.    |
>>     |                 |                         |       |           |
>>     |                 |                         |       V           |
>>     |       .---------|         IOMMUFDDeviceOps|  .---------.      |
>>     |       | IOMMUFD |            (attach_hwpt)|  | IOMMUFD |      |
>>     |       | link    |<------------------------|  | Device  |      |
>>     |       .---------|            (detach_hwpt)|  .---------.      |
>>     |                 |                         |       |           |
>>     |                 |                         |       ...         |
>>     .-----------------.                         .-------------------.
>>
>> Based on Yi's suggestion, we reworked the design for managing ioas and
>> hwpt so that it supports multiple iommufd objects and the ERRATA_772415
>> case, while still sharing ioas and hwpt whenever possible.
>>
>> Stage-2 page table could be shared by different devices if there is
>> no conflict and devices link to same iommufd object, i.e. devices
>> under same host IOMMU can share same stage-2 page table. If there
>> is conflict, i.e. there is one device under non cache coherency
>> mode which is different from others, it requires a separate
>> stage-2 page table in non-CC mode.
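
The sharing rule in the paragraph above can be pictured with a small sketch. This is not code from the series: S2Hwpt, S2HwptList, s2_hwpt_get and MAX_S2_HWPT are hypothetical names, and the model assumes a stage-2 page table can be reused when the device links to the same iommufd object and has the same cache-coherency mode; otherwise a separate one is created.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_S2_HWPT 8   /* arbitrary limit for this sketch */

/* Hypothetical model of one stage-2 page table (hwpt) and its users. */
typedef struct {
    int iommufd_id;   /* iommufd object the hwpt belongs to */
    bool coherent;    /* true = CC mode, false = non-CC mode */
    int users;        /* number of devices attached to this hwpt */
} S2Hwpt;

typedef struct {
    S2Hwpt hwpt[MAX_S2_HWPT];
    size_t count;
} S2HwptList;

/*
 * Return a stage-2 hwpt the device can share, creating a new one when
 * no existing entry matches. Returns NULL when the list is full.
 */
static S2Hwpt *s2_hwpt_get(S2HwptList *list, int iommufd_id, bool coherent)
{
    assert(list != NULL);

    /* Reuse an existing hwpt if iommufd and coherency mode both match. */
    for (size_t i = 0; i < list->count; i++) {
        S2Hwpt *h = &list->hwpt[i];
        if (h->iommufd_id == iommufd_id && h->coherent == coherent) {
            h->users++;
            return h;
        }
    }

    /* Conflict (or first device): allocate a separate hwpt. */
    if (list->count == MAX_S2_HWPT) {
        return NULL;
    }
    S2Hwpt *h = &list->hwpt[list->count++];
    h->iommufd_id = iommufd_id;
    h->coherent = coherent;
    h->users = 1;
    return h;
}
```

In this model, two CC devices on iommufd0 get the same entry, while a non-CC device on iommufd0 gets its own.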
>>
>> SPR platform has ERRATA_772415 which requires no readonly mappings
>> in stage-2 page table. This series supports creating VTDIOASContainer
>> with no readonly mappings. I'm not sure whether there could be a rare
>> case where only some IOMMUs on a multi-IOMMU host have ERRATA_772415,
>> but this design can survive even in that case.
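
A minimal sketch of the "no readonly mappings" idea follows. The names IoasContainer, GuestRegion, container_accepts and container_count_mappable are hypothetical, and it assumes (this is not confirmed by the cover letter) that a container created for an affected device simply skips read-only guest regions while a normal container maps everything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a VTDIOASContainer. */
typedef struct {
    int iommufd_id;
    bool rw_only;     /* true: container must hold no read-only mappings */
} IoasContainer;

/* Hypothetical description of one guest memory region. */
typedef struct {
    uint64_t gpa;
    uint64_t size;
    bool readonly;
} GuestRegion;

/* A region is mapped unless it is read-only and the container is RW-only. */
static bool container_accepts(const IoasContainer *c, const GuestRegion *r)
{
    assert(c != NULL && r != NULL);
    return !(c->rw_only && r->readonly);
}

/* Count how many of the n regions would be mapped into the container. */
static size_t container_count_mappable(const IoasContainer *c,
                                       const GuestRegion *regions, size_t n)
{
    size_t mapped = 0;
    for (size_t i = 0; i < n; i++) {
        if (container_accepts(c, &regions[i])) {
            mapped++;
        }
    }
    return mapped;
}
```

With one read-only region among three, a normal container would map all three and an RW-only container would map two.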
>>
>> See below example diagram for a full view:
>>
>>       IntelIOMMUState
>>              |
>>              V
>>     .------------------.    .------------------.    .-------------------.
>>     | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
>>     | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
>>     .------------------.    .------------------.    .-------------------.
>>              |                       |                              |
>>              |                       .-->...                        |
>>              V                                                      V
>>       .-------------------.    .-------------------.          .---------------.
>>       |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
>>       .-------------------.    .-------------------.          .---------------.
>>           |            |               |                            |
>>           |            |               |                            |
>>     .-----------.  .-----------.  .------------.              .------------.
>>     | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
>>     | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
>>     | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
>>     |           |  |           |  | (iommufd0) |              | (iommufd0) |
>>     .-----------.  .-----------.  .------------.              .------------.
>>
>> This series is also a prerequisite work for vSVA, i.e. Sharing
>> guest application address space with passthrough devices.
>>
>> To enable "modern" mode, only need to add "x-scalable-mode=modern".
>> i.e. -device intel-iommu,x-scalable-mode=modern,...
>>
>> Passthrough device should use iommufd backend to work in "modern" mode.
>> i.e. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...
>>
>> If host doesn't support nested translation, qemu will fail
>> with an unsupported report.
>>
>> Test done:
>> - devices hotplug/unplug
>> - different devices linked to different iommufds
>>
>> PATCH1-2:  Some preparing work to update header and IOMMUFD uAPI
>> PATCH3-4:  Initialize vfio IOMMUFDDevice interface and pass to vIOMMU
>> PATCH5:    Introduce a placeholder variable for scalable modern mode
>> PATCH6:    Sync host cap/ecap with vIOMMU default cap/ecap in modern mode
>> PATCH7-22: Implement first stage page table for passthrough and emulated device
>
>Can we split the series and start from the emulated devices (and have
>a qtest for that)? This might help for reviewing.

Sure, will do in rfcv2.

Thanks
Zhenzhong