[PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation

Zhenzhong Duan posted 23 patches 10 months, 2 weeks ago
[PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation
Posted by Zhenzhong Duan 10 months, 2 weeks ago
Hi,

This series enables stage-1 translation support in intel iommu, which
we call "modern" mode. In this mode, we don't shadow the guest page
table for passthrough devices but pass the stage-1 page table to the
host side to construct a nested domain; we also support emulated
devices by translating the stage-1 page table. There was an earlier
effort to enable this feature, see [1] for details.

The key design is to utilize the dual-stage IOMMU translation
(also known as IOMMU nested translation) capability in the host IOMMU.
As the diagram below shows, the guest I/O page table pointer in GPA
(guest physical address) is passed to the host and used to perform
the stage-1 address translation. Along with it, modifications to
present mappings in the guest I/O page table should be followed
by an IOTLB invalidation.

        .-------------.  .---------------------------.
        |   vIOMMU    |  | Guest I/O page table      |
        |             |  '---------------------------'
        .----------------/
        | PASID Entry |--- PASID cache flush --+
        '-------------'                        |
        |             |                        V
        |             |           I/O page table pointer in GPA
        '-------------'
    Guest
    ------| Shadow |---------------------------|--------
          v        v                           v
    Host
        .-------------.  .------------------------.
        |   pIOMMU    |  |  FS for GIOVA->GPA     |
        |             |  '------------------------'
        .----------------/  |
        | PASID Entry |     V (Nested xlate)
        '----------------\.----------------------------------.
        |             |   | SS for GPA->HPA, unmanaged domain|
        |             |   '----------------------------------'
        '-------------'
Where:
 - FS = First stage page tables
 - SS = Second stage page tables
<Intel VT-d Nested translation>
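
The nested walk in the diagram above can be illustrated with a toy model.
This is only a sketch under stated assumptions: the Toy* names and the flat
lookup tables are hypothetical and are not taken from the series, and real
hardware also translates the stage-1 page table walk itself through stage 2
(the table pointer is a GPA), which this model leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_MASK  ((1ULL << TOY_PAGE_SHIFT) - 1)

/* One page-sized mapping in a toy, flat "page table" (hypothetical). */
typedef struct ToyMapping {
    uint64_t in_pfn;   /* input page frame number */
    uint64_t out_pfn;  /* output page frame number */
} ToyMapping;

typedef struct ToyStage {
    const ToyMapping *maps;
    size_t nr_maps;
} ToyStage;

/* Translate one address through a single stage; false means a fault. */
static bool toy_stage_xlate(const ToyStage *st, uint64_t in, uint64_t *out)
{
    assert(st && out);
    for (size_t i = 0; i < st->nr_maps; i++) {
        if (st->maps[i].in_pfn == (in >> TOY_PAGE_SHIFT)) {
            *out = (st->maps[i].out_pfn << TOY_PAGE_SHIFT) |
                   (in & TOY_PAGE_MASK);
            return true;
        }
    }
    return false;
}

/* Nested translation: FS maps GIOVA->GPA, then SS maps GPA->HPA. */
static bool toy_nested_xlate(const ToyStage *fs, const ToyStage *ss,
                             uint64_t giova, uint64_t *hpa)
{
    uint64_t gpa;

    if (!toy_stage_xlate(fs, giova, &gpa)) {
        return false;                      /* stage-1 fault */
    }
    return toy_stage_xlate(ss, gpa, hpa);  /* stage-2 lookup */
}
```

In this model the guest owns the FS table and the host owns the SS table,
which is why a guest change to a present FS mapping has to be followed by an
invalidation that reaches the host.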

There are some interactions between VFIO and vIOMMU.
* vIOMMU registers PCIIOMMUOps to the PCI subsystem, which VFIO can
  use to register/unregister an IOMMUDevice object.
* VFIO registers an IOMMUFDDevice object to vIOMMU at vfio device
  realize stage; this is implemented as a prerequisite series[2].
* vIOMMU calls the IOMMUFDDevice interface callbacks in
  IOMMUFDDeviceOps to bind/unbind a device to IOMMUFD backed domains,
  either nested domain or not.

See the diagram below:

        VFIO Device                                 Intel IOMMU
    .-----------------.                         .-------------------.
    |                 |                         |                   |
    |       .---------|PCIIOMMUOps              |.-------------.    |
    |       | IOMMUFD |(set_iommu_device)       || IOMMUFD     |    |
    |       | Device  |------------------------>|| Device list |    |
    |       .---------|(unset_iommu_device)     |.-------------.    |
    |                 |                         |       |           |
    |                 |                         |       V           |
    |       .---------|         IOMMUFDDeviceOps|  .---------.      |
    |       | IOMMUFD |            (attach_hwpt)|  | IOMMUFD |      |
    |       | link    |<------------------------|  | Device  |      |
    |       .---------|            (detach_hwpt)|  .---------.      |
    |                 |                         |       |           |
    |                 |                         |       ...         |
    .-----------------.                         .-------------------.
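
The two callback directions in the diagram above can be sketched as a pair
of ops tables. Only the callback names (set_iommu_device,
unset_iommu_device, attach_hwpt, detach_hwpt) come from the diagram; the
Sketch* types, the signatures and the fixed-size device list are
assumptions made for illustration and are not the QEMU definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct SketchIOMMUFDDevice SketchIOMMUFDDevice;

/* vIOMMU -> VFIO: bind or unbind a device to a hardware page table. */
typedef struct SketchIOMMUFDDeviceOps {
    int (*attach_hwpt)(SketchIOMMUFDDevice *idev, uint32_t hwpt_id);
    int (*detach_hwpt)(SketchIOMMUFDDevice *idev);
} SketchIOMMUFDDeviceOps;

struct SketchIOMMUFDDevice {
    const SketchIOMMUFDDeviceOps *ops;
    uint32_t attached_hwpt;   /* 0 means "not attached" in this sketch */
};

/* VFIO -> vIOMMU: hand a device over, or take it back. */
typedef struct SketchPCIIOMMUOps {
    int (*set_iommu_device)(void *viommu, SketchIOMMUFDDevice *idev);
    void (*unset_iommu_device)(void *viommu, SketchIOMMUFDDevice *idev);
} SketchPCIIOMMUOps;

#define SKETCH_MAX_DEVS 4

/* Toy vIOMMU state: just the "IOMMUFD Device list" from the diagram. */
typedef struct SketchVIOMMU {
    SketchIOMMUFDDevice *devs[SKETCH_MAX_DEVS];
} SketchVIOMMU;

static int sketch_set_iommu_device(void *opaque, SketchIOMMUFDDevice *idev)
{
    SketchVIOMMU *s = opaque;

    assert(s && idev);
    for (size_t i = 0; i < SKETCH_MAX_DEVS; i++) {
        if (!s->devs[i]) {
            s->devs[i] = idev;
            return 0;
        }
    }
    return -1;   /* device list is full */
}

static void sketch_unset_iommu_device(void *opaque, SketchIOMMUFDDevice *idev)
{
    SketchVIOMMU *s = opaque;

    assert(s && idev);
    for (size_t i = 0; i < SKETCH_MAX_DEVS; i++) {
        if (s->devs[i] == idev) {
            s->devs[i] = NULL;
        }
    }
}

/* Stand-ins for the VFIO side; a real backend would talk to iommufd. */
static int sketch_attach_hwpt(SketchIOMMUFDDevice *idev, uint32_t hwpt_id)
{
    idev->attached_hwpt = hwpt_id;
    return 0;
}

static int sketch_detach_hwpt(SketchIOMMUFDDevice *idev)
{
    idev->attached_hwpt = 0;
    return 0;
}

static const SketchIOMMUFDDeviceOps sketch_vfio_ops = {
    .attach_hwpt = sketch_attach_hwpt,
    .detach_hwpt = sketch_detach_hwpt,
};

static const SketchPCIIOMMUOps sketch_vtd_ops = {
    .set_iommu_device = sketch_set_iommu_device,
    .unset_iommu_device = sketch_unset_iommu_device,
};

/* vIOMMU side: attach every registered device; returns how many succeeded. */
static int sketch_viommu_bind_all(SketchVIOMMU *s, uint32_t hwpt_id)
{
    int bound = 0;

    for (size_t i = 0; i < SKETCH_MAX_DEVS; i++) {
        SketchIOMMUFDDevice *idev = s->devs[i];

        if (idev && idev->ops->attach_hwpt(idev, hwpt_id) == 0) {
            bound++;
        }
    }
    return bound;
}
```

The point of the split is that VFIO never decides which domain a device
lands in; it only exposes attach/detach, and the vIOMMU drives them when
the guest programs its PASID entries.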

Based on Yi's suggestion, we adopted a new design for managing ioas
and hwpt that supports multiple iommufd objects and the ERRATA_772415
case, while trying to share ioas and hwpt whenever possible.

A stage-2 page table can be shared by different devices if there is
no conflict and the devices link to the same iommufd object, i.e.
devices under the same host IOMMU can share the same stage-2 page
table. If there is a conflict, e.g. one device is in non cache
coherency (non-CC) mode while the others are not, it requires a
separate stage-2 page table in non-CC mode.

The SPR platform has ERRATA_772415, which requires that there be no
readonly mappings in the stage-2 page table. This series supports
creating a VTDIOASContainer with no readonly mappings. I'm not sure
whether there is a rare case where only some IOMMUs on a multi-IOMMU
host have ERRATA_772415, but this design can survive even that case.

See the example diagram below for a full view:

      IntelIOMMUState
             |
             V
    .------------------.    .------------------.    .-------------------.
    | VTDIOASContainer |--->| VTDIOASContainer |--->| VTDIOASContainer  |-->...
    | (iommufd0,RW&RO) |    | (iommufd1,RW&RO) |    | (iommufd0,RW only)|
    .------------------.    .------------------.    .-------------------.
             |                       |                              |
             |                       .-->...                        |
             V                                                      V
      .-------------------.    .-------------------.          .---------------.
      |   VTDS2Hwpt(CC)   |--->| VTDS2Hwpt(non-CC) |-->...    | VTDS2Hwpt(CC) |-->...
      .-------------------.    .-------------------.          .---------------.
          |            |               |                            |
          |            |               |                            |
    .-----------.  .-----------.  .------------.              .------------.
    | IOMMUFD   |  | IOMMUFD   |  | IOMMUFD    |              | IOMMUFD    |
    | Device(CC)|  | Device(CC)|  | Device     |              | Device(CC) |
    | (iommufd0)|  | (iommufd0)|  | (non-CC)   |              | (errata)   |
    |           |  |           |  | (iommufd0) |              | (iommufd0) |
    .-----------.  .-----------.  .------------.              .------------.
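
The sharing rule shown in the diagram above can be sketched as a two-level
lookup: first find a container matching (iommufd, RW-only or not), then find
a stage-2 hwpt matching the device's CC mode, creating either one when no
match exists. The Sk* types and the flag comparison are assumptions for
illustration; the series may decide compatibility differently, for example
by attempting an attach and falling back on failure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_MAX_CONTAINERS 8
#define SK_MAX_HWPTS      4

/* Hypothetical stand-in for VTDS2Hwpt. */
typedef struct SkS2Hwpt {
    bool cc;            /* cache coherency enforced */
    int nr_devices;     /* devices currently sharing this hwpt */
} SkS2Hwpt;

/* Hypothetical stand-in for VTDIOASContainer. */
typedef struct SkIOASContainer {
    int iommufd;        /* which iommufd object backs this ioas */
    bool rw_only;       /* true: no readonly mappings (errata case) */
    SkS2Hwpt hwpts[SK_MAX_HWPTS];
    size_t nr_hwpts;
} SkIOASContainer;

typedef struct SkState {
    SkIOASContainer containers[SK_MAX_CONTAINERS];
    size_t nr_containers;
} SkState;

typedef struct SkDevice {
    int iommufd;
    bool cc;
    bool errata;        /* device sits behind an IOMMU with the errata */
} SkDevice;

/* Return a shareable stage-2 hwpt for dev, creating one if needed. */
static SkS2Hwpt *sk_get_hwpt(SkState *s, const SkDevice *dev)
{
    SkIOASContainer *c = NULL;
    SkS2Hwpt *h = NULL;

    assert(s && dev);

    /* Level 1: container keyed by iommufd and readonly-mapping policy. */
    for (size_t i = 0; i < s->nr_containers; i++) {
        SkIOASContainer *cur = &s->containers[i];

        if (cur->iommufd == dev->iommufd && cur->rw_only == dev->errata) {
            c = cur;
            break;
        }
    }
    if (!c) {
        if (s->nr_containers == SK_MAX_CONTAINERS) {
            return NULL;
        }
        c = &s->containers[s->nr_containers++];
        c->iommufd = dev->iommufd;
        c->rw_only = dev->errata;
        c->nr_hwpts = 0;
    }

    /* Level 2: hwpt keyed by cache coherency mode. */
    for (size_t i = 0; i < c->nr_hwpts; i++) {
        if (c->hwpts[i].cc == dev->cc) {
            h = &c->hwpts[i];
            break;
        }
    }
    if (!h) {
        if (c->nr_hwpts == SK_MAX_HWPTS) {
            return NULL;
        }
        h = &c->hwpts[c->nr_hwpts++];
        h->cc = dev->cc;
        h->nr_devices = 0;
    }

    h->nr_devices++;
    return h;
}
```

With the devices from the diagram, two CC devices on iommufd0 end up on the
same hwpt, the non-CC device gets a second hwpt in the same container, and
the errata device gets its own RW-only container.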

This series is also prerequisite work for vSVA, i.e. sharing
guest application address space with passthrough devices.

To enable "modern" mode, just add "x-scalable-mode=modern",
e.g. -device intel-iommu,x-scalable-mode=modern,...

A passthrough device should use the iommufd backend to work in "modern"
mode, e.g. -object iommufd,id=iommufd0 -device vfio-pci,iommufd=iommufd0,...

If the host doesn't support nested translation, qemu will fail
with an unsupported report.
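
A minimal sketch of how a string-valued mode option and the host capability
check could fit together. Everything here is hypothetical: the enum, the
function names and the "off"/"legacy" spellings are assumptions (only
"modern" appears above), and the error codes are merely one plausible
choice.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical modes; only "modern" is named in this cover letter. */
typedef enum SketchScalableMode {
    SKETCH_SM_OFF,
    SKETCH_SM_LEGACY,
    SKETCH_SM_MODERN,
} SketchScalableMode;

/* Parse the option string; returns 0 on success, -EINVAL otherwise. */
static int sketch_parse_scalable_mode(const char *val,
                                      SketchScalableMode *mode)
{
    assert(val && mode);
    if (!strcmp(val, "off")) {
        *mode = SKETCH_SM_OFF;
    } else if (!strcmp(val, "legacy")) {
        *mode = SKETCH_SM_LEGACY;
    } else if (!strcmp(val, "modern")) {
        *mode = SKETCH_SM_MODERN;
    } else {
        return -EINVAL;
    }
    return 0;
}

/*
 * Modern mode hands the guest stage-1 table to the host for passthrough
 * devices, so it needs host nested translation; fail early otherwise.
 */
static int sketch_check_host(SketchScalableMode mode, bool has_passthrough,
                             bool host_nesting)
{
    if (mode == SKETCH_SM_MODERN && has_passthrough && !host_nesting) {
        return -ENOTSUP;
    }
    return 0;
}
```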

Tests done:
- devices hotplug/unplug
- different devices linked to different iommufds

PATCH1-2:  Preparatory work to update headers and IOMMUFD uAPI
PATCH3-4:  Initialize vfio IOMMUFDDevice interface and pass to vIOMMU
PATCH5:    Introduce a placeholder variable for scalable modern mode
PATCH6:    Sync host cap/ecap with vIOMMU default cap/ecap in modern mode
PATCH7-22: Implement first stage page table for passthrough and emulated device
PATCH23:   Introduce "modern" mode to distinguish it from legacy mode

Qemu code can be found at [3]
Matching kernel code can be found at [4]

TODO:
- RAM discard
- dirty tracking on stage-2 page table

THOUGHTS:
This design is optimal in sharing ioas/hwpt whenever possible, but it also
brings some overhead: vIOMMU has to implement a memory listener similar to
vfio_memory_listener, i.e. this memory listener should also support ram
discard and dirty tracking.

We have also implemented another design internally that reuses the ioas from
vfio to create the s2hwpt. This way each device has its own s2hwpt and shares
vfio's ioas, so vfio_memory_listener is reused with no code redundancy. But
this has three flaws:
 1. address space switch should be bypassed for vfio devices, which means a
    vfio device and an emulated device can't share the same address space.
 2. still need to create a separate ioas/hwpt if ERRATA_772415 applies.
 3. no ioas/hwpt sharing.

It's not clear which design the community prefers; internally we like the
current design a bit more. Comments and suggestions are welcome.

[1] https://patchwork.kernel.org/project/kvm/cover/20210302203827.437645-1-yi.l.liu@intel.com/
[2] https://lists.gnu.org/archive/html/qemu-devel/2024-01/msg02730.html
[3] https://github.com/yiliu1765/qemu/tree/zhenzhong/iommufd_nesting_rfcv1
[4] https://github.com/yiliu1765/iommufd/tree/iommufd_nesting

Thanks
Zhenzhong


Yi Liu (11):
  intel_iommu: process PASID cache invalidation
  intel_iommu: add PASID cache management infrastructure
  intel_iommu: replay pasid binds after context cache invalidation
  intel_iommu: process PASID-based iotlb invalidation
  intel_iommu: propagate PASID-based iotlb invalidation to host
  intel_iommu: process PASID-based Device-TLB invalidation
  intel_iommu: rename slpte in iotlb_entry to pte
  intel_iommu: implement first level translation
  intel_iommu: introduce pasid iotlb cache
  intel_iommu: refresh pasid bind after pasid cache force reset
  intel_iommu: modify x-scalable-mode to be string option

Yi Sun (2):
  intel_iommu: piotlb invalidation should notify unmap
  intel_iommu: invalidate piotlb when flush pasid

Yu Zhang (1):
  intel_iommu: fix the fault reason report

Zhenzhong Duan (9):
  Update linux header to support nested hwpt alloc
  backends/iommufd: add helpers for allocating user-managed HWPT
  backends/iommufd_device: introduce IOMMUFDDevice targeted interface
  vfio: implement IOMMUFDDevice interface callbacks
  intel_iommu: add a placeholder variable for scalable modern mode
  intel_iommu: check and sync host IOMMU cap/ecap in scalable modern
    mode
  vfio/iommufd_device: Add ioas_id in IOMMUFDDevice and pass to vIOMMU
  intel_iommu: bind/unbind guest page table to host
  intel_iommu: ERRATA_772415 workaround

 hw/i386/intel_iommu_internal.h                |  109 +-
 include/hw/i386/intel_iommu.h                 |   63 +-
 include/standard-headers/drm/drm_fourcc.h     |    2 +
 include/standard-headers/linux/fuse.h         |   10 +-
 include/standard-headers/linux/pci_regs.h     |   24 +-
 include/standard-headers/linux/vhost_types.h  |    7 +
 .../standard-headers/linux/virtio_config.h    |    5 +
 include/standard-headers/linux/virtio_pci.h   |   11 +
 include/sysemu/iommufd.h                      |    7 +
 include/sysemu/iommufd_device.h               |   12 +-
 linux-headers/asm-arm64/kvm.h                 |   32 +
 linux-headers/asm-generic/unistd.h            |   14 +-
 linux-headers/asm-loongarch/bitsperlong.h     |    1 +
 linux-headers/asm-loongarch/kvm.h             |  108 +
 linux-headers/asm-loongarch/mman.h            |    1 +
 linux-headers/asm-loongarch/unistd.h          |    5 +
 linux-headers/asm-mips/unistd_n32.h           |    4 +
 linux-headers/asm-mips/unistd_n64.h           |    4 +
 linux-headers/asm-mips/unistd_o32.h           |    4 +
 linux-headers/asm-powerpc/unistd_32.h         |    4 +
 linux-headers/asm-powerpc/unistd_64.h         |    4 +
 linux-headers/asm-riscv/kvm.h                 |   12 +
 linux-headers/asm-s390/unistd_32.h            |    4 +
 linux-headers/asm-s390/unistd_64.h            |    4 +
 linux-headers/asm-x86/unistd_32.h             |    4 +
 linux-headers/asm-x86/unistd_64.h             |    3 +
 linux-headers/asm-x86/unistd_x32.h            |    3 +
 linux-headers/linux/iommufd.h                 |  259 +-
 linux-headers/linux/kvm.h                     |   11 +
 linux-headers/linux/psp-sev.h                 |    1 +
 linux-headers/linux/stddef.h                  |    9 +-
 linux-headers/linux/userfaultfd.h             |    9 +-
 linux-headers/linux/vfio.h                    |   47 +-
 linux-headers/linux/vhost.h                   |    8 +
 backends/iommufd.c                            |   61 +
 backends/iommufd_device.c                     |   17 +-
 hw/i386/intel_iommu.c                         | 2822 ++++++++++++++---
 hw/vfio/iommufd.c                             |   37 +-
 backends/trace-events                         |    2 +
 hw/i386/trace-events                          |   16 +
 40 files changed, 3256 insertions(+), 504 deletions(-)
 create mode 100644 linux-headers/asm-loongarch/bitsperlong.h
 create mode 100644 linux-headers/asm-loongarch/kvm.h
 create mode 100644 linux-headers/asm-loongarch/mman.h
 create mode 100644 linux-headers/asm-loongarch/unistd.h

-- 
2.34.1
Re: [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation
Posted by Jason Wang 10 months, 1 week ago
On Mon, Jan 15, 2024 at 6:39 PM Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:
> PATCH1-2:  Some preparing work to update header and IOMMUFD uAPI
> PATCH3-4:  Initialize vfio IOMMUFDDevice interface and pass to vIOMMU
> PATCH5:    Introduce a placeholder variable for scalable modern mode
> PATCH6:    Sync host cap/ecap with vIOMMU default cap/ecap in modern mode
> PATCH7-22: Implement first stage page table for passthrough and emulated device

Can we split the series and start from the emulated devices (and have
a qtest for that)? This might help for reviewing.

Thanks
RE: [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation
Posted by Duan, Zhenzhong 10 months, 1 week ago

>-----Original Message-----
>From: Jason Wang <jasowang@redhat.com>
>Subject: Re: [PATCH rfcv1 00/23] intel_iommu: Enable stage-1 translation
>
>On Mon, Jan 15, 2024 at 6:39 PM Zhenzhong Duan
><zhenzhong.duan@intel.com> wrote:
>> PATCH1-2:  Some preparing work to update header and IOMMUFD uAPI
>> PATCH3-4:  Initialize vfio IOMMUFDDevice interface and pass to vIOMMU
>> PATCH5:    Introduce a placeholder variable for scalable modern mode
>> PATCH6:    Sync host cap/ecap with vIOMMU default cap/ecap in modern mode
>> PATCH7-22: Implement first stage page table for passthrough and emulated device
>
>Can we split the series and start from the emulated devices (and have
>a qtest for that)? This might help for reviewing.

Sure, will do in rfcv2.

Thanks
Zhenzhong