Hi all,
This is a draft solution that adds multiple nested SMMU instances to a VM.
The main goal of the series is to collect opinions and figure out a
reasonable solution that fits our needs.
I understand from our previous discussion that there are concerns
regarding this support:
https://lore.kernel.org/all/ZEcT%2F7erkhHDaNvD@Asurada-Nvidia/
Yet, a followup discussion on the kernel mailing list has shifted the
direction of regular SMMU nesting toward potentially having multiple
vSMMU instances as well:
https://lore.kernel.org/all/20240611121756.GT19897@nvidia.com/
I will summarize all the points in the following paragraphs:
[ Why do we need multiple nested SMMUs? ]
1, This is a must-have feature for NVIDIA's Grace SoC to support its CMDQV
(a HW extension for SMMU). It allows assigning a command queue HW
exclusively to a VM, which then controls it via an mmap'd MMIO page:
https://lore.kernel.org/all/f00da8df12a154204e53b343b2439bf31517241f.1712978213.git.nicolinc@nvidia.com/
Each Grace SoC has 5 SMMUs (i.e. 5 CMDQVs), meaning there can be 5
MMIO pages. If QEMU only supports one vSMMU and all passthrough
devices attach to that one shared vSMMU, it technically cannot mmap 5
MMIO pages, nor assign devices to their corresponding pages.
2, This is optional for nested SMMU, and is essentially a design choice
between a single-vSMMU design and a multiple-vSMMU design. Here are
the pros and cons:
+ Pros for single vSMMU design
a) It is easy and clean, by all means.
- Cons for single vSMMU design
b) It can have complications if the underlying pSMMUs are different.
c) Emulated devices might have to be added to the nested SMMU,
since "iommu=nested-smmuv3" is enabled for the whole VM. This
means the vSMMU instance has to act as both a nested SMMU and
a para-virt SMMU at the same time.
d) IOTLB inefficiency. Since devices behind different pSMMUs are
attached to a single vSMMU, the vSMMU code traps invalidation
commands from a shared guest CMDQ and must dispatch them correctly
to the pSMMUs, either by broadcasting or by walking through a
lookup table. Note that a command not tied to any hwpt or device
still has to be broadcast (see the sketch after this list).
+ Pros for multiple vSMMU design
e) Emulated devices can be isolated from any nested SMMU.
f) Cache invalidation commands will always be forwarded to the
corresponding pSMMU, reducing the overhead of the vSMMU walking
through a lookup table or broadcasting.
g) It accommodates CMDQV very easily.
- Cons for multiple vSMMU design
h) Complications in the VIRT and IORT design.
i) Difficulty supporting device hotplug.
j) Potential to run out of PCI bus numbers, as QEMU doesn't
support multiple PCI domains.
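To illustrate d) and f), here is a minimal C sketch of the dispatch
problem in the single-vSMMU case. None of these names come from the
series; forward_to_psmmu(), InvCmd and the per-SID hash table are
hypothetical stand-ins:

    #include <stdint.h>
    #include <stdbool.h>
    #include <glib.h>

    typedef struct { int iommufd_devid; } PSMMU;             /* hypothetical */
    typedef struct { uint32_t sid; bool has_sid; } InvCmd;   /* hypothetical */

    static void forward_to_psmmu(PSMMU *p, InvCmd *cmd)
    {
        /* stub: would issue the invalidation to this physical SMMU */
    }

    static void dispatch_invalidation(InvCmd *cmd, PSMMU *psmmus, int n,
                                      GHashTable *sid_to_psmmu)
    {
        if (cmd->has_sid) {
            /* lookup table: must be kept in sync with every (un)plug */
            forward_to_psmmu(g_hash_table_lookup(sid_to_psmmu,
                                                 GUINT_TO_POINTER(cmd->sid)),
                             cmd);
            return;
        }
        /* not tied to a device/hwpt: has to be broadcast to all pSMMUs */
        for (int i = 0; i < n; i++) {
            forward_to_psmmu(&psmmus[i], cmd);
        }
    }

With multiple vSMMUs, both the per-SID lookup and the broadcast collapse
into "forward to the one pSMMU backing this vSMMU".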
[ How is it implemented with this series? ]
* As an experimental series, this is all done in VIRT and ACPI code.
* Scan iommu sysfs nodes and build an SMMU node list (PATCH-03).
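For reference, a minimal sketch (using glib, as QEMU does) of what such
a scan could look like. /sys/class/iommu is the standard Linux path;
matching nodes by the "smmu3." prefix is an assumption about how the
host driver names them, and PATCH-03 may do it differently:

    #include <glib.h>

    static GSList *scan_host_smmu_nodes(void)
    {
        GSList *nodes = NULL;
        GDir *dir = g_dir_open("/sys/class/iommu", 0, NULL);
        const char *name;

        if (!dir) {
            return NULL;
        }
        while ((name = g_dir_read_name(dir))) {
            if (g_str_has_prefix(name, "smmu3.")) {
                nodes = g_slist_append(nodes, g_strdup(name));
            }
        }
        g_dir_close(dir);
        /* caller frees with g_slist_free_full(nodes, g_free) */
        return nodes;
    }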
* Create one PCIe Expander Bridge (+ one vSMMU) per pSMMU, allocating
bus numbers downward from the top (0xFF) with intervals reserved for
root ports (PATCH-05); see the sketch after the diagram. E.g. a host
system with three pSMMUs:
[ pcie.0 bus ]
-----------------------------------------------------------------------------
| | | |
----------------- ------------------ ------------------ ------------------
| emulated devs | | smmu_bridge.e5 | | smmu_bridge.ee | | smmu_bridge.f7 |
----------------- ------------------ ------------------ ------------------
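The bus numbers in the diagram (0xe5, 0xee, 0xf7 for three pSMMUs)
suggest that each bridge reserves an interval of 9 bus numbers,
allocated downwards from 0xff. A sketch of that arithmetic, with the
interval value being an assumption (the series may choose differently):

    #include <stdint.h>

    #define SMMU_BRIDGE_INTERVAL 9  /* assumed: 1 bridge bus + root ports */

    static uint8_t smmu_bridge_bus_num(int idx, int num_smmus)
    {
        /* idx = 0 .. num_smmus - 1, counted up from the lowest bridge */
        return 0x100 - (num_smmus - idx) * SMMU_BRIDGE_INTERVAL;
    }
    /* num_smmus = 3: idx 0 -> 0xe5, idx 1 -> 0xee, idx 2 -> 0xf7 */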
* Match vfio-pci devices against the SMMU node list, assign them
automatically in the VIRT code to the corresponding smmu_bridges,
and then attach each by creating a root port (PATCH-06); see the
sketch after the diagram:
[ pcie.0 bus ]
-----------------------------------------------------------------------------
| | | |
----------------- ------------------ ------------------ ------------------
| emulated devs | | smmu_bridge.e5 | | smmu_bridge.ee | | smmu_bridge.f7 |
----------------- ------------------ ------------------ ------------------
|
---------------- -----------
| root_port.ef |---| PCI dev |
---------------- -----------
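A sketch of how a device's host BDF could be matched to one of the
scanned SMMU nodes, assuming the per-device "iommu" sysfs link points
at the SMMU node; what PATCH-06 actually keys on may differ:

    #include <glib.h>

    static char *host_smmu_of_device(const char *bdf)  /* e.g. "0000:06:00.0" */
    {
        g_autofree char *link =
            g_strdup_printf("/sys/bus/pci/devices/%s/iommu", bdf);
        g_autofree char *target = g_file_read_link(link, NULL);

        /* basename of the link target should match an entry in the SMMU
         * node list built above; NULL if the device has no IOMMU link */
        return target ? g_path_get_basename(target) : NULL;
    }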
* Set the "pcie.0" root bus to iommu bypass, so its entire ID space
will be directed to ITS in IORT. If a vfio-pci device chooses to
bypass 2-stage translation, it can be added to "pcie.0" (PATCH-07):
--------------build_iort: its_idmaps
build_iort_id_mapping: input_base=0x0, id_count=0xe4ff, out_ref=0x30
* Map the IDs of the smmu_bridges to the corresponding vSMMUs (PATCH-09);
see the sketch after the log:
--------------build_iort: smmu_idmaps
build_iort_id_mapping: input_base=0xe500, id_count=0x8ff, out_ref=0x48
build_iort_id_mapping: input_base=0xee00, id_count=0x8ff, out_ref=0xa0
build_iort_id_mapping: input_base=0xf700, id_count=0x8ff, out_ref=0xf8
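The values in both logs follow from the bus layout: a PCI requester ID
is (bus << 8 | devfn), so pcie.0 covers IDs 0x0-0xe4ff (buses
0x00-0xe4) and each 9-bus bridge covers 0x900 IDs. A sketch, assuming
id_count uses IORT's "number of IDs minus one" encoding as the log
suggests:

    #include <stdint.h>

    #define SMMU_BRIDGE_INTERVAL 9  /* assumed, matching the sketch above */

    static void smmu_bridge_idmap(uint8_t bridge_bus,
                                  uint32_t *input_base, uint32_t *id_count)
    {
        *input_base = (uint32_t)bridge_bus << 8;         /* 0xe5 -> 0xe500 */
        *id_count   = SMMU_BRIDGE_INTERVAL * 0x100 - 1;  /* 0x8ff */
    }

    static void its_idmap(uint8_t first_bridge_bus,
                          uint32_t *input_base, uint32_t *id_count)
    {
        *input_base = 0;
        *id_count   = ((uint32_t)first_bridge_bus << 8) - 1;  /* 0xe4ff */
    }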
* Finally, "lspci -tv" in the guest looks like this:
-+-[0000:ee]---00.0-[ef]----00.0 [vfio-pci passthrough]
\-[0000:00]-+-00.0 Red Hat, Inc. QEMU PCIe Host bridge
+-01.0 Red Hat, Inc. QEMU PCIe Expander bridge
+-02.0 Red Hat, Inc. QEMU PCIe Expander bridge
+-03.0 Red Hat, Inc. QEMU PCIe Expander bridge
+-04.0 Red Hat, Inc. QEMU NVM Express Controller [emulated]
\-05.0 Intel Corporation 82540EM Gigabit Ethernet [emulated]
[ Topics for discussion ]
* Some of the bits can be moved to backends/iommufd.c, e.g.
-object iommufd,id=iommufd0,[nesting=smmu3,[max-hotplugs=1]]
And I was hoping that the vfio-pci device could take the iommufd
BE pointer so it can redirect the PCI bus. Yet, it seems to be more
complicated than I thought...
* Possibility of adding nesting support for vfio-pci-nohotplug only?
The kernel uAPI (even for nesting cap detection) requires a dev
handler. If a VM boots without a vfio-pci device and then gets a
hotplug after boot-to-console, would a vSMMU that has already
finished a reset cycle need to sync the idr/idrr bits and reset
again?
This series is on Github:
https://github.com/nicolinc/qemu/commits/iommufd_multi_vsmmu-rfcv1
Thanks!
Nicolin
Eric Auger (1):
hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested
binding
Nicolin Chen (9):
hw/arm/virt: Add iommufd link to virt-machine
hw/arm/virt: Get the number of host-level SMMUv3 instances
hw/arm/virt: Add an SMMU_IO_LEN macro
hw/arm/virt: Add VIRT_NESTED_SMMU
hw/arm/virt: Assign vfio-pci devices to nested SMMUs
hw/arm/virt: Bypass iommu for default PCI bus
hw/arm/virt-acpi-build: Handle reserved bus number of pxb buses
hw/arm/virt-acpi-build: Build IORT with multiple SMMU nodes
hw/arm/virt-acpi-build: Enable ATS for nested SMMUv3
hw/arm/virt-acpi-build.c | 144 ++++++++++++++++----
hw/arm/virt.c | 277 +++++++++++++++++++++++++++++++++++++--
include/hw/arm/virt.h | 63 +++++++++
3 files changed, 449 insertions(+), 35 deletions(-)
--
2.43.0