[RFC PATCH 0/5] qemu: Route hostdevs to multiple nested SMMUs

Nathan Chen via Devel posted 5 patches 1 week, 2 days ago
[RFC PATCH 0/5] qemu: Route hostdevs to multiple nested SMMUs
Posted by Nathan Chen via Devel 1 week, 2 days ago
Hi,

This is a draft solution for supporting multiple vSMMU instances in a qemu VM.

Based on the discussion and suggestions received for Nicolin's previous RFC [0],
the association of vSMMUs to VFIO devices in the VM PCIe topology should be
moved out of qemu and into libvirt. In addition, the nested SMMU nodes should be
passed to qemu as pluggable devices.

To address these changes, this patch series introduces a new "nestedSmmuv3"
IOMMU model and "nestedSmmuv3" device type. Upon specifying the nestedSmmuv3
IOMMU model, nestedSmmuv3 devices will be auto-added to the VM definition based
on the available SMMU nodes in the host's sysfs. The nestedSmmuv3 devices will
each be attached to a separate PXB controller, and VFIO devices will be routed
to PXBs based on their association with host SMMU nodes. This keeps a VM PCIe
topology that allows for multiple nested SMMUs, as in Nicolin's original qemu
patch series [0], while following Shameer's work in [1], which removes the VM
topology changes from qemu and lets the nested SMMUs be specified as pluggable
devices.

For instance, if we specify the nestedSmmuv3 IOMMU model and a hostdev for
passthrough:

  <devices>
    <hostdev mode='subsystem' type='pci' managed='no'>
      <source>
        <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    <iommu model='nestedSmmuv3'/>
  </devices>

Libvirt will scan sysfs and populate the VM definition with controllers and
nestedSmmuv3 devices based on host config. So if
/sys/bus/pci/devices/0009:01:00.0/iommu is a symlink to the host SMMU node
represented by
/sys/devices/platform/arm-smmu-v3.8.auto/iommu/smmu3.0x0000000016000000
and there are 3 host SMMU nodes under /sys/class/iommu/, we'll see three
auto-added nestedSmmuv3 devices, each routed to a pcie-expander-bus controller.
Then the hostdev will be routed to a PXB controller that has a matching host
SMMU node associated with it:

  <devices>
    ...
    <controller type='pci' index='1' model='pcie-expander-bus'>
      <model name='pxb-pcie'/>
      <target busNr='254'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </controller>
    <controller type='pci' index='2' model='pcie-expander-bus'>
      <model name='pxb-pcie'/>
      <target busNr='251'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='3' model='pcie-expander-bus'>
      <model name='pxb-pcie'/>
      <target busNr='249'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x8'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/>
    </controller>
    <hostdev mode='subsystem' type='pci' managed='no'>
      <source>
        <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>
    <iommu model='nestedSmmuv3'/>
    <nestedSmmuv3>
      <name>smmu3.0x0000000012000000</name>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </nestedSmmuv3>
    <nestedSmmuv3>
      <name>smmu3.0x0000000016000000</name>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </nestedSmmuv3>
    <nestedSmmuv3>
      <name>smmu3.0x0000000011000000</name>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </nestedSmmuv3>
  </devices>
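
The sysfs lookup described above can be sketched roughly as follows. This is
only an illustration in Python, not the C code in this series: the function
names, the `sysfs_root` parameter, and the assumption that SMMUv3 entries under
/sys/class/iommu/ are named with a "smmu3." prefix are all hypothetical choices
made for the example.

```python
import os


def smmu_node_for_device(bdf, sysfs_root="/sys"):
    """Return the name of the host SMMU node that a PCI device's
    'iommu' symlink resolves to, or None if there is no such link."""
    link = os.path.join(sysfs_root, "bus", "pci", "devices", bdf, "iommu")
    if not os.path.islink(link):
        return None
    # The last path component of the resolved link is the node name,
    # e.g. smmu3.0x0000000016000000.
    return os.path.basename(os.path.realpath(link))


def host_smmu_nodes(sysfs_root="/sys"):
    """List host SMMU node names found under <sysfs_root>/class/iommu.
    Assumption: SMMUv3 nodes are the entries prefixed with 'smmu3.'."""
    path = os.path.join(sysfs_root, "class", "iommu")
    try:
        return sorted(n for n in os.listdir(path) if n.startswith("smmu3."))
    except FileNotFoundError:
        return []
```

With the host layout from the example, smmu_node_for_device("0009:01:00.0")
would yield the node name that is then matched against the PXB controllers.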

TODO:
- No DMA mapping can be found by UEFI when specifying multiple passthrough
devices in the VM definition, and VM boot is subsequently blocked. We don't
encounter this issue when passing through a single device, so we need to
investigate this for the next revision. We'll include iommufd support in the
next revision to narrow down whether the required fix would be outside of
libvirt.

- Shameer's qemu branch specifies nestedSmmuv3 bus number with "pci-bus"
instead of "bus", so the libvirt compilation test args and qemu args in
qemuBuildPCINestedSmmuv3DevProps() need to be modified to match this revision
of qemu. It will be reverted to using "bus" in the next qemu revision.

- This patchset decrements PXB busNr based on how many devices are attached
downstream, and the libvirt documentation states we must reserve busNr for the
PXB itself in addition to any devices attached downstream. When I launch a VM
and a PXB has a pcie-root-port and hostdev attached downstream, busNrs 253,
252, and 251 are reserved. But the PXB itself already has a bus number
assigned via the <address/> attribute, and I see 253 and 252 assigned to the
hostdev and pcie-root-port in the VM but not 251. Should we decrement busNr
based on libvirt documentation or do we only need two busNrs 253 and 252 in
the example here?
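
To make the two options in the busNr question concrete, here is a small sketch
of the arithmetic. It is illustrative only and not taken from the series: the
function name and the `highest_free` parameter (the highest bus number still
unassigned) are hypothetical.

```python
def pxb_bus_nr(highest_free, downstream_devices, reserve_for_pxb=True):
    """Return the busNr for a PXB when bus numbers are handed out
    downward from highest_free.

    reserve_for_pxb=True follows the documented rule (one bus number for
    the PXB itself plus one per downstream device); False reserves only
    one bus number per downstream device."""
    needed = downstream_devices + (1 if reserve_for_pxb else 0)
    return highest_free - needed + 1


# PXB with a pcie-root-port and a hostdev downstream, 253 being the
# highest free bus number:
print(pxb_bus_nr(253, 2, reserve_for_pxb=True))   # 251 (reserves 251-253)
print(pxb_bus_nr(253, 2, reserve_for_pxb=False))  # 252 (reserves 252-253)
```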

This series is on Github:
https://github.com/NathanChenNVIDIA/libvirt/tree/nested-smmuv3-12-05-24

Thanks,
Nathan

[0] https://lore.kernel.org/qemu-devel/cover.1719361174.git.nicolinc@nvidia.com/
[1] https://lore.kernel.org/qemu-devel/20241108125242.60136-1-shameerali.kolothum.thodi@huawei.com/

Signed-off-by: Nathan Chen <nathanc@nvidia.com>

Nathan Chen (5):
  conf: Add a nestedSmmuv3 IOMMU model
  qemu: Implement and auto-add a nestedSmmuv3 device type
  qemu: Create PXBs and auto-assign VFIO devs and nested SMMUs
  qemu: Update PXB busNr for nestedSmmuv3 controllers
  qemu: Add test case for specifying multiple nested SMMUs

 docs/formatdomain.rst                         |  25 ++-
 src/ch/ch_domain.c                            |   1 +
 src/conf/domain_addr.c                        |  26 ++-
 src/conf/domain_addr.h                        |   4 +-
 src/conf/domain_conf.c                        | 188 +++++++++++++++++
 src/conf/domain_conf.h                        |  15 ++
 src/conf/domain_postparse.c                   |   1 +
 src/conf/domain_validate.c                    |  24 +++
 src/conf/schemas/domaincommon.rng             |  17 ++
 src/conf/virconftypes.h                       |   2 +
 src/libvirt_private.syms                      |   2 +
 src/lxc/lxc_driver.c                          |   6 +
 src/qemu/qemu_command.c                       |  64 +++++-
 src/qemu/qemu_command.h                       |   4 +
 src/qemu/qemu_domain.c                        |   2 +
 src/qemu/qemu_domain_address.c                | 193 ++++++++++++++++++
 src/qemu/qemu_driver.c                        |   3 +
 src/qemu/qemu_hotplug.c                       |   5 +
 src/qemu/qemu_postparse.c                     |   1 +
 src/qemu/qemu_validate.c                      |  16 ++
 src/test/test_driver.c                        |   4 +
 tests/meson.build                             |   1 +
 .../iommu-nestedsmmuv3.aarch64-latest.args    |  38 ++++
 .../iommu-nestedsmmuv3.aarch64-latest.xml     |  61 ++++++
 tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml  |  29 +++
 tests/qemuxmlconftest.c                       |   4 +-
 tests/schemas/device.rng.in                   |   1 +
 tests/virnestedsmmuv3mock.c                   |  57 ++++++
 28 files changed, 788 insertions(+), 6 deletions(-)
 create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.args
 create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.xml
 create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml
 create mode 100644 tests/virnestedsmmuv3mock.c

-- 
2.34.1
Re: [RFC PATCH 0/5] qemu: Route hostdevs to multiple nested SMMUs
Posted by Daniel P. Berrangé 1 week, 1 day ago
On Wed, Dec 11, 2024 at 04:24:18PM -0800, Nathan Chen via Devel wrote:
> Hi,
> 
> This is a draft solution for supporting multiple vSMMU instances in a qemu VM.
> 
> Based on discussions/suggestions received for a previous RFC by Nicolin here[0],
> the association of vSMMUs to VFIO devices in VM PCIe topology should be moved
> out of qemu into libvirt. In addition, the nested SMMU nodes should be passed
> to qemu as pluggable devices.
> 
> To address these changes, this patch series introduces a new "nestedSmmuv3"
> IOMMU model and "nestedSmmuv3" device type. Upon specifying the nestedSmmuv3
> IOMMU model, nestedSmmuv3 devices will be auto-added to the VM definition based
> on the available SMMU nodes in the host's sysfs. The nestedSmmuv3 devices will
> each be attached to a separate PXB controller, and VFIO devices will be routed
> to PXBs based on their association with host SMMU nodes. This will maintain a VM
> PCIe topology that allows for multiple nested SMMUs per Nicolin's original qemu
> patch series in [0] and Shameer's work in [1] to remove VM topology changes from
> qemu and allow the nested SMMUs to be specified as pluggable devices.
> 
> For instance, if we specify the nestedSmmuv3 IOMMU model and a hostdev for
> passthrough:
> 
>   <devices>
>     <hostdev mode='subsystem' type='pci' managed='no'>
>       <source>
>         <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
>       </source>
>     </hostdev>
>     <iommu model='nestedSmmuv3'/>
>   </devices>
> 
> Libvirt will scan sysfs and populate the VM definition with controllers and
> nestedSmmuv3 devices based on host config. So if
> /sys/bus/pci/devices/0009:01:00.0/iommu is a symlink to the host SMMU node
> represented by
> /sys/devices/platform/arm-smmu-v3.8.auto/iommu/smmu3.0x0000000016000000
> and there are 3 host SMMU nodes under /sys/class/iommu/, we'll see three
> auto-added nestedSmmuv3 devices, each routed to a pcie-expander-bus controller.
> Then the hostdev will be routed to a PXB controller that has a matching host
> SMMU node associated with it:
> 
>   <devices>
>     ...
>     <controller type='pci' index='1' model='pcie-expander-bus'>
>       <model name='pxb-pcie'/>
>       <target busNr='254'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
>     </controller>
>     <controller type='pci' index='2' model='pcie-expander-bus'>
>       <model name='pxb-pcie'/>
>       <target busNr='251'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
>     </controller>
>     <controller type='pci' index='3' model='pcie-expander-bus'>
>       <model name='pxb-pcie'/>
>       <target busNr='249'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
>     </controller>
>     <controller type='pci' index='4' model='pcie-root-port'>
>       <model name='pcie-root-port'/>
>       <target chassis='7' port='0x8'/>
>       <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/>
>     </controller>
>     <hostdev mode='subsystem' type='pci' managed='no'>
>       <source>
>         <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
>       </source>
>       <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
>     </hostdev>
>     <iommu model='nestedSmmuv3'/>
>     <nestedSmmuv3>
>       <name>smmu3.0x0000000012000000</name>
>       <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
>     </nestedSmmuv3>
>     <nestedSmmuv3>
>       <name>smmu3.0x0000000016000000</name>
>       <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
>     </nestedSmmuv3>
>     <nestedSmmuv3>
>       <name>smmu3.0x0000000011000000</name>
>       <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
>     </nestedSmmuv3>
>     <iommu model='nestedSmmuv3'/>
>   </devices>

Top level libvirt device representation in XML is based on the device
*class*, not the specific device impl. Adding a <nestedSmmuv3> device
type XML element in libvirt is totally inappropriate. Any configuration
must be done beneath the <iommu> element.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [RFC PATCH 0/5] qemu: Route hostdevs to multiple nested SMMUs
Posted by Nathan Chen via Devel 5 days, 20 hours ago
Hi Daniel,

 >Top level libvirt device representation in XML is based on the device
 >*class*, not the specific device impl. Adding a <nestedSmmuv3> device
 >type XML element in libvirt is totally inappropriate. Any configuration
 >must be done beneath the <iommu> element.

Would keeping track of PXB <=> host SMMU nodes be better represented 
with a <nestedSmmuv3> PXB attribute like below, when the "nestedSmmuv3" 
IOMMU model is specified? This method would be simplest IMO because we 
could omit keeping track of the nestedSmmuv3 bus number in the 
virDomainDef struct since its association with a PXB controller would 
already be baked-in.

<devices>
   ...
   <controller type='pci' index='1' model='pcie-expander-bus'>
     <model name='pxb-pcie'/>
     <target busNr='254'/>
     <nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3>
     <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
   </controller>
   ...
   <iommu model='nestedSmmuv3'/>
</devices>
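
With the SMMU node name stored on the PXB like this, routing a hostdev would
reduce to a lookup by node name. A rough sketch of that idea, in Python rather
than libvirt's C, with a hypothetical function name and plain dicts standing in
for the real domain definition structures:

```python
def route_hostdevs(pxb_smmu, hostdev_smmu):
    """Map each hostdev to the index of the PXB controller whose SMMU
    node name matches the hostdev's host SMMU node.

    pxb_smmu:     {controller index: SMMU node name}
    hostdev_smmu: {host PCI address: SMMU node name}
    """
    # Invert the PXB mapping so each node name resolves to one controller.
    by_node = {node: index for index, node in pxb_smmu.items()}
    routes = {}
    for bdf, node in hostdev_smmu.items():
        if node not in by_node:
            raise LookupError(f"no PXB is associated with {node} for {bdf}")
        routes[bdf] = by_node[node]
    return routes


pxbs = {
    1: "smmu3.0x0000000012000000",
    2: "smmu3.0x0000000016000000",
    3: "smmu3.0x0000000011000000",
}
hostdevs = {"0009:01:00.0": "smmu3.0x0000000016000000"}
print(route_hostdevs(pxbs, hostdevs))  # {'0009:01:00.0': 2}
```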

Or would it still be preferred to purely contain the host SMMU node
names and bus numbers within the <iommu> element? If so, the
virDomainDef struct is set up to only have a single virDomainIOMMUDef
member, but we need to keep track of multiple host SMMU node names and
bus numbers. Would we set up multiple virDomainIOMMUDef members? Or add
"char **nestedSmmuv3" and "size_t *nestedSmmuv3Bus" members to the
virDomainIOMMUDef struct to keep track of these multiple host SMMU node
names and bus numbers?

Or we could modify the device info member of virDomainIOMMUDef to be a
variable-length array of device info structs instead of the "size_t
*nestedSmmuv3Bus" member. But I'm not convinced this would be the
cleanest approach, and the qemu command line doesn't specify a slot and
function for arm-smmuv3-nested devices - it just specifies the bus number
to keep track of which PXB is associated with each SMMU node.

If we need a way to associate PXB to SMMU node, we have multiple 
possible approaches, listed below in order of best to worst in my opinion:

1. Adding a <nestedSmmuv3> attribute for PXB controller.
2. Having a single virDomainIOMMUDef struct for virDomainDef. Adding 
variable-length array members to virDomainIOMMUDef for multiple SMMU 
node names and bus numbers.
3. Having a single virDomainIOMMUDef struct for virDomainDef. Adding a 
variable-length array member to virDomainIOMMUDef for SMMU node names. 
Changing the single virDomainDeviceInfo struct to a variable-length 
array of virDomainDeviceInfo structs for multiple nested SMMU bus numbers.
4. Supporting multiple virDomainIOMMUDef structs for virDomainDef.

I would appreciate your thoughts on which method to go with.

Thanks,
Nathan
Re: [RFC PATCH 0/5] qemu: Route hostdevs to multiple nested SMMUs
Posted by Daniel P. Berrangé 5 days, 3 hours ago
On Sun, Dec 15, 2024 at 11:45:56AM -0800, Nathan Chen wrote:
> Hi Daniel,
> 
> >Top level libvirt device representation in XML is based on the device
> >*class*, not the specific device impl. Adding a <nestedSmmuv3> device
> >type XML element in libvirt is totally inappropriate. Any configuration
> >must be done beneath the <iommu> element.
> 
> Would keeping track of PXB <=> host SMMU nodes be better represented with a
> <nestedSmmuv3> PXB attribute like below, when the "nestedSmmuv3" IOMMU model
> is specified? This method would be simplest IMO because we could omit
> keeping track of the nestedSmmuv3 bus number in the virDomainDef struct
> since its association with a PXB controller would already be baked-in.
> 
> <devices>
>   ...
>   <controller type='pci' index='1' model='pcie-expander-bus'>
>     <model name='pxb-pcie'/>
>     <target busNr='254'/>
>     <nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3>
>     <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
>   </controller>
>   ...
>   <iommu model='nestedSmmuv3'/>
> </devices>
> 
> Or would it still be preferred to purely contain the host SMMU node names
> and bus numbers within the <iommu> element? If so, the virDomainDef struct
> is setup to only have a single virDomainIOMMUDef member, but we need to keep
> track of multiple host SMMU node names and bus numbers. Would we setup
> multiple virDomainIOMMUDef members? Or add "char** nestedSmmuv3" and "size_t
> *nestedSmmuv3Bus" members to the virDomainIOMMUDef struct to keep track of
> these multiple host SMMU node names and bus numbers?
> 
> Or we could modify the device info member of virDomainIOMMUDef to be a
> variable-length array of device info structs instead of the "size_t
> *nestedSmmuv3Bus" member. But I'm not convinced this would be the cleanest
> approach, and the qemu command line doesn't specify a slot and function to
> arm-smmuv3-nested devices - it just specifies bus number to keep track of
> which PXB is associated with each SMMU node.
> 
> If we need a way to associate PXB to SMMU node, we have multiple possible
> approaches, listed below in order of best to worst in my opinion:
> 
> 1. Adding a <nestedSmmuv3> attribute for PXB controller.
> 2. Having a single virDomainIOMMUDef struct for virDomainDef. Adding
> variable-length array members to virDomainIOMMUDef for multiple SMMU node
> names and bus numbers.
> 3. Having a single virDomainIOMMUDef struct for virDomainDef. Adding a
> variable-length array member to virDomainIOMMUDef for SMMU node names.
> Changing the single virDomainDeviceInfo struct to a variable-length array of
> virDomainDeviceInfo structs for multiple nested SMMU bus numbers.
> 4. Supporting multiple virDomainIOMMUDef structs for virDomainDef.
> 
> I would appreciate your thoughts on which method to go with.

I'm finding it a little hard to give a recommendation, as I'm not confident
I fully understand the relationship between all the pieces.

Ignoring the specific QEMU impl, can you give a general outline of the
relationship between host SMMU(s), guest SMMU(s) and PCI controllers,
especially the M:N values of the relations?

Also, if this is getting associated with the host SMMU in some way,
does this have implications for live migration compatibility?

With regards,
Daniel
Re: [RFC PATCH 0/5] qemu: Route hostdevs to multiple nested SMMUs
Posted by nathanc--- via Devel 1 week ago
> Top level libvirt device representation in XML is based on the device
> *class*, not the specific device impl. Adding a <nestedSmmuv3> device
> type XML element in libvirt is totally inappropriate. Any configuration
> must be done beneath the <iommu> element.

Hi Daniel,

Thanks for the feedback - I will remove the <nestedSmmuv3> device
type in the next iteration. I had implemented it in order to associate
PXB controllers with host SMMU nodes, but instead I will keep track
of host SMMU node names in a new PXB controller attribute.

Best,
Nathan