Hi,

This is a draft solution for supporting multiple vSMMU instances in a qemu VM.

Based on discussions/suggestions received for a previous RFC by Nicolin here[0],
the association of vSMMUs to VFIO devices in VM PCIe topology should be moved
out of qemu into libvirt. In addition, the nested SMMU nodes should be passed
to qemu as pluggable devices.

To address these changes, this patch series introduces a new "nestedSmmuv3"
IOMMU model and "nestedSmmuv3" device type. Upon specifying the nestedSmmuv3
IOMMU model, nestedSmmuv3 devices will be auto-added to the VM definition based
on the available SMMU nodes in the host's sysfs. The nestedSmmuv3 devices will
each be attached to a separate PXB controller, and VFIO devices will be routed
to PXBs based on their association with host SMMU nodes. This will maintain a VM
PCIe topology that allows for multiple nested SMMUs per Nicolin's original qemu
patch series in [0] and Shameer's work in [1] to remove VM topology changes from
qemu and allow the nested SMMUs to be specified as pluggable devices.

For instance, if we specify the nestedSmmuv3 IOMMU model and a hostdev for
passthrough:

  <devices>
    <hostdev mode='subsystem' type='pci' managed='no'>
      <source>
        <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
      </source>
    </hostdev>
    <iommu model='nestedSmmuv3'/>
  </devices>

Libvirt will scan sysfs and populate the VM definition with controllers and
nestedSmmuv3 devices based on host config. So if
/sys/bus/pci/devices/0009:01:00.0/iommu is a symlink to the host SMMU node
represented by
/sys/devices/platform/arm-smmu-v3.8.auto/iommu/smmu3.0x0000000016000000
and there are 3 host SMMU nodes under /sys/class/iommu/, we'll see three
auto-added nestedSmmuv3 devices, each routed to a pcie-expander-bus controller.
Then the hostdev will be routed to a PXB controller that has a matching host
SMMU node associated with it:

  <devices>
    ...
    <controller type='pci' index='1' model='pcie-expander-bus'>
      <model name='pxb-pcie'/>
      <target busNr='254'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </controller>
    <controller type='pci' index='2' model='pcie-expander-bus'>
      <model name='pxb-pcie'/>
      <target busNr='251'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <controller type='pci' index='3' model='pcie-expander-bus'>
      <model name='pxb-pcie'/>
      <target busNr='249'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='7' port='0x8'/>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/>
    </controller>
    <hostdev mode='subsystem' type='pci' managed='no'>
      <source>
        <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </hostdev>
    <iommu model='nestedSmmuv3'/>
    <nestedSmmuv3>
      <name>smmu3.0x0000000012000000</name>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </nestedSmmuv3>
    <nestedSmmuv3>
      <name>smmu3.0x0000000016000000</name>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </nestedSmmuv3>
    <nestedSmmuv3>
      <name>smmu3.0x0000000011000000</name>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </nestedSmmuv3>
    <iommu model='nestedSmmuv3'/>
  </devices>

TODO:
- No DMA mapping can be found by UEFI when specifying multiple passthrough
  devices in the VM definition, and VM boot is subsequently blocked. We need to
  investigate this for the next revision, but we don't encounter this issue
  when passing through a single device. We'll include iommufd support in the
  next revision to narrow down whether the required fix would be outside of
  libvirt.
- Shameer's qemu branch specifies nestedSmmuv3 bus number with "pci-bus"
  instead of "bus", so the libvirt compilation test args and qemu args in
  qemuBuildPCINestedSmmuv3DevProps() need to be modified to match this revision
  of qemu. It will be reverted to using "bus" in the next qemu revision.
- This patchset decrements PXB busNr based on how many devices are attached
  downstream, and the libvirt documentation states we must reserve busNr for
  the PXB itself in addition to any devices attached downstream. When I launch
  a VM and a PXB has a pcie-root-port and hostdev attached downstream, busNrs
  253, 252, and 251 are reserved. But the PXB itself already has a bus number
  assigned via the <address/> attribute, and I see 253 and 252 assigned to the
  hostdev and pcie-root-port in the VM but not 251. Should we decrement busNr
  based on libvirt documentation or do we only need two busNrs 253 and 252 in
  the example here?

This series is on Github:
https://github.com/NathanChenNVIDIA/libvirt/tree/nested-smmuv3-12-05-24

Thanks,
Nathan

[0] https://lore.kernel.org/qemu-devel/cover.1719361174.git.nicolinc@nvidia.com/
[1] https://lore.kernel.org/qemu-devel/20241108125242.60136-1-shameerali.kolothum.thodi@huawei.com/

Signed-off-by: Nathan Chen <nathanc@nvidia.com>

Nathan Chen (5):
  conf: Add a nestedSmmuv3 IOMMU model
  qemu: Implement and auto-add a nestedSmmuv3 device type
  qemu: Create PXBs and auto-assign VFIO devs and nested SMMUs
  qemu: Update PXB busNr for nestedSmmuv3 controllers
  qemu: Add test case for specifying multiple nested SMMUs

 docs/formatdomain.rst                         |  25 ++-
 src/ch/ch_domain.c                            |   1 +
 src/conf/domain_addr.c                        |  26 ++-
 src/conf/domain_addr.h                        |   4 +-
 src/conf/domain_conf.c                        | 188 +++++++++++++++++
 src/conf/domain_conf.h                        |  15 ++
 src/conf/domain_postparse.c                   |   1 +
 src/conf/domain_validate.c                    |  24 +++
 src/conf/schemas/domaincommon.rng             |  17 ++
 src/conf/virconftypes.h                       |   2 +
 src/libvirt_private.syms                      |   2 +
 src/lxc/lxc_driver.c                          |   6 +
 src/qemu/qemu_command.c                       |  64 +++++-
 src/qemu/qemu_command.h                       |   4 +
 src/qemu/qemu_domain.c                        |   2 +
 src/qemu/qemu_domain_address.c                | 193 ++++++++++++++++++
 src/qemu/qemu_driver.c                        |   3 +
 src/qemu/qemu_hotplug.c                       |   5 +
 src/qemu/qemu_postparse.c                     |   1 +
 src/qemu/qemu_validate.c                      |  16 ++
 src/test/test_driver.c                        |   4 +
 tests/meson.build                             |   1 +
 .../iommu-nestedsmmuv3.aarch64-latest.args    |  38 ++++
 .../iommu-nestedsmmuv3.aarch64-latest.xml     |  61 ++++++
 tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml  |  29 +++
 tests/qemuxmlconftest.c                       |   4 +-
 tests/schemas/device.rng.in                   |   1 +
 tests/virnestedsmmuv3mock.c                   |  57 ++++++
 28 files changed, 788 insertions(+), 6 deletions(-)
 create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.args
 create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.aarch64-latest.xml
 create mode 100644 tests/qemuxmlconfdata/iommu-nestedsmmuv3.xml
 create mode 100644 tests/virnestedsmmuv3mock.c

-- 
2.34.1
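The sysfs lookup described in the cover letter can be sketched roughly as
follows. This is a minimal Python illustration, not the code from the series:
the function names (host_smmu_node, list_host_smmu_nodes,
group_hostdevs_by_smmu) and the sysfs_root parameter are hypothetical, and it
assumes that the "iommu" entry under each PCI device is a symlink whose final
path component is the host SMMU node name, as in the example above.

```python
import os
from collections import defaultdict


def host_smmu_node(pci_addr, sysfs_root="/sys"):
    """Return the host SMMU node name for a PCI device, or None.

    Assumption: <sysfs_root>/bus/pci/devices/<addr>/iommu is a symlink and
    the last component of its target is the node name
    (for example smmu3.0x0000000016000000).
    """
    link = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "iommu")
    if not os.path.islink(link):
        return None
    return os.path.basename(os.path.realpath(link))


def list_host_smmu_nodes(sysfs_root="/sys"):
    """List the entries under <sysfs_root>/class/iommu, sorted by name."""
    class_dir = os.path.join(sysfs_root, "class", "iommu")
    try:
        return sorted(os.listdir(class_dir))
    except FileNotFoundError:
        return []


def group_hostdevs_by_smmu(pci_addrs, sysfs_root="/sys"):
    """Group host PCI addresses by the SMMU node they sit behind.

    Devices without an iommu symlink are skipped. Each resulting group is
    what would be routed to the PXB associated with that SMMU node.
    """
    groups = defaultdict(list)
    for addr in pci_addrs:
        node = host_smmu_node(addr, sysfs_root)
        if node is not None:
            groups[node].append(addr)
    return dict(groups)
```

On a host like the one in the example, group_hostdevs_by_smmu(["0009:01:00.0"])
would be expected to place the device under smmu3.0x0000000016000000, and
list_host_smmu_nodes() would give the set of nodes for which PXB controllers
are auto-added. The sysfs_root parameter exists only so that the lookup can be
exercised against a fake directory tree.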
On Wed, Dec 11, 2024 at 04:24:18PM -0800, Nathan Chen via Devel wrote:
> Hi,
>
> This is a draft solution for supporting multiple vSMMU instances in a qemu VM.
>
> Based on discussions/suggestions received for a previous RFC by Nicolin here[0],
> the association of vSMMUs to VFIO devices in VM PCIe topology should be moved
> out of qemu into libvirt. In addition, the nested SMMU nodes should be passed
> to qemu as pluggable devices.
>
> To address these changes, this patch series introduces a new "nestedSmmuv3"
> IOMMU model and "nestedSmmuv3" device type. Upon specifying the nestedSmmuv3
> IOMMU model, nestedSmmuv3 devices will be auto-added to the VM definition based
> on the available SMMU nodes in the host's sysfs. The nestedSmmuv3 devices will
> each be attached to a separate PXB controller, and VFIO devices will be routed
> to PXBs based on their association with host SMMU nodes. This will maintain a VM
> PCIe topology that allows for multiple nested SMMUs per Nicolin's original qemu
> patch series in [0] and Shameer's work in [1] to remove VM topology changes from
> qemu and allow the nested SMMUs to be specified as pluggable devices.
>
> For instance, if we specify the nestedSmmuv3 IOMMU model and a hostdev for
> passthrough:
>
>   <devices>
>     <hostdev mode='subsystem' type='pci' managed='no'>
>       <source>
>         <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
>       </source>
>     </hostdev>
>     <iommu model='nestedSmmuv3'/>
>   </devices>
>
> Libvirt will scan sysfs and populate the VM definition with controllers and
> nestedSmmuv3 devices based on host config. So if
> /sys/bus/pci/devices/0009:01:00.0/iommu is a symlink to the host SMMU node
> represented by
> /sys/devices/platform/arm-smmu-v3.8.auto/iommu/smmu3.0x0000000016000000
> and there are 3 host SMMU nodes under /sys/class/iommu/, we'll see three
> auto-added nestedSmmuv3 devices, each routed to a pcie-expander-bus controller.
> Then the hostdev will be routed to a PXB controller that has a matching host
> SMMU node associated with it:
>
>   <devices>
>     ...
>     <controller type='pci' index='1' model='pcie-expander-bus'>
>       <model name='pxb-pcie'/>
>       <target busNr='254'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
>     </controller>
>     <controller type='pci' index='2' model='pcie-expander-bus'>
>       <model name='pxb-pcie'/>
>       <target busNr='251'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
>     </controller>
>     <controller type='pci' index='3' model='pcie-expander-bus'>
>       <model name='pxb-pcie'/>
>       <target busNr='249'/>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
>     </controller>
>     <controller type='pci' index='4' model='pcie-root-port'>
>       <model name='pcie-root-port'/>
>       <target chassis='7' port='0x8'/>
>       <address type='pci' domain='0x0000' bus='0x02' slot='0x01' function='0x0'/>
>     </controller>
>     <hostdev mode='subsystem' type='pci' managed='no'>
>       <source>
>         <address domain='0x0009' bus='0x01' slot='0x00' function='0x0'/>
>       </source>
>       <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
>     </hostdev>
>     <iommu model='nestedSmmuv3'/>
>     <nestedSmmuv3>
>       <name>smmu3.0x0000000012000000</name>
>       <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
>     </nestedSmmuv3>
>     <nestedSmmuv3>
>       <name>smmu3.0x0000000016000000</name>
>       <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
>     </nestedSmmuv3>
>     <nestedSmmuv3>
>       <name>smmu3.0x0000000011000000</name>
>       <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
>     </nestedSmmuv3>
>     <iommu model='nestedSmmuv3'/>
>   </devices>

Top level libvirt device representation in XML is based on the device
*class*, not the specific device impl. Adding a <nestedSmmuv3> device
type XML element in libvirt is totally inappropriate. Any configuration
must be done beneath the <iommu> element.
With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Hi Daniel,

>Top level libvirt device representation in XML is based on the device
>*class*, not the specific device impl. Adding a <nestedSmmuv3> device
>type XML element in libvirt is totally inappropriate. Any configuration
>must be done beneath the <iommu> element.

Would keeping track of PXB <=> host SMMU nodes be better represented with a
<nestedSmmuv3> PXB attribute like below, when the "nestedSmmuv3" IOMMU model
is specified? This method would be simplest IMO because we could omit
keeping track of the nestedSmmuv3 bus number in the virDomainDef struct
since its association with a PXB controller would already be baked-in.

  <devices>
    ...
    <controller type='pci' index='1' model='pcie-expander-bus'>
      <model name='pxb-pcie'/>
      <target busNr='254'/>
      <nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </controller>
    ...
    <iommu model='nestedSmmuv3'/>
  </devices>

Or would it still be preferred to purely contain the host SMMU node names
and bus numbers within the <iommu> element? If so, the virDomainDef struct
is setup to only have a single virDomainIOMMUDef member, but we need to keep
track of multiple host SMMU node names and bus numbers. Would we setup
multiple virDomainIOMMUDef members? Or add "char** nestedSmmuv3" and "size_t
*nestedSmmuv3Bus" members to the virDomainIOMMUDef struct to keep track of
these multiple host SMMU node names and bus numbers?

Or we could modify the device info member of virDomainIOMMUDef to be a
variable-length array of device info structs instead of the "size_t
*nestedSmmuv3Bus" member. But I'm not convinced this would be the cleanest
approach, and the qemu command line doesn't specify a slot and function to
arm-smmuv3-nested devices - it just specifies bus number to keep track of
which PXB is associated with each SMMU node.

If we need a way to associate PXB to SMMU node, we have multiple possible
approaches, listed below in order of best to worst in my opinion:

1. Adding a <nestedSmmuv3> attribute for PXB controller.
2. Having a single virDomainIOMMUDef struct for virDomainDef. Adding
   variable-length array members to virDomainIOMMUDef for multiple SMMU node
   names and bus numbers.
3. Having a single virDomainIOMMUDef struct for virDomainDef. Adding a
   variable-length array member to virDomainIOMMUDef for SMMU node names.
   Changing the single virDomainDeviceInfo struct to a variable-length array of
   virDomainDeviceInfo structs for multiple nested SMMU bus numbers.
4. Supporting multiple virDomainIOMMUDef structs for virDomainDef.

I would appreciate your thoughts on which method to go with.

Thanks,
Nathan
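To make the difference between options 1 and 2 concrete, here is a small
Python sketch of the two data layouts. It is only an illustration under stated
assumptions: the class and field names (PxbController, IommuDef, smmu_node,
smmu_nodes, smmu_buses) are hypothetical stand-ins, not libvirt's actual
structs, and it assumes a nested SMMU's bus number equals the index of its PXB
controller, as in the XML from the cover letter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PxbController:
    """Option 1: the host SMMU node name lives on the PXB controller."""
    index: int
    bus_nr: int
    smmu_node: Optional[str] = None  # hypothetical per-controller attribute


@dataclass
class IommuDef:
    """Option 2: one IOMMU definition carrying parallel variable-length lists."""
    model: str
    smmu_nodes: List[str] = field(default_factory=list)
    smmu_buses: List[int] = field(default_factory=list)


def map_from_controllers(controllers: List[PxbController]) -> Dict[int, str]:
    """Bus-to-node mapping read directly off the controllers (option 1)."""
    return {c.index: c.smmu_node for c in controllers if c.smmu_node is not None}


def map_from_iommu(iommu: IommuDef) -> Dict[int, str]:
    """Bus-to-node mapping rebuilt from the parallel lists (option 2)."""
    if len(iommu.smmu_nodes) != len(iommu.smmu_buses):
        raise ValueError("node and bus lists must be the same length")
    return dict(zip(iommu.smmu_buses, iommu.smmu_nodes))
```

With option 1 the association is a property of each controller, so no separate
bus bookkeeping is needed. With option 2 the two lists have to be kept in sync
by hand, which is why map_from_iommu checks their lengths before pairing them.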
On Sun, Dec 15, 2024 at 11:45:56AM -0800, Nathan Chen wrote:
> Hi Daniel,
>
> >Top level libvirt device representation in XML is based on the device
> >*class*, not the specific device impl. Adding a <nestedSmmuv3> device
> >type XML element in libvirt is totally inappropriate. Any configuration
> >must be done beneath the <iommu> element.
>
> Would keeping track of PXB <=> host SMMU nodes be better represented with a
> <nestedSmmuv3> PXB attribute like below, when the "nestedSmmuv3" IOMMU model
> is specified? This method would be simplest IMO because we could omit
> keeping track of the nestedSmmuv3 bus number in the virDomainDef struct
> since its association with a PXB controller would already be baked-in.
>
>   <devices>
>     ...
>     <controller type='pci' index='1' model='pcie-expander-bus'>
>       <model name='pxb-pcie'/>
>       <target busNr='254'/>
>       <nestedSmmuv3>smmu3.0x0000000012000000</nestedSmmuv3>
>       <address type='pci' domain='0x0000' bus='0x00' slot='0x01'
> function='0x0'/>
>     </controller>
>     ...
>     <iommu model='nestedSmmuv3'/>
>   </devices>
>
> Or would it still be preferred to purely contain the host SMMU node names
> and bus numbers within the <iommu> element? If so, the virDomainDef struct
> is setup to only have a single virDomainIOMMUDef member, but we need to keep
> track of multiple host SMMU node names and bus numbers. Would we setup
> multiple virDomainIOMMUDef members? Or add "char** nestedSmmuv3" and "size_t
> *nestedSmmuv3Bus" members to the virDomainIOMMUDef struct to keep track of
> these multiple host SMMU node names and bus numbers?
>
> Or we could modify the device info member of virDomainIOMMUDef to be a
> variable-length array of device info structs instead of the "size_t
> *nestedSmmuv3Bus" member. But I'm not convinced this would be the cleanest
> approach, and the qemu command line doesn't specify a slot and function to
> arm-smmuv3-nested devices - it just specifies bus number to keep track of
> which PXB is associated with each SMMU node.
>
> If we need a way to associate PXB to SMMU node, we have multiple possible
> approaches, listed below in order of best to worst in my opinion:
>
> 1. Adding a <nestedSmmuv3> attribute for PXB controller.
> 2. Having a single virDomainIOMMUDef struct for virDomainDef. Adding
> variable-length array members to virDomainIOMMUDef for multiple SMMU node
> names and bus numbers.
> 3. Having a single virDomainIOMMUDef struct for virDomainDef. Adding a
> variable-length array member to virDomainIOMMUDef for SMMU node names.
> Changing the single virDomainDeviceInfo struct to a variable-length array of
> virDomainDeviceInfo structs for multiple nested SMMU bus numbers.
> 4. Supporting multiple virDomainIOMMUDef structs for virDomainDef.
>
> I would appreciate your thoughts on which method to go with.

I'm finding it a little hard to give a recommendation, as I'm not
confident I fully understand the relationship between all the pieces.

Ignoring the specific QEMU impl, can you give a general outline of the
relationship between host SMMU(s), guest SMMU(s) and PCI controllers,
especially the M:N values of the relations.

Also, if this is getting associated with the host SMMU in some way,
does this have implications for live migration compatibility ?

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> Top level libvirt device representation in XML is based on the device
> *class*, not the specific device impl. Adding a <nestedSmmuv3> device
> type XML element in libvirt is totally inappropriate. Any configuration
> must be done beneath the <iommu> element.

Hi Daniel,

Thanks for the feedback - I will remove the <nestedSmmuv3> device type in
the next iteration. I had implemented it in order to associate PXB
controllers with host SMMU nodes, but instead I will keep track of host SMMU
node names in a new PXB controller attribute.

Best,
Nathan