Hi,
This series adds initial support for a user-creatable "arm-smmuv3-nested"
device to Qemu. At present the Qemu ARM SMMUv3 emulation is per machine
and cannot support multiple SMMUv3s.
In order to support vfio-pci device assignment with vSMMUv3, the physical
SMMUv3 has to be configured in nested mode. Having a pluggable
"arm-smmuv3-nested" device enables us to have multiple vSMMUv3s for Guests
running on a host with multiple physical SMMUv3s. A few benefits of doing
this are:
1. It avoids invalidation broadcast or lookup when devices are behind
   multiple phys SMMUv3s.
2. It makes it easier to handle phys SMMUv3s that differ in features.
3. It makes it easier to handle future requirements such as vCMDQ support.
This is based on discussions/suggestions received for a previous RFC by
Nicolin here[0].
This series includes:
- Support for an "arm-smmuv3-nested" device. At present only virt is
  supported, and the _plug_cb() callback is used to hook up the sysbus mem
  and irq (not sure whether this has any negative repercussions). Patch #3.
- A way to associate a pci-bus (pxb-pcie) with the above device. Patch #3.
- The last patch adds RMR support for MSI doorbell handling (Patch #5).
  This may change in the future[1].
This RFC is for initial discussion/test purposes only and includes only the
patches relevant for adding the "arm-smmuv3-nested" support. For the
complete branch, please see:
https://github.com/hisilicon/qemu/tree/private-smmuv3-nested-dev-rfc-v1
A few ToDos to note:
1. At present, default-bus-bypass-iommu=on should be set when an
   arm-smmuv3-nested device is specified; otherwise you may get an
   IORT-related boot error. Requires fixing.
2. Hot-adding a device is not working at the moment. It looks like a pcihp
   IRQ issue, possibly a bug in the IORT id mappings.
3. The above branch doesn't support vSVA yet.
Hopefully this is helpful in taking the discussion forward. Please take a
look and let me know.
How to use it (example):
On a HiSilicon platform that has multiple physical SMMUv3s, the ACC ZIP VF
devices and HNS VF devices are behind different SMMUv3s. So for a Guest,
specify two smmuv3-nested devices each behind a pxb-pcie as below,
./qemu-system-aarch64 -machine virt,gic-version=3,default-bus-bypass-iommu=on \
-enable-kvm -cpu host -m 4G -smp cpus=8,maxcpus=8 \
-object iommufd,id=iommufd0 \
-bios QEMU_EFI.fd \
-kernel Image \
-device virtio-blk-device,drive=fs \
-drive if=none,file=rootfs.qcow2,id=fs \
-device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \
-device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \
-device arm-smmuv3-nested,id=smmuv1,pci-bus=pcie.1 \
-device vfio-pci,host=0000:7d:02.1,bus=pcie.port1,iommufd=iommufd0 \
-device pxb-pcie,id=pcie.2,bus_nr=16,bus=pcie.0 \
-device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2 \
-device arm-smmuv3-nested,id=smmuv2,pci-bus=pcie.2 \
-device vfio-pci,host=0000:75:00.1,bus=pcie.port2,iommufd=iommufd0 \
-append "rdinit=init console=ttyAMA0 root=/dev/vda2 rw earlycon=pl011,0x9000000" \
-device virtio-9p-pci,fsdev=p9fs2,mount_tag=p9,bus=pcie.0 \
-fsdev local,id=p9fs2,path=p9root,security_model=mapped \
-net none \
-nographic
The Guest will boot with two SMMUv3s:
[ 1.608130] arm-smmu-v3 arm-smmu-v3.0.auto: option mask 0x0
[ 1.609655] arm-smmu-v3 arm-smmu-v3.0.auto: ias 48-bit, oas 48-bit (features 0x00020b25)
[ 1.612475] arm-smmu-v3 arm-smmu-v3.0.auto: allocated 65536 entries for cmdq
[ 1.614444] arm-smmu-v3 arm-smmu-v3.0.auto: allocated 32768 entries for evtq
[ 1.617451] arm-smmu-v3 arm-smmu-v3.1.auto: option mask 0x0
[ 1.618842] arm-smmu-v3 arm-smmu-v3.1.auto: ias 48-bit, oas 48-bit (features 0x00020b25)
[ 1.621366] arm-smmu-v3 arm-smmu-v3.1.auto: allocated 65536 entries for cmdq
[ 1.623225] arm-smmu-v3 arm-smmu-v3.1.auto: allocated 32768 entries for evtq
With a PCI topology like below:
[root@localhost ~]# lspci -tv
-+-[0000:00]-+-00.0 Red Hat, Inc. QEMU PCIe Host bridge
| +-01.0 Red Hat, Inc. QEMU PCIe Expander bridge
| +-02.0 Red Hat, Inc. QEMU PCIe Expander bridge
| \-03.0 Virtio: Virtio filesystem
+-[0000:08]---00.0-[09]----00.0 Huawei Technologies Co., Ltd. HNS Network Controller (Virtual Function)
\-[0000:10]---00.0-[11]----00.0 Huawei Technologies Co., Ltd. HiSilicon ZIP Engine(Virtual Function)
[root@localhost ~]#
And if you want to add another HNS VF, it should be added to the same SMMUv3
as the first HNS device:
-device pcie-root-port,id=pcie.port3,bus=pcie.1,chassis=3 \
-device vfio-pci,host=0000:7d:02.2,bus=pcie.port3,iommufd=iommufd0 \
[root@localhost ~]# lspci -tv
-+-[0000:00]-+-00.0 Red Hat, Inc. QEMU PCIe Host bridge
| +-01.0 Red Hat, Inc. QEMU PCIe Expander bridge
| +-02.0 Red Hat, Inc. QEMU PCIe Expander bridge
| \-03.0 Virtio: Virtio filesystem
+-[0000:08]-+-00.0-[09]----00.0 Huawei Technologies Co., Ltd. HNS Network Controller (Virtual Function)
| \-01.0-[0a]----00.0 Huawei Technologies Co., Ltd. HNS Network Controller (Virtual Function)
\-[0000:10]---00.0-[11]----00.0 Huawei Technologies Co., Ltd. HiSilicon ZIP Engine(Virtual Function)
[root@localhost ~]#
Attempting to add the HNS VF to a different SMMUv3 will result in:
-device vfio-pci,host=0000:7d:02.2,bus=pcie.port3,iommufd=iommufd0: Unable to attach viommu
-device vfio-pci,host=0000:7d:02.2,bus=pcie.port3,iommufd=iommufd0: vfio 0000:7d:02.2:
Failed to set iommu_device: [iommufd=29] error attach 0000:7d:02.2 (38) to id=11: Invalid argument
At present Qemu is not doing any extra validation beyond the above failure
to make sure the user configuration is correct. The assumption is that
libvirt will take care of this.
Thanks,
Shameer
[0] https://lore.kernel.org/qemu-devel/cover.1719361174.git.nicolinc@nvidia.com/
[1] https://lore.kernel.org/linux-iommu/ZrVN05VylFq8lK4q@Asurada-Nvidia/
Eric Auger (1):
hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested
binding
Nicolin Chen (2):
hw/arm/virt: Add an SMMU_IO_LEN macro
hw/arm/virt-acpi-build: Build IORT with multiple SMMU nodes
Shameer Kolothum (2):
hw/arm/smmuv3: Add initial support for SMMUv3 Nested device
hw/arm/smmuv3: Associate a pci bus with a SMMUv3 Nested device
hw/arm/smmuv3.c | 61 ++++++++++++++++++++++
hw/arm/virt-acpi-build.c | 109 ++++++++++++++++++++++++++++++++-------
hw/arm/virt.c | 33 ++++++++++--
hw/core/sysbus-fdt.c | 1 +
include/hw/arm/smmuv3.h | 17 ++++++
include/hw/arm/virt.h | 15 ++++++
6 files changed, 215 insertions(+), 21 deletions(-)
--
2.34.1
On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum via wrote:
> How to use it (example):
>
> On a HiSilicon platform that has multiple physical SMMUv3s, the ACC ZIP VF
> devices and HNS VF devices are behind different SMMUv3s. So for a Guest,
> specify two smmuv3-nested devices each behind a pxb-pcie as below,
>
> ./qemu-system-aarch64 -machine virt,gic-version=3,default-bus-bypass-iommu=on \
> -enable-kvm -cpu host -m 4G -smp cpus=8,maxcpus=8 \
> -object iommufd,id=iommufd0 \
> -bios QEMU_EFI.fd \
> -kernel Image \
> -device virtio-blk-device,drive=fs \
> -drive if=none,file=rootfs.qcow2,id=fs \
> -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \
> -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \
> -device arm-smmuv3-nested,id=smmuv1,pci-bus=pcie.1 \
> -device vfio-pci,host=0000:7d:02.1,bus=pcie.port1,iommufd=iommufd0 \
> -device pxb-pcie,id=pcie.2,bus_nr=16,bus=pcie.0 \
> -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2 \
> -device arm-smmuv3-nested,id=smmuv2,pci-bus=pcie.2 \
> -device vfio-pci,host=0000:75:00.1,bus=pcie.port2,iommufd=iommufd0 \
> -append "rdinit=init console=ttyAMA0 root=/dev/vda2 rw earlycon=pl011,0x9000000" \
> -device virtio-9p-pci,fsdev=p9fs2,mount_tag=p9,bus=pcie.0 \
> -fsdev local,id=p9fs2,path=p9root,security_model=mapped \
> -net none \
> -nographic

Above you say the host has 2 SMMUv3 devices, and you've created 2 SMMUv3
guest devices to match.

The various emails in this thread & the libvirt thread indicate that each
guest SMMUv3 is associated with a host SMMUv3, but I don't see any
property on the command line for 'arm-smmuv3-nested' that tells it which
host SMMUv3 it is to be associated with.

How does this association work?

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Hi Daniel,

> -----Original Message-----
> From: Daniel P. Berrangé <berrange@redhat.com>
> Sent: Thursday, January 30, 2025 4:00 PM
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> [...]
>
> Above you say the host has 2 SMMUv3 devices, and you've created 2 SMMUv3
> guest devices to match.
>
> The various emails in this thread & the libvirt thread indicate that each
> guest SMMUv3 is associated with a host SMMUv3, but I don't see any
> property on the command line for 'arm-smmuv3-nested' that tells it which
> host SMMUv3 it is to be associated with.
>
> How does this association work?

You are right. The association is not very obvious in Qemu. The association
and checking is done implicitly by the kernel at the moment. I will try to
explain it here.

Each "arm-smmuv3-nested" instance, when the first device gets attached to it,
will create an S2 HWPT and a corresponding SMMUv3 domain in the kernel SMMUv3
driver. This domain holds a pointer to the physical SMMUv3 that the device
belongs to, and any other device that belongs to the same physical SMMUv3 can
share this S2 domain.

If a device that belongs to a different physical SMMUv3 gets attached to the
above domain, the HWPT attach will eventually fail because the physical
SMMUv3 recorded in the domain will not match:
https://elixir.bootlin.com/linux/v6.13/source/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c#L2860

And as I mentioned in the cover letter, Qemu will report:

"
Attempting to add the HNS VF to a different SMMUv3 will result in:

-device vfio-pci,host=0000:7d:02.2,bus=pcie.port3,iommufd=iommufd0: Unable to attach viommu
-device vfio-pci,host=0000:7d:02.2,bus=pcie.port3,iommufd=iommufd0: vfio 0000:7d:02.2:
Failed to set iommu_device: [iommufd=29] error attach 0000:7d:02.2 (38) to id=11: Invalid argument

At present Qemu is not doing any extra validation beyond the above failure
to make sure the user configuration is correct. The assumption is that
libvirt will take care of this.
"

So in summary, if libvirt gets it wrong, Qemu will fail with an error.

If a more explicit association is required, some help from the kernel is
needed to identify the physical SMMUv3 associated with the device.

Jason/Nicolin, any thoughts on this?

Thanks,
Shameer
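To make the implicit pairing concrete, below is a minimal, self-contained C
sketch of the idea (hypothetical types and names, not the actual kernel code;
the real check is in arm-smmu-v3.c behind the link above): the S2 domain
records the physical SMMU it was first created on, and a later attach from a
device behind a different physical SMMU fails with -EINVAL, which QEMU
surfaces as the "Invalid argument" error quoted above.

#include <errno.h>
#include <stddef.h>

struct phys_smmu { int id; };                  /* one per physical SMMUv3 */
struct s2_domain { struct phys_smmu *smmu; };  /* pinned on first attach */
struct vfio_dev  { struct phys_smmu *smmu; };  /* pSMMU the device sits behind */

static int attach_dev(struct s2_domain *dom, struct vfio_dev *dev)
{
    if (!dom->smmu) {
        /* the first attached device fixes the vSMMU <-> pSMMU pairing */
        dom->smmu = dev->smmu;
        return 0;
    }
    /* a device behind a different physical SMMU cannot share this domain */
    if (dom->smmu != dev->smmu)
        return -EINVAL;
    return 0;
}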
On Thu, Jan 30, 2025 at 06:09:24PM +0000, Shameerali Kolothum Thodi wrote:
>
> Each "arm-smmuv3-nested" instance, when the first device gets attached
> to it, will create an S2 HWPT and a corresponding SMMUv3 domain in the
> kernel SMMUv3 driver. This domain holds a pointer to the physical SMMUv3
> that the device belongs to, and any other device that belongs to the
> same physical SMMUv3 can share this S2 domain.

Ok, so given two guest SMMUv3s, A and B, and two host SMMUv3s, C and D,
we could end up with A&C and B&D paired, or we could end up with A&D and
B&C paired, depending on whether we plug the first VFIO device into guest
SMMUv3 A or B.

This is bad. Behaviour must not vary depending on the order in which we
create devices.

A guest SMMUv3 is paired to a guest PXB. A guest PXB is liable to be
paired to a guest NUMA node. A guest NUMA node is liable to be paired to
a host NUMA node. The guest/host SMMU pairing must be chosen such that it
makes conceptual sense wrt the guest PXB NUMA to host NUMA pairing.

If the kernel picks guest<->host SMMU pairings on a first-device
first-paired basis, this can end up with incorrect guest NUMA
configurations.

The mgmt apps need to be able to tell QEMU exactly which host SMMU to
pair with each guest SMMU, and QEMU needs to then tell the kernel.

> And as I mentioned in the cover letter, Qemu will report:
> [...]
> So in summary, if libvirt gets it wrong, Qemu will fail with an error.

That's good error checking, and required, but also insufficient as
illustrated above IMHO.

> If a more explicit association is required, some help from the kernel is
> needed to identify the physical SMMUv3 associated with the device.

Yep, I think SMMUv3 info for devices needs to be exposed to userspace,
as well as a mechanism for QEMU to tell the kernel the SMMU mapping.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Hi Daniel,

> -----Original Message-----
> From: Daniel P. Berrangé <berrange@redhat.com>
> Sent: Friday, January 31, 2025 9:42 PM
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> Ok, so given two guest SMMUv3s, A and B, and two host SMMUv3s,
> C and D, we could end up with A&C and B&D paired, or we could
> end up with A&D and B&C paired, depending on whether we plug
> the first VFIO device into guest SMMUv3 A or B.
>
> This is bad. Behaviour must not vary depending on the order
> in which we create devices.
>
> A guest SMMUv3 is paired to a guest PXB. A guest PXB is liable
> to be paired to a guest NUMA node. A guest NUMA node is liable
> to be paired to a host NUMA node. The guest/host SMMU pairing
> must be chosen such that it makes conceptual sense wrt the
> guest PXB NUMA to host NUMA pairing.
>
> If the kernel picks guest<->host SMMU pairings on a first-device
> first-paired basis, this can end up with incorrect guest NUMA
> configurations.

Ok. I am trying to understand how this can happen, as I assume the
Guest PXB numa node is picked based on whatever device we are attaching
to it, i.e. on which numa_id that device belongs to on the physical host.

And the physical SMMUv3's numa id will be the same as the numa_id of the
device it is associated with, won't it?

For example, I have a system here that has 8 phys SMMUv3s, and the numa
assignments on it look something like below:

Phys SMMUv3.0 --> node 0
 \..dev1 --> node 0
Phys SMMUv3.1 --> node 0
 \..dev2 --> node 0
Phys SMMUv3.2 --> node 0
Phys SMMUv3.3 --> node 0

Phys SMMUv3.4 --> node 1
Phys SMMUv3.5 --> node 1
 \..dev5 --> node 1
Phys SMMUv3.6 --> node 1
Phys SMMUv3.7 --> node 1

If I have to assign, say, dev 1, 2 and 5 to a Guest, we need to specify 3
"arm-smmuv3-accel" instances as they belong to different phys SMMUv3s.

-device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0,numa_id=0 \
-device pxb-pcie,id=pcie.2,bus_nr=2,bus=pcie.0,numa_id=0 \
-device pxb-pcie,id=pcie.3,bus_nr=3,bus=pcie.0,numa_id=1 \
-device arm-smmuv3-accel,id=smmuv1,bus=pcie.1 \
-device arm-smmuv3-accel,id=smmuv2,bus=pcie.2 \
-device arm-smmuv3-accel,id=smmuv3,bus=pcie.3 \
-device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \
-device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2 \
-device pcie-root-port,id=pcie.port3,bus=pcie.3,chassis=3 \
-device vfio-pci,host=0000:dev1,bus=pcie.port1,iommufd=iommufd0 \
-device vfio-pci,host=0000:dev2,bus=pcie.port2,iommufd=iommufd0 \
-device vfio-pci,host=0000:dev5,bus=pcie.port3,iommufd=iommufd0

So I guess even if we don't specify the physical SMMUv3 association
explicitly, the kernel will check that based on the devices the Guest
SMMUv3 is attached to (and hence the Numa association), right?

In other words, how does an explicit association help us here?

Or is it that the Guest PXB numa_id allocation is not always based
on the device numa_id? (Maybe I am missing something here. Sorry.)

Thanks,
Shameer
On Thu, Feb 06, 2025 at 10:02:25AM +0000, Shameerali Kolothum Thodi wrote:
> Ok. I am trying to understand how this can happen, as I assume the
> Guest PXB numa node is picked based on whatever device we are attaching
> to it, i.e. on which numa_id that device belongs to on the physical host.
>
> And the physical SMMUv3's numa id will be the same as the numa_id of the
> device it is associated with, won't it?
>
> [...]
>
> So I guess even if we don't specify the physical SMMUv3 association
> explicitly, the kernel will check that based on the devices the Guest
> SMMUv3 is attached to (and hence the Numa association), right?

It isn't about checking the devices, it is about the guest SMMU
getting differing host SMMU associations.

> In other words, how does an explicit association help us here?
>
> Or is it that the Guest PXB numa_id allocation is not always based
> on the device numa_id?

Lets simplify to 2 SMMUs for shorter CLIs.

So to start with we assume a physical host with two SMMUs, and
two PCI devices we want to assign:

0000:dev1 - associated with host SMMU 1, and host NUMA node 0
0000:dev2 - associated with host SMMU 2, and host NUMA node 1

So now we configure QEMU like this:

-device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0,numa_id=0
-device pxb-pcie,id=pcie.2,bus_nr=2,bus=pcie.0,numa_id=1
-device arm-smmuv3-accel,id=smmuv1,bus=pcie.1
-device arm-smmuv3-accel,id=smmuv2,bus=pcie.2
-device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1
-device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2
-device vfio-pci,host=0000:dev1,bus=pcie.port1,iommufd=iommufd0
-device vfio-pci,host=0000:dev2,bus=pcie.port2,iommufd=iommufd0

For brevity I'm not going to show the config for host/guest NUMA mappings,
but assume that guest NUMA node 0 has been configured to map to host NUMA
node 0 and guest node 1 to host node 1.

In this order of QEMU CLI args we get:

VFIO device 0000:dev1 causes the kernel to associate guest smmuv1 with
host SMMU 1.

VFIO device 0000:dev2 causes the kernel to associate guest smmuv2 with
host SMMU 2.

Now consider we swap the ordering of the VFIO devices on the QEMU CLI:

-device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0,numa_id=0
-device pxb-pcie,id=pcie.2,bus_nr=2,bus=pcie.0,numa_id=1
-device arm-smmuv3-accel,id=smmuv1,bus=pcie.1
-device arm-smmuv3-accel,id=smmuv2,bus=pcie.2
-device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1
-device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2
-device vfio-pci,host=0000:dev2,bus=pcie.port2,iommufd=iommufd0
-device vfio-pci,host=0000:dev1,bus=pcie.port1,iommufd=iommufd0

In this order of QEMU CLI args we get:

VFIO device 0000:dev2 causes the kernel to associate guest smmuv1 with
host SMMU 2.

VFIO device 0000:dev1 causes the kernel to associate guest smmuv2 with
host SMMU 1.

This is broken, as now we have inconsistent NUMA mappings between host
and guest. 0000:dev2 is associated with a PXB on NUMA node 1, but
associated with a guest SMMU that was paired with a PXB on NUMA node 0.

This is because the kernel is doing first-come first-matched logic for
mapping guest and host SMMUs, and thus is sensitive to ordering of the
VFIO devices on the CLI.

We need to be ordering invariant, which means libvirt must tell QEMU
which host + guest SMMUs to pair together, and QEMU must in turn tell
the kernel.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> -----Original Message-----
> From: Daniel P. Berrangé <berrange@redhat.com>
> Sent: Thursday, February 6, 2025 10:37 AM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; jgg@nvidia.com;
> nicolinc@nvidia.com; ddutile@redhat.com; Linuxarm
> <linuxarm@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>;
> jiangkunkun <jiangkunkun@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; zhangfei.gao@linaro.org;
> nathanc@nvidia.com
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> On Thu, Feb 06, 2025 at 10:02:25AM +0000, Shameerali Kolothum Thodi
> wrote:
> > Hi Daniel,
> >
> > > -----Original Message-----
> > > From: Daniel P. Berrangé <berrange@redhat.com>
> > > Sent: Friday, January 31, 2025 9:42 PM
> > > To: Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>
> > > Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> > > eric.auger@redhat.com; peter.maydell@linaro.org; jgg@nvidia.com;
> > > nicolinc@nvidia.com; ddutile@redhat.com; Linuxarm
> > > <linuxarm@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>;
> > > jiangkunkun <jiangkunkun@huawei.com>; Jonathan Cameron
> > > <jonathan.cameron@huawei.com>; zhangfei.gao@linaro.org
> > > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-
> creatable
> > > nested SMMUv3
> > >
> > > On Thu, Jan 30, 2025 at 06:09:24PM +0000, Shameerali Kolothum Thodi
> > > wrote:
> > > >
> > > > Each "arm-smmuv3-nested" instance, when the first device gets attached
> > > > to it, will create a S2 HWPT and a corresponding SMMUv3 domain in kernel
> > > > SMMUv3 driver. This domain will have a pointer representing the physical
> > > > SMMUv3 that the device belongs. And any other device which belongs to
> > > > the same physical SMMUv3 can share this S2 domain.
> > >
> > > Ok, so given two guest SMMUv3s, A and B, and two host SMMUv3s,
> > > C and D, we could end up with A&C and B&D paired, or we could
> > > end up with A&D and B&C paired, depending on whether we plug
> > > the first VFIO device into guest SMMUv3 A or B.
> > >
> > > This is bad. Behaviour must not vary depending on the order
> > > in which we create devices.
> > >
> > > A guest SMMUv3 is paired to a guest PXB. A guest PXB is liable
> > > to be paired to a guest NUMA node. A guest NUMA node is liable
> > > to be paired to a host NUMA node. The guest/host SMMU pairing
> > > must be chosen such that it makes conceptual sense wrt the
> > > guest PXB NUMA to host NUMA pairing.
> > >
> > > If the kernel picks guest<->host SMMU pairings on a first-device
> > > first-paired basis, this can end up with incorrect guest NUMA
> > > configurations.
> >
> > Ok. I am trying to understand how this can happen as I assume the
> > Guest PXB numa node is picked up by whatever device we are
> > attaching to it and based on which numa_id that device belongs to
> > in physical host.
> >
> > And the physical smmuv3 numa id will be the same to that of the
> > device numa_id it is associated with. Isn't it?
> >
> > For example I have a system here, that has 8 phys SMMUv3s and numa
> > assignments on this is something like below,
> >
> > Phys SMMUv3.0 --> node 0
> > \..dev1 --> node0
> > Phys SMMUv3.1 --> node 0
> > \..dev2 -->node0
> > Phys SMMUv3.2 --> node 0
> > Phys SMMUv3.3 --> node 0
> >
> > Phys SMMUv3.4 --> node 1
> > Phys SMMUv3.5 --> node 1
> > \..dev5 --> node1
> > Phys SMMUv3.6 --> node 1
> > Phys SMMUv3.7 --> node 1
> >
> >
> > If I have to assign say dev 1, 2 and 5 to a Guest, we need to specify 3
> > "arm-smmuv3-accel" instances as they belong to different phys
> SMMUv3s.
> >
> > -device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0,numa_id=0 \
> > -device pxb-pcie,id=pcie.2,bus_nr=2,bus=pcie.0,numa_id=0 \
> > -device pxb-pcie,id=pcie.3,bus_nr=3,bus=pcie.0,numa_id=1 \
> > -device arm-smmuv3-accel,id=smmuv1,bus=pcie.1 \
> > -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2 \
> > -device arm-smmuv3-accel,id=smmuv3,bus=pcie.3 \
> > -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \
> > -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2 \
> > -device pcie-root-port,id=pcie.port3,bus=pcie.3,chassis=3 \
> > -device vfio-pci,host=0000:dev1,bus=pcie.port1,iommufd=iommufd0 \
> > -device vfio-pci,host=0000:dev2,bus=pcie.port2,iommufd=iommufd0 \
> > -device vfio-pci,host=0000:dev5,bus=pcie.port3,iommufd=iommufd0
> >
> > So I guess even if we don't specify the physical SMMUv3 association
> > explicitly, the kernel will check that based on the devices the Guest
> > SMMUv3 is attached to (and hence the Numa association), right?
>
> It isn't about checking the devices, it is about the guest SMMU
> getting differing host SMMU associations.
>
> > In other words how an explicit association helps us here?
> >
> > Or is it that the Guest PXB numa_id allocation is not always based
> > on device numa_id?
>
> Lets simplify to 2 SMMUs for shorter CLIs.
>
> So to start with we assume physical host with two SMMUs, and
> two PCI devices we want to assign
>
> 0000:dev1 - associated with host SMMU 1, and host NUMA node 0
> 0000:dev2 - associated with host SMMU 2, and host NUMA node 1
>
> So now we configure QEMU like this:
>
> -device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0,numa_id=0
> -device pxb-pcie,id=pcie.2,bus_nr=2,bus=pcie.0,numa_id=1
> -device arm-smmuv3-accel,id=smmuv1,bus=pcie.1
> -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2
> -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1
> -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2
> -device vfio-pci,host=0000:dev1,bus=pcie.port1,iommufd=iommufd0
> -device vfio-pci,host=0000:dev2,bus=pcie.port2,iommufd=iommufd0
>
> For brevity I'm not going to show the config for host/guest NUMA mappings,
> but assume that guest NUMA node 0 has been configured to map to host NUMA
> node 0 and guest node 1 to host node 1.
>
> In this order of QEMU CLI args we get
>
> VFIO device 0000:dev1 causes the kernel to associate guest smmuv1 with
> host SMMU 1.
>
> VFIO device 0000:dev2 causes the kernel to associate guest smmuv2 with
> host SMMU 2.
>
> Now consider we swap the ordering of the VFIO Devices on the QEMU cli
>
>
> -device pxb-pcie,id=pcie.1,bus_nr=1,bus=pcie.0,numa_id=0
> -device pxb-pcie,id=pcie.2,bus_nr=2,bus=pcie.0,numa_id=1
> -device arm-smmuv3-accel,id=smmuv1,bus=pcie.1
> -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2
> -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1
> -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2
> -device vfio-pci,host=0000:dev2,bus=pcie.port2,iommufd=iommufd0
> -device vfio-pci,host=0000:dev1,bus=pcie.port1,iommufd=iommufd0
>
> In this order of QEMU CLI args we get
>
> VFIO device 0000:dev2 causes the kernel to associate guest smmuv1 with
> host SMMU 2.
>
> VFIO device 0000:dev1 causes the kernel to associate guest smmuv2 with
> host SMMU 1.
>
> This is broken, as now we have inconsistent NUMA mappings between host
> and guest. 0000:dev2 is associated with a PXB on NUMA node 1, but
> associated with a guest SMMU that was paired with a PXB on NUMA node
> 0.
Hmm.. I don't think just swapping the order will change the association with
the Guest SMMU here, because we have:

> -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2

At smmuv3-accel realize time, this results in:

pci_setup_iommu(primary_bus, ops, smmu_state);

And when the vfio device realization happens:

set_iommu_device()
  smmu_dev_set_iommu_device(bus, smmu_state, ,)
  --> this is where the guest smmuv3 --> host smmuv3 association is first
      established. And any further vfio device added to this Guest SMMU will
      only succeed if it belongs to the same phys SMMU.

i.e., the Guest SMMU to PCI bus association actually makes sure you have the
same Guest SMMU for the device:

smmuv2 --> pcie.2 --> (pxb-pcie, numa_id = 1)
0000:dev2 --> pcie.port2 --> pcie.2 --> smmuv2 (pxb-pcie, numa_id = 1)

Hence the association of 0000:dev2 to Guest SMMUv2 remains the same.
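To illustrate why the ordering does not matter, here is a minimal,
self-contained C sketch (hypothetical types and names, not the actual Qemu
code): the vSMMU for a device is looked up from the PCI bus it is plugged
into, which is fixed by the -device topology, so the order in which the
vfio-pci devices are realized cannot change the pairing.

#include <stdio.h>

struct vsmmu    { const char *id; };
struct pci_bus  { const char *id; struct vsmmu *smmu; };  /* set at smmuv3 realize */
struct vfio_dev { const char *bdf; struct pci_bus *bus; };

/* models the lookup done at vfio device realize: the pairing comes from
 * dev->bus, not from the order in which devices are realized */
static struct vsmmu *vsmmu_for_dev(struct vfio_dev *dev)
{
    return dev->bus->smmu;
}

int main(void)
{
    struct vsmmu smmuv1 = { "smmuv1" }, smmuv2 = { "smmuv2" };
    struct pci_bus pcie1 = { "pcie.1", &smmuv1 }, pcie2 = { "pcie.2", &smmuv2 };
    struct vfio_dev dev1 = { "0000:dev1", &pcie1 }, dev2 = { "0000:dev2", &pcie2 };

    /* realize dev2 first, then dev1: the result is the same either way */
    printf("%s -> %s\n", dev2.bdf, vsmmu_for_dev(&dev2)->id);  /* 0000:dev2 -> smmuv2 */
    printf("%s -> %s\n", dev1.bdf, vsmmu_for_dev(&dev1)->id);  /* 0000:dev1 -> smmuv1 */
    return 0;
}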
I hope this is clear. And I am not sure the association can be broken in any
other way unless the Qemu CLI assigns the device to a different PXB.

Maybe one of my earlier replies caused the confusion that the ordering of the
VFIO devices on the QEMU CLI will affect the association.
Thanks,
Shameer
On Thu, Feb 06, 2025 at 01:51:15PM +0000, Shameerali Kolothum Thodi wrote:
> Hmm..I don’t think just swapping the order will change the association with
> Guest SMMU here. Because, we have,
>
> > -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2
>
> During smmuv3-accel realize time, this will result in,
> pci_setup_iommu(primary_bus, ops, smmu_state);
>
> And when the vfio dev realization happens,
> set_iommu_device()
> smmu_dev_set_iommu_device(bus, smmu_state, ,)
> --> this is where the guest smmuv3-->host smmuv3 association is first
> established. And any further vfio dev to this Guest SMMU will
> only succeeds if it belongs to the same phys SMMU.
>
> ie, the Guest SMMU to pci bus association, actually make sure you have the
> same Guest SMMU for the device.
Ok, so at the time of VFIO device realize, QEMU is telling the kernel
to associate a physical SMMU, and it's doing this with the virtual
SMMU attached to the PXB parenting the VFIO device.
> smmuv2 --> pcie.2 --> (pxb-pcie, numa_id = 1)
> 0000:dev2 --> pcie.port2 --> pcie.2 --> smmuv2 (pxb-pcie, numa_id = 1)
>
> Hence the association of 0000:dev2 to Guest SMMUv2 remain same.
Yes, I concur the SMMU physical <-> virtual association should
be fixed, as long as the same VFIO device is always added to
the same virtual SMMU.
> I hope this is clear. And I am not sure the association will be broken in any
> other way unless Qemu CLI specify the dev to a different PXB.
Although the ordering is at least predictable, I remain uncomfortable
about the idea of the virtual SMMU association with the physical SMMU
being a side effect of the VFIO device placement.
There is still the open door for admin mis-configuration that will not
be diagnosed. E.g. consider we attach VFIO device 1 from host NUMA
node 1 to a PXB associated with host NUMA node 0. As long as that's
the first VFIO device, the kernel will happily associate the physical
and guest SMMUs.
If we set the physical/guest SMMU relationship directly, then at the
time the VFIO device is plugged, we can diagnose the incorrectly
placed VFIO device, and better reason about behaviour.
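For illustration only, an explicit pairing could look something like the
property spelling floated later in this thread (hypothetical syntax, not an
implemented option; "iommu.0"/"iommu.1" are assumed host SMMU identifiers):

-device arm-smmuv3-accel,id=smmuv1,bus=pcie.1,host-smmu=iommu.0 \
-device arm-smmuv3-accel,id=smmuv2,bus=pcie.2,host-smmu=iommu.1 \

With something like that, a vfio-pci device plugged under pcie.1 but sitting
behind a different physical SMMU could be rejected at plug time.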
I've another question about unplug behaviour..
1. Plug a VFIO device for host SMMU 1 into a PXB with guest SMMU 1.
=> Kernel associates host SMMU 1 and guest SMMU 1 together
2. Unplug this VFIO device
3. Plug a VFIO device for host SMMU 2 into a PXB with guest SMMU 1.
Does the host/guest SMMU 1 <-> 1 association remain set after step 2,
implying step 3 will fail? Or does it get unset, allowing step 3
to succeed and establish a new mapping from host SMMU 2 to guest SMMU 1?

If step 2 does NOT break the association, do we preserve that
across a savevm+loadvm sequence of QEMU? If we don't, then step
3 would fail before the savevm, but succeed after the loadvm.
Explicitly representing the host SMMU association on the guest SMMU
config makes this behaviour unambiguous. The host / guest SMMU
relationship is fixed for the lifetime of the VM and invariant of
whatever VFIO device is (or was previously) plugged.
So I still go back to my general principle that automatic side effects
are an undesirable idea in QEMU configuration. We have a long tradition
of making everything entirely explicit to produce easily predictable
behaviour.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
> -----Original Message-----
> From: Daniel P. Berrangé <berrange@redhat.com>
> Sent: Thursday, February 6, 2025 2:47 PM
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> Although the ordering is at least predictable, I remain uncomfortable
> about the idea of the virtual SMMU association with the physical SMMU
> being a side effect of the VFIO device placement.
>
> There is still the open door for admin mis-configuration that will not
> be diagnosed. E.g. consider we attach VFIO device 1 from host NUMA
> node 1 to a PXB associated with host NUMA node 0. As long as that's
> the first VFIO device, the kernel will happily associate the physical
> and guest SMMUs.

Yes. A mis-configuration can place it on a wrong one.

> If we set the physical/guest SMMU relationship directly, then at the
> time the VFIO device is plugged, we can diagnose the incorrectly
> placed VFIO device, and better reason about behaviour.

Agree.

> I've another question about unplug behaviour..
>
> 1. Plug a VFIO device for host SMMU 1 into a PXB with guest SMMU 1.
>    => Kernel associates host SMMU 1 and guest SMMU 1 together
> 2. Unplug this VFIO device
> 3. Plug a VFIO device for host SMMU 2 into a PXB with guest SMMU 1.
>
> Does the host/guest SMMU 1 <-> 1 association remain set after step 2,
> implying step 3 will fail? Or does it get unset, allowing step 3
> to succeed and establish a new mapping from host SMMU 2 to guest SMMU 1?

At the moment the first association is not persistent. So a new mapping
is possible.

> If step 2 does NOT break the association, do we preserve that
> across a savevm+loadvm sequence of QEMU? If we don't, then step
> 3 would fail before the savevm, but succeed after the loadvm.

Right. I haven't attempted migration tests yet. But I agree that an explicit
association is better for migration compatibility. Also, I am not sure how we
handle it if the target has a different phys SMMUv3 <--> dev mapping.

> Explicitly representing the host SMMU association on the guest SMMU
> config makes this behaviour unambiguous. The host / guest SMMU
> relationship is fixed for the lifetime of the VM and invariant of
> whatever VFIO device is (or was previously) plugged.
>
> So I still go back to my general principle that automatic side effects
> are an undesirable idea in QEMU configuration. We have a long tradition
> of making everything entirely explicit to produce easily predictable
> behaviour.

Ok. Convinced 😊. Thanks for explaining.

Shameer
On Thu, Feb 06, 2025 at 03:07:06PM +0000, Shameerali Kolothum Thodi wrote:
> > If we set the physical/guest SMMU relationship directly, then at the
> > time the VFIO device is plugged, we can diagnose the incorrectly
> > placed VFIO device, and better reason about behaviour.
>
> Agree.

Can you just take in a VFIO cdev FD reference on this command line:

-device arm-smmuv3-accel,id=smmuv2,bus=pcie.2

And that will lock the pSMMU/vSMMU relationship?

Jason
On Thu, Feb 06, 2025 at 01:02:38PM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 06, 2025 at 03:07:06PM +0000, Shameerali Kolothum Thodi wrote:
> > > If we set the physical/guest SMMU relationship directly, then at the
> > > time the VFIO device is plugged, we can diagnose the incorrectly
> > > placed VFIO device, and better reason about behaviour.
> >
> > Agree.
>
> Can you just take in a VFIO cdev FD reference on this command line:
>
> -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2
>
> And that will lock the pSMMU/vSMMU relationship?

We shouldn't assume any VFIO device exists in the QEMU config at the time
we realize the virtual SMMU. I expect the SMMU may be cold plugged, while
the VFIO devices may be hot plugged arbitrarily later, and we should have
the association initialized when the SMMU is realized.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
On Thu, Feb 06, 2025 at 05:10:32PM +0000, Daniel P. Berrangé wrote:
> We shouldn't assume any VFIO device exists in the QEMU config at the time
> we realize the virtual SMMU. I expect the SMMU may be cold plugged, while
> the VFIO devices may be hot plugged arbitrarily later, and we should have
> the association initialized when the SMMU is realized.

This is not supported kernel side, you can't instantiate a vIOMMU
without a VFIO device that uses it. For security.

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, February 6, 2025 5:47 PM
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> On Thu, Feb 06, 2025 at 05:10:32PM +0000, Daniel P. Berrangé wrote:
> > We shouldn't assume any VFIO device exists in the QEMU config at the time
> > we realize the virtual SMMU. I expect the SMMU may be cold plugged, while
> > the VFIO devices may be hot plugged arbitrarily later, and we should have
> > the association initialized when the SMMU is realized.
>
> This is not supported kernel side, you can't instantiate a vIOMMU
> without a VFIO device that uses it. For security.

I think that is fine if Qemu knows about the association beforehand. During
vIOMMU instantiation it can cross-check whether the user-specified
pSMMU <-> vSMMU is correct for the device.

Also, how do we do it with multiple VF devices under a pSMMU? Which
cdev fd in that case?

Thanks,
Shameer
On Thu, Feb 06, 2025 at 05:57:38PM +0000, Shameerali Kolothum Thodi wrote:
> Also, how do we do it with multiple VF devices under a pSMMU? Which
> cdev fd in that case?

It doesn't matter, they are all interchangeable. Creating the vIOMMU
object just requires any vfio device that is attached to the physical
SMMU.

Jason
On Thu, Feb 06, 2025 at 01:46:47PM -0400, Jason Gunthorpe wrote:
> This is not supported kernel side, you can't instantiate a vIOMMU
> without a VFIO device that uses it. For security.

What are the security concerns here?

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
On Thu, Feb 06, 2025 at 05:54:57PM +0000, Daniel P. Berrangé wrote:
> > This is not supported kernel side, you can't instantiate a vIOMMU
> > without a VFIO device that uses it. For security.
>
> What are the security concerns here?

You should not be able to open iommufd and manipulate iommu HW that
you don't have a VFIO descriptor for, including creating physical
vIOMMU resources, allocating command queues and whatever else.

Some kind of hot plug SMMU would have to create a vSMMU without any
kernel backing and then later bind it to a kernel implementation.

Jason
On Thu, Feb 06, 2025 at 01:58:43PM -0400, Jason Gunthorpe wrote:
> You should not be able to open iommufd and manipulate iommu HW that
> you don't have a VFIO descriptor for, including creating physical
> vIOMMU resources, allocating command queues and whatever else.
>
> Some kind of hot plug SMMU would have to create a vSMMU without any
> kernel backing and then later bind it to a kernel implementation.

Ok, so if we give the info about the vSMMU <-> pSMMU binding to QEMU
upfront, it can delay using it until the point where the kernel accepts
it. This at least gives a clear design to applications outside QEMU, and
keeps the low level impl details inside QEMU.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, February 6, 2025 5:59 PM
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> You should not be able to open iommufd and manipulate iommu HW that
> you don't have a VFIO descriptor for, including creating physical
> vIOMMU resources, allocating command queues and whatever else.
>
> Some kind of hot plug SMMU would have to create a vSMMU without any
> kernel backing and then later bind it to a kernel implementation.

Not sure I get the problem with associating a vSMMU with a pSMMU. Something
like an iommu instance id mentioned before:

-device arm-smmuv3-accel,id=smmuv2,bus=pcie.2,host-smmu=iommu.1

This can realize the vSMMU without actually creating a vIOMMU in the kernel.
And when the device gets attached/realized, check (GET_HW_INFO) whether the
specified iommu instance id matches or not.

Or is the concern here exporting an iommu instance id to user space?

Thanks,
Shameer
On Thu, Feb 06, 2025 at 06:04:57PM +0000, Shameerali Kolothum Thodi wrote:
> Not sure I get the problem with associating a vSMMU with a pSMMU. Something
> like an iommu instance id mentioned before:
>
> -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2,host-smmu=iommu.1
>
> This can realize the vSMMU without actually creating a vIOMMU in the kernel.
> And when the device gets attached/realized, check (GET_HW_INFO) whether the
> specified iommu instance id matches or not.
>
> Or is the concern here exporting an iommu instance id to user space?

Philosophically we do not permit any HW access through iommufd without
a VFIO fd to "prove" the process has rights to touch hardware.

We don't have any way to prove the process has rights to touch the
iommu hardware separately from VFIO.

So even if you invent an iommu ID we cannot accept it as a handle to
create a viommu in iommufd.

Jason
> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, February 6, 2025 6:13 PM
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> Philosophically we do not permit any HW access through iommufd without
> a VFIO fd to "prove" the process has rights to touch hardware.
>
> We don't have any way to prove the process has rights to touch the
> iommu hardware separately from VFIO.

It is not touching the hardware. Qemu just instantiates a vSMMU and assigns
the IOMMU instance id to it.

> So even if you invent an iommu ID we cannot accept it as a handle to
> create a viommu in iommufd.

Creating the vIOMMU only happens when the user does a cold/hot plug of
a VFIO device. At that time Qemu checks whether the assigned id matches
with whatever the kernel tells it.

Thanks,
Shameer
On Thu, Feb 06, 2025 at 06:18:14PM +0000, Shameerali Kolothum Thodi wrote:
> > So even if you invent an iommu ID we cannot accept it as a handle to
> > create a viommu in iommufd.
>
> Creating the vIOMMU only happens when the user does a cold/hot plug of
> a VFIO device. At that time Qemu checks whether the assigned id matches
> with whatever the kernel tell it.

This is not hard up until the guest is started. If you boot a guest
without a backing viommu iommufd object then there will be some more
complexities.

Jason
On Thu, Feb 06, 2025 at 02:22:01PM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 06, 2025 at 06:18:14PM +0000, Shameerali Kolothum Thodi wrote:
>
> > > So even if you invent an iommu ID we cannot accept it as a handle to
> > > create a viommu in iommufd.
> >
> > Creating the vIOMMU only happens when the user does a cold/hot plug of
> > a VFIO device. At that time Qemu checks whether the assigned id matches
> > with whatever the kernel tell it.
>
> This is not hard up until the guest is started. If you boot a guest
> without a backing viommu iommufd object then there will be some more
> complexities.

Yea, I imagined that things would be complicated with hotplugs..

On one hand, I got the part that we need some fixed link beforehand
to ease migration/hotplugs.

On the other hand, all IOMMUFD ioctls need a VFIO device FD, which
brings the immediate attention that we cannot even decide vSMMU's
capabilities being reflected in its IDR/IIDR registers, without a
coldplug device -- if we boot a VM (one vSMMU<->pSMMU) with only a
hotplug device, the IOMMU_GET_HW_INFO cannot be done during guest
kernel probing of the vSMMU instance. So we would have to reset the
vSMMU "HW" after the device hotplug?

Nicolin
> -----Original Message----- > From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Thursday, February 6, 2025 8:33 PM > To: Shameerali Kolothum Thodi > <shameerali.kolothum.thodi@huawei.com>; Daniel P. Berrangé > <berrange@redhat.com>; Jason Gunthorpe <jgg@nvidia.com> > Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org; > eric.auger@redhat.com; peter.maydell@linaro.org; ddutile@redhat.com; > Linuxarm <linuxarm@huawei.com>; Wangzhou (B) > <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>; > Jonathan Cameron <jonathan.cameron@huawei.com>; > zhangfei.gao@linaro.org; nathanc@nvidia.com > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable > nested SMMUv3 > > On Thu, Feb 06, 2025 at 02:22:01PM -0400, Jason Gunthorpe wrote: > > On Thu, Feb 06, 2025 at 06:18:14PM +0000, Shameerali Kolothum Thodi > wrote: > > > > > > So even if you invent an iommu ID we cannot accept it as a handle to > > > > create viommu in iommufd. > > > > > > Creating the vIOMMU only happens when the user does a cold/hot > plug of > > > a VFIO device. At that time Qemu checks whether the assigned id > matches > > > with whatever the kernel tell it. > > > > This is not hard up until the guest is started. If you boot a guest > > without a backing viommu iommufd object then there will be some more > > complexities. > > Yea, I imagined that things would be complicated with hotplugs.. > > On one hand, I got the part that we need some fixed link forehand > to ease migration/hotplugs. > > On the other hand, all IOMMUFD ioctls need a VFIO device FD, which > brings the immediate attention that we cannot even decide vSMMU's > capabilities being reflected in its IDR/IIDR registers, without a > coldplug device -- if we boot a VM (one vSMMU<->pSMMU) with only a > hotplug device, the IOMMU_GET_HW_INFO cannot be done during guest Right. I forgot about the call to smmu_dev_get_info() during the reset. That means we need at least one dev per Guest SMMU during Guest boot :( Thanks, Shameer
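[Editor's note: for reference, the probe being talked about here boils down to
a single iommufd ioctl. The sketch below (6.7+ uAPI, error handling trimmed)
assumes dev_id came from an earlier VFIO_DEVICE_BIND_IOMMUFD, which is exactly
why at least one cold-plugged device is needed per guest SMMU at boot.]

/*
 * Minimal sketch of the IOMMU_GET_HW_INFO probe. The iommufd dev_id only
 * exists after a VFIO device has been bound to the iommufd, which is the
 * restriction being discussed above.
 */
#include <stdio.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int read_host_smmu_idr(int iommufd, __u32 dev_id)
{
    struct iommu_hw_info_arm_smmuv3 smmu = {};
    struct iommu_hw_info cmd = {
        .size = sizeof(cmd),
        .dev_id = dev_id,
        .data_len = sizeof(smmu),
        .data_uptr = (uintptr_t)&smmu,
    };

    if (ioctl(iommufd, IOMMU_GET_HW_INFO, &cmd)) {
        return -1;
    }
    if (cmd.out_data_type != IOMMU_HW_INFO_TYPE_ARM_SMMUV3) {
        return -1;          /* device is not behind an SMMUv3 */
    }
    /* These are the values a vSMMU would mirror into its IDR/IIDR space. */
    printf("IDR0=%#x IDR5=%#x IIDR=%#x\n",
           smmu.idr[0], smmu.idr[5], smmu.iidr);
    return 0;
}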
On Fri, Feb 07, 2025 at 10:21:17AM +0000, Shameerali Kolothum Thodi wrote: > > > > -----Original Message----- > > From: Nicolin Chen <nicolinc@nvidia.com> > > Sent: Thursday, February 6, 2025 8:33 PM > > To: Shameerali Kolothum Thodi > > <shameerali.kolothum.thodi@huawei.com>; Daniel P. Berrangé > > <berrange@redhat.com>; Jason Gunthorpe <jgg@nvidia.com> > > Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org; > > eric.auger@redhat.com; peter.maydell@linaro.org; ddutile@redhat.com; > > Linuxarm <linuxarm@huawei.com>; Wangzhou (B) > > <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>; > > Jonathan Cameron <jonathan.cameron@huawei.com>; > > zhangfei.gao@linaro.org; nathanc@nvidia.com > > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable > > nested SMMUv3 > > > > On Thu, Feb 06, 2025 at 02:22:01PM -0400, Jason Gunthorpe wrote: > > > On Thu, Feb 06, 2025 at 06:18:14PM +0000, Shameerali Kolothum Thodi > > wrote: > > > > > > > > So even if you invent an iommu ID we cannot accept it as a handle to > > > > > create viommu in iommufd. > > > > > > > > Creating the vIOMMU only happens when the user does a cold/hot > > plug of > > > > a VFIO device. At that time Qemu checks whether the assigned id > > matches > > > > with whatever the kernel tell it. > > > > > > This is not hard up until the guest is started. If you boot a guest > > > without a backing viommu iommufd object then there will be some more > > > complexities. > > > > Yea, I imagined that things would be complicated with hotplugs.. > > > > On one hand, I got the part that we need some fixed link forehand > > to ease migration/hotplugs. > > > > On the other hand, all IOMMUFD ioctls need a VFIO device FD, which > > brings the immediate attention that we cannot even decide vSMMU's > > capabilities being reflected in its IDR/IIDR registers, without a > > coldplug device -- if we boot a VM (one vSMMU<->pSMMU) with only a > > hotplug device, the IOMMU_GET_HW_INFO cannot be done during guest > > Right. I forgot about the call to smmu_dev_get_info() during the reset. > That means we need at least one dev per Guest SMMU during Guest > boot :( That's pretty unpleasant as a usage restriction. It sounds like there needs to be a way to configure & control the vIOMMU independantly of attaching a specific VFIO device. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
> -----Original Message----- > From: Daniel P. Berrangé <berrange@redhat.com> > Sent: Friday, February 7, 2025 10:32 AM > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> > Cc: Nicolin Chen <nicolinc@nvidia.com>; Jason Gunthorpe > <jgg@nvidia.com>; qemu-arm@nongnu.org; qemu-devel@nongnu.org; > eric.auger@redhat.com; peter.maydell@linaro.org; ddutile@redhat.com; > Linuxarm <linuxarm@huawei.com>; Wangzhou (B) > <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>; > Jonathan Cameron <jonathan.cameron@huawei.com>; > zhangfei.gao@linaro.org; nathanc@nvidia.com > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable > nested SMMUv3 > > On Fri, Feb 07, 2025 at 10:21:17AM +0000, Shameerali Kolothum Thodi > wrote: > > > > > > > -----Original Message----- > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > Sent: Thursday, February 6, 2025 8:33 PM > > > To: Shameerali Kolothum Thodi > > > <shameerali.kolothum.thodi@huawei.com>; Daniel P. Berrangé > > > <berrange@redhat.com>; Jason Gunthorpe <jgg@nvidia.com> > > > Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org; > > > eric.auger@redhat.com; peter.maydell@linaro.org; > ddutile@redhat.com; > > > Linuxarm <linuxarm@huawei.com>; Wangzhou (B) > > > <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>; > > > Jonathan Cameron <jonathan.cameron@huawei.com>; > > > zhangfei.gao@linaro.org; nathanc@nvidia.com > > > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user- > creatable > > > nested SMMUv3 > > > > > > On Thu, Feb 06, 2025 at 02:22:01PM -0400, Jason Gunthorpe wrote: > > > > On Thu, Feb 06, 2025 at 06:18:14PM +0000, Shameerali Kolothum > Thodi > > > wrote: > > > > > > > > > > So even if you invent an iommu ID we cannot accept it as a handle > to > > > > > > create viommu in iommufd. > > > > > > > > > > Creating the vIOMMU only happens when the user does a cold/hot > > > plug of > > > > > a VFIO device. At that time Qemu checks whether the assigned id > > > matches > > > > > with whatever the kernel tell it. > > > > > > > > This is not hard up until the guest is started. If you boot a guest > > > > without a backing viommu iommufd object then there will be some > more > > > > complexities. > > > > > > Yea, I imagined that things would be complicated with hotplugs.. > > > > > > On one hand, I got the part that we need some fixed link forehand > > > to ease migration/hotplugs. > > > > > > On the other hand, all IOMMUFD ioctls need a VFIO device FD, which > > > brings the immediate attention that we cannot even decide vSMMU's > > > capabilities being reflected in its IDR/IIDR registers, without a > > > coldplug device -- if we boot a VM (one vSMMU<->pSMMU) with only a > > > hotplug device, the IOMMU_GET_HW_INFO cannot be done during > guest > > > > Right. I forgot about the call to smmu_dev_get_info() during the reset. > > That means we need at least one dev per Guest SMMU during Guest > > boot :( > > That's pretty unpleasant as a usage restriction. It sounds like there > needs to be a way to configure & control the vIOMMU independantly of > attaching a specific VFIO device. Yes, that would be ideal. Just wondering whether we can have something like the vfio_register_iommu_driver() for iommufd subsystem by which it can directly access iommu drivers ops(may be a restricted set). Not sure about the layering violations and other security issues with that... Thanks, Shameer
On Fri, Feb 07, 2025 at 12:21:54PM +0000, Shameerali Kolothum Thodi wrote:
> Just wondering whether we can have something like the
> vfio_register_iommu_driver() for iommufd subsystem by which it can directly
> access iommu drivers ops(may be a restricted set).

I very much want to try hard to avoid that.

AFAICT you do not need a VFIO device, or access to the HW_INFO of the
smmu, to start up a SMMU driver.

Yes, you cannot later attach a VFIO device with a pSMMU that materially
differs from the vSMMU setup, but that is fine.

qemu has long had a duality where you can either "inherit from host"
for an easy setup or be "fully specified" and support live
migration/etc. CPUID as a simple example.

So, what the smmu patches are doing now is "inherit from host" and
that requires a VFIO device to work. I think that is fine.

If you want to do full hotplug then you need to be "fully specified" on
the command line so a working vSMMU can be shown to the guest with no
devices, and no kernel involvement. Obviously this is a highly advanced
operating mode as things like IIDR and errata need to be considered,
but I would guess booting with no vPCI devices is already abnormal.

Jason
On Thu, Feb 06, 2025 at 12:33:19PM -0800, Nicolin Chen wrote:
> On Thu, Feb 06, 2025 at 02:22:01PM -0400, Jason Gunthorpe wrote:
> > On Thu, Feb 06, 2025 at 06:18:14PM +0000, Shameerali Kolothum Thodi wrote:
> >
> > > > So even if you invent an iommu ID we cannot accept it as a handle to
> > > > create a viommu in iommufd.
> > >
> > > Creating the vIOMMU only happens when the user does a cold/hot plug of
> > > a VFIO device. At that time Qemu checks whether the assigned id matches
> > > with whatever the kernel tell it.
> >
> > This is not hard up until the guest is started. If you boot a guest
> > without a backing viommu iommufd object then there will be some more
> > complexities.
>
> Yea, I imagined that things would be complicated with hotplugs..
>
> On one hand, I got the part that we need some fixed link beforehand
> to ease migration/hotplugs.
>
> On the other hand, all IOMMUFD ioctls need a VFIO device FD, which
> brings the immediate attention that we cannot even decide vSMMU's
> capabilities being reflected in its IDR/IIDR registers, without a
> coldplug device

As Daniel was saying this all has to be specifiable on the command
line.

IMHO if the vSMMU is not fully specified by the time the boot happens
(either explicitly via the command line or implicitly by querying the
live HW) then qemu should fail.

Jason
On Thu, Feb 06, 2025 at 04:38:55PM -0400, Jason Gunthorpe wrote: > On Thu, Feb 06, 2025 at 12:33:19PM -0800, Nicolin Chen wrote: > > On Thu, Feb 06, 2025 at 02:22:01PM -0400, Jason Gunthorpe wrote: > > > On Thu, Feb 06, 2025 at 06:18:14PM +0000, Shameerali Kolothum Thodi wrote: > > > > > > > > So even if you invent an iommu ID we cannot accept it as a handle to > > > > > create viommu in iommufd. > > > > > > > > Creating the vIOMMU only happens when the user does a cold/hot plug of > > > > a VFIO device. At that time Qemu checks whether the assigned id matches > > > > with whatever the kernel tell it. > > > > > > This is not hard up until the guest is started. If you boot a guest > > > without a backing viommu iommufd object then there will be some more > > > complexities. > > > > Yea, I imagined that things would be complicated with hotplugs.. > > > > On one hand, I got the part that we need some fixed link forehand > > to ease migration/hotplugs. > > > > On the other hand, all IOMMUFD ioctls need a VFIO device FD, which > > brings the immediate attention that we cannot even decide vSMMU's > > capabilities being reflected in its IDR/IIDR registers, without a > > coldplug device > > As Daniel was saying this all has to be specifiable on the command > line. > > IMHO if the vSMMU is not fully specified by the time the boot happens > (either explicity via command line or implicitly by querying the live > HW) then it qemu should fail. Though that makes sense, that would assume we could only support the case where a VM has at least one cold plug device per vSMMU? Otherwise, even if we specify vSMMU to which pSMMU via a command line, we can't get access to the pSMMU via IOMMU_GET_HW_INFO.. Thanks Nicolin
On Thu, Feb 06, 2025 at 12:48:40PM -0800, Nicolin Chen wrote: > On Thu, Feb 06, 2025 at 04:38:55PM -0400, Jason Gunthorpe wrote: > > On Thu, Feb 06, 2025 at 12:33:19PM -0800, Nicolin Chen wrote: > > > On Thu, Feb 06, 2025 at 02:22:01PM -0400, Jason Gunthorpe wrote: > > > > On Thu, Feb 06, 2025 at 06:18:14PM +0000, Shameerali Kolothum Thodi wrote: > > > > > > > > > > So even if you invent an iommu ID we cannot accept it as a handle to > > > > > > create viommu in iommufd. > > > > > > > > > > Creating the vIOMMU only happens when the user does a cold/hot plug of > > > > > a VFIO device. At that time Qemu checks whether the assigned id matches > > > > > with whatever the kernel tell it. > > > > > > > > This is not hard up until the guest is started. If you boot a guest > > > > without a backing viommu iommufd object then there will be some more > > > > complexities. > > > > > > Yea, I imagined that things would be complicated with hotplugs.. > > > > > > On one hand, I got the part that we need some fixed link forehand > > > to ease migration/hotplugs. > > > > > > On the other hand, all IOMMUFD ioctls need a VFIO device FD, which > > > brings the immediate attention that we cannot even decide vSMMU's > > > capabilities being reflected in its IDR/IIDR registers, without a > > > coldplug device > > > > As Daniel was saying this all has to be specifiable on the command > > line. > > > > IMHO if the vSMMU is not fully specified by the time the boot happens > > (either explicity via command line or implicitly by querying the live > > HW) then it qemu should fail. > > Though that makes sense, that would assume we could only support > the case where a VM has at least one cold plug device per vSMMU? > > Otherwise, even if we specify vSMMU to which pSMMU via a command > line, we can't get access to the pSMMU via IOMMU_GET_HW_INFO.. You'd use the command line information and wouldn't need GET_HW_INFO, it would be complicated Jason
On Thu, Feb 06, 2025 at 05:11:13PM -0400, Jason Gunthorpe wrote: > On Thu, Feb 06, 2025 at 12:48:40PM -0800, Nicolin Chen wrote: > > On Thu, Feb 06, 2025 at 04:38:55PM -0400, Jason Gunthorpe wrote: > > > On Thu, Feb 06, 2025 at 12:33:19PM -0800, Nicolin Chen wrote: > > > > On Thu, Feb 06, 2025 at 02:22:01PM -0400, Jason Gunthorpe wrote: > > > > > On Thu, Feb 06, 2025 at 06:18:14PM +0000, Shameerali Kolothum Thodi wrote: > > > > > > > > > > > > So even if you invent an iommu ID we cannot accept it as a handle to > > > > > > > create viommu in iommufd. > > > > > > > > > > > > Creating the vIOMMU only happens when the user does a cold/hot plug of > > > > > > a VFIO device. At that time Qemu checks whether the assigned id matches > > > > > > with whatever the kernel tell it. > > > > > > > > > > This is not hard up until the guest is started. If you boot a guest > > > > > without a backing viommu iommufd object then there will be some more > > > > > complexities. > > > > > > > > Yea, I imagined that things would be complicated with hotplugs.. > > > > > > > > On one hand, I got the part that we need some fixed link forehand > > > > to ease migration/hotplugs. > > > > > > > > On the other hand, all IOMMUFD ioctls need a VFIO device FD, which > > > > brings the immediate attention that we cannot even decide vSMMU's > > > > capabilities being reflected in its IDR/IIDR registers, without a > > > > coldplug device > > > > > > As Daniel was saying this all has to be specifiable on the command > > > line. > > > > > > IMHO if the vSMMU is not fully specified by the time the boot happens > > > (either explicity via command line or implicitly by querying the live > > > HW) then it qemu should fail. > > > > Though that makes sense, that would assume we could only support > > the case where a VM has at least one cold plug device per vSMMU? > > > > Otherwise, even if we specify vSMMU to which pSMMU via a command > > line, we can't get access to the pSMMU via IOMMU_GET_HW_INFO.. > > You'd use the command line information and wouldn't need GET_HW_INFO, > it would be complicated Do you mean the "-device arm-smmuv3-accel,id=xx" line? This still won't give us the host IDR/IIDR register values to probe a vSMMU, unless it has a VFIO device assigned to vSMMU's associated PXB in that command line? Nicolin
On Thu, Feb 06, 2025 at 02:46:42PM -0800, Nicolin Chen wrote:
> > You'd use the command line information and wouldn't need GET_HW_INFO,
> > it would be complicated
>
> Do you mean the "-device arm-smmuv3-accel,id=xx" line? This still
> won't give us the host IDR/IIDR register values to probe a vSMMU,
> unless it has a VFIO device assigned to vSMMU's associated PXB in
> that command line?

Yes, put the IDR registers on the command line too.

Nothing from the host should be copied to the guest without the option
to control it through the command line.

Jason
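[Editor's note: as a rough illustration of what "IDR registers on the command
line" could mean on the QEMU side, here is a hedged sketch. The idr*/iidr
property names, the state layout and the "fully specified or fail" check are
assumptions for discussion, not code from the posted series.]

/*
 * Hypothetical sketch: expose the registers a vSMMU mirrors as plain device
 * properties, so a "fully specified" instance needs no host query at all.
 * Falling back to IOMMU_GET_HW_INFO ("inherit from host") only works when a
 * cold-plugged VFIO device exists; otherwise startup should fail.
 */
#include "qemu/osdep.h"
#include "hw/qdev-properties.h"
#include "qapi/error.h"

typedef struct SMMUv3AccelState {
    uint32_t idr[6];
    uint32_t iidr;
    bool inherited_from_host;   /* filled in via IOMMU_GET_HW_INFO */
} SMMUv3AccelState;

static Property smmuv3_accel_idr_properties[] = {
    DEFINE_PROP_UINT32("idr0", SMMUv3AccelState, idr[0], 0),
    DEFINE_PROP_UINT32("idr1", SMMUv3AccelState, idr[1], 0),
    DEFINE_PROP_UINT32("idr3", SMMUv3AccelState, idr[3], 0),
    DEFINE_PROP_UINT32("idr5", SMMUv3AccelState, idr[5], 0),
    DEFINE_PROP_UINT32("iidr", SMMUv3AccelState, iidr, 0),
    DEFINE_PROP_END_OF_LIST(),
};

/*
 * Crude "fully specified or fail" check at machine-done time: if nothing
 * pinned the registers down (no command-line values, no cold-plugged VFIO
 * device to inherit from), refuse to show the guest a made-up SMMU.
 */
static void smmuv3_accel_check_specified(SMMUv3AccelState *s, Error **errp)
{
    if (!s->inherited_from_host && !s->iidr) {
        error_setg(errp, "vSMMU not fully specified: set idr*/iidr on the "
                   "command line or cold-plug a VFIO device behind it");
    }
}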
> -----Original Message----- > From: Shameerali Kolothum Thodi > Sent: Thursday, January 30, 2025 6:09 PM > To: 'Daniel P. Berrangé' <berrange@redhat.com> > Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org; > eric.auger@redhat.com; peter.maydell@linaro.org; jgg@nvidia.com; > nicolinc@nvidia.com; ddutile@redhat.com; Linuxarm > <linuxarm@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>; > jiangkunkun <jiangkunkun@huawei.com>; Jonathan Cameron > <jonathan.cameron@huawei.com>; zhangfei.gao@linaro.org > Subject: RE: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable > nested SMMUv3 > > Hi Daniel, > > > -----Original Message----- > > From: Daniel P. Berrangé <berrange@redhat.com> > > Sent: Thursday, January 30, 2025 4:00 PM > > To: Shameerali Kolothum Thodi > <shameerali.kolothum.thodi@huawei.com> > > Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org; > > eric.auger@redhat.com; peter.maydell@linaro.org; jgg@nvidia.com; > > nicolinc@nvidia.com; ddutile@redhat.com; Linuxarm > > <linuxarm@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>; > > jiangkunkun <jiangkunkun@huawei.com>; Jonathan Cameron > > <jonathan.cameron@huawei.com>; zhangfei.gao@linaro.org > > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable > > nested SMMUv3 > > > > On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum via wrote: > > > How to use it(Eg:): > > > > > > On a HiSilicon platform that has multiple physical SMMUv3s, the ACC > ZIP > > VF > > > devices and HNS VF devices are behind different SMMUv3s. So for a > > Guest, > > > specify two smmuv3-nested devices each behind a pxb-pcie as below, > > > > > > ./qemu-system-aarch64 -machine virt,gic-version=3,default-bus-bypass- > > iommu=on \ > > > -enable-kvm -cpu host -m 4G -smp cpus=8,maxcpus=8 \ > > > -object iommufd,id=iommufd0 \ > > > -bios QEMU_EFI.fd \ > > > -kernel Image \ > > > -device virtio-blk-device,drive=fs \ > > > -drive if=none,file=rootfs.qcow2,id=fs \ > > > -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \ > > > -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \ > > > -device arm-smmuv3-nested,id=smmuv1,pci-bus=pcie.1 \ > > > -device vfio-pci,host=0000:7d:02.1,bus=pcie.port1,iommufd=iommufd0 \ > > > -device pxb-pcie,id=pcie.2,bus_nr=16,bus=pcie.0 \ > > > -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2 \ > > > -device arm-smmuv3-nested,id=smmuv2,pci-bus=pcie.2 \ > > > -device vfio-pci,host=0000:75:00.1,bus=pcie.port2,iommufd=iommufd0 \ > > > -append "rdinit=init console=ttyAMA0 root=/dev/vda2 rw > > earlycon=pl011,0x9000000" \ > > > -device virtio-9p-pci,fsdev=p9fs2,mount_tag=p9,bus=pcie.0 \ > > > -fsdev local,id=p9fs2,path=p9root,security_model=mapped \ > > > -net none \ > > > -nographic > > > > Above you say the host has 2 SMMUv3 devices, and you've created 2 > > SMMUv3 > > guest devices to match. > > > > The various emails in this thread & libvirt thread, indicate that each > > guest SMMUv3 is associated with a host SMMUv3, but I don't see any > > property on the command line for 'arm-ssmv3-nested' that tells it which > > host eSMMUv3 it is to be associated with. > > > > How does this association work ? > > You are right. The association is not very obvious in Qemu. The association > and checking is done implicitly by kernel at the moment. I will try to > explain > it here. > > Each "arm-smmuv3-nested" instance, when the first device gets attached > to it, will create a S2 HWPT and a corresponding SMMUv3 domain in kernel > SMMUv3 driver. 
This domain will have a pointer representing the physical > SMMUv3 that the device belongs. And any other device which belongs to > the same physical SMMUv3 can share this S2 domain. > > If a device that belongs to a different physical SMMUv3 gets attached to > the above domain, the HWPT attach will eventually fail as the physical > smmuv3 in the domains will have a mismatch, > https://elixir.bootlin.com/linux/v6.13/source/drivers/iommu/arm/arm- > smmu-v3/arm-smmu-v3.c#L2860 > > And as I mentioned in cover letter, Qemu will report, > > " > Attempt to add the HNS VF to a different SMMUv3 will result in, > > -device vfio-pci,host=0000:7d:02.2,bus=pcie.port3,iommufd=iommufd0: > Unable to attach viommu > -device vfio-pci,host=0000:7d:02.2,bus=pcie.port3,iommufd=iommufd0: vfio > 0000:7d:02.2: > Failed to set iommu_device: [iommufd=29] error attach 0000:7d:02.2 (38) > to id=11: Invalid argument > > At present Qemu is not doing any extra validation other than the above > failure to make sure the user configuration is correct or not. The > assumption is libvirt will take care of this. > " > So in summary, if the libvirt gets it wrong, Qemu will fail with error. > > If a more explicit association is required, some help from kernel is required > to identify the physical SMMUv3 associated with the device. Again thinking about this, to have an explicit association in the Qemu command line between the vSMMUv3 and the phys smmuv3, We can possibly add something like, -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \ -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \ -device arm-smmuv3-accel,bus=pcie.1,phys-smmuv3= smmu3.0x0000000100000000 \ -device vfio-pci,host=0000:7d:02.1,bus=pcie.port1,iommufd=iommufd0 \ -device pxb-pcie,id=pcie.2,bus_nr=16,bus=pcie.0 \ -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2 \ -device arm-smmuv3-nested,id=smmuv2,pci-bus=pcie.2, phys-smmuv3= smmu3.0x0000000200000000 \ -device vfio-pci,host=0000:75:00.1,bus=pcie.port2,iommufd=iommufd0 \ etc. And Qemu does some checking to make sure that the device is indeed associated with the specified phys-smmuv3. This can be done going through the sysfs path checking which is what I guess libvirt is currently doing to populate the topology. So basically Qemu is just replicating that to validate again. Or another option is extending the IOMMU_GET_HW_INFO IOCTL to return the phys smmuv3 base address which can avoid going through the sysfs. The only difference between the current approach(kernel failing the attach implicitly) and the above is, Qemu can provide a validation of inputs and may be report a better error message than just saying " Unable to attach viommu/: Invalid argument". If the command line looks Ok, I will go with the sysfs path validation method first in my next respin. Please let me know. Thanks, Shameer
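[Editor's note: for what it is worth, the sysfs walk proposed above could be as
small as resolving the per-device "iommu" link that the iommu core publishes.
Treat the path layout and link name as assumptions about the host kernel, not
as code from the series.]

/*
 * Sketch of the sysfs validation idea: resolve the "iommu" symlink of the
 * assigned PCI device and compare its basename against the phys-smmuv3
 * property (e.g. "smmu3.0x0000000100000000").
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <libgen.h>
#include <limits.h>

/* Return 0 if the device at @bdf sits behind the physical SMMU @phys_smmu. */
static int check_phys_smmu(const char *bdf, const char *phys_smmu)
{
    char link[PATH_MAX], target[PATH_MAX];
    ssize_t n;

    snprintf(link, sizeof(link), "/sys/bus/pci/devices/%s/iommu", bdf);
    n = readlink(link, target, sizeof(target) - 1);
    if (n < 0) {
        return -1;                  /* no IOMMU behind this device */
    }
    target[n] = '\0';
    /* e.g. target ends in ".../iommu/smmu3.0x0000000100000000" */
    return strcmp(basename(target), phys_smmu) == 0 ? 0 : -1;
}

/* usage: check_phys_smmu("0000:7d:02.1", "smmu3.0x0000000100000000") */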
On Fri, Jan 31, 2025 at 09:33:16AM +0000, Shameerali Kolothum Thodi wrote: > And Qemu does some checking to make sure that the device is indeed associated > with the specified phys-smmuv3. This can be done going through the sysfs path checking > which is what I guess libvirt is currently doing to populate the topology. So basically > Qemu is just replicating that to validate again. I would prefer that iommufd users not have to go out to sysfs.. > Or another option is extending the IOMMU_GET_HW_INFO IOCTL to return the phys > smmuv3 base address which can avoid going through the sysfs. It also doesn't seem great to expose a physical address. But we could have an 'iommu instance id' that was a unique small integer? Jason
> -----Original Message----- > From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, January 31, 2025 2:24 PM > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> > Cc: Daniel P. Berrangé <berrange@redhat.com>; qemu-arm@nongnu.org; > qemu-devel@nongnu.org; eric.auger@redhat.com; > peter.maydell@linaro.org; nicolinc@nvidia.com; ddutile@redhat.com; > Linuxarm <linuxarm@huawei.com>; Wangzhou (B) > <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>; > Jonathan Cameron <jonathan.cameron@huawei.com>; > zhangfei.gao@linaro.org; Nathan Chen <nathanc@nvidia.com> > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable > nested SMMUv3 > > On Fri, Jan 31, 2025 at 09:33:16AM +0000, Shameerali Kolothum Thodi > wrote: > > > And Qemu does some checking to make sure that the device is indeed > associated > > with the specified phys-smmuv3. This can be done going through the > sysfs path checking > > which is what I guess libvirt is currently doing to populate the topology. > So basically > > Qemu is just replicating that to validate again. > > I would prefer that iommufd users not have to go out to sysfs.. > > > Or another option is extending the IOMMU_GET_HW_INFO IOCTL to > return the phys > > smmuv3 base address which can avoid going through the sysfs. > > It also doesn't seem great to expose a physical address. But we could > have an 'iommu instance id' that was a unique small integer? Ok. But how the user space can map that to the device? Something like, /sys/bus/pci/devices/0000:7d:00.1/iommu/instance.X ? Thanks, Shameer
On Fri, Jan 31, 2025 at 02:39:53PM +0000, Shameerali Kolothum Thodi wrote:
> > > And Qemu does some checking to make sure that the device is indeed
> > > associated with the specified phys-smmuv3. This can be done going
> > > through the sysfs path checking which is what I guess libvirt is
> > > currently doing to populate the topology. So basically Qemu is just
> > > replicating that to validate again.
> >
> > I would prefer that iommufd users not have to go out to sysfs..
> >
> > > Or another option is extending the IOMMU_GET_HW_INFO IOCTL to return
> > > the phys smmuv3 base address which can avoid going through the sysfs.
> >
> > It also doesn't seem great to expose a physical address. But we could
> > have an 'iommu instance id' that was a unique small integer?
>
> Ok. But how the user space can map that to the device?

Why does it need to?

libvirt picks some label for the vsmmu instance, it doesn't matter
what the string is.

qemu validates that all of the vsmmu instances are only linked to PCI
devices that have the same iommu ID. This is already happening in the
kernel, it will fail attaches to mismatched instances.

Nothing further is needed?

Jason
> -----Original Message----- > From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, January 31, 2025 2:54 PM > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> > Cc: Daniel P. Berrangé <berrange@redhat.com>; qemu-arm@nongnu.org; > qemu-devel@nongnu.org; eric.auger@redhat.com; > peter.maydell@linaro.org; nicolinc@nvidia.com; ddutile@redhat.com; > Linuxarm <linuxarm@huawei.com>; Wangzhou (B) > <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>; > Jonathan Cameron <jonathan.cameron@huawei.com>; > zhangfei.gao@linaro.org; Nathan Chen <nathanc@nvidia.com> > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable > nested SMMUv3 > > On Fri, Jan 31, 2025 at 02:39:53PM +0000, Shameerali Kolothum Thodi > wrote: > > > > > And Qemu does some checking to make sure that the device is indeed > > > associated > > > > with the specified phys-smmuv3. This can be done going through the > > > sysfs path checking > > > > which is what I guess libvirt is currently doing to populate the > topology. > > > So basically > > > > Qemu is just replicating that to validate again. > > > > > > I would prefer that iommufd users not have to go out to sysfs.. > > > > > > > Or another option is extending the IOMMU_GET_HW_INFO IOCTL to > > > return the phys > > > > smmuv3 base address which can avoid going through the sysfs. > > > > > > It also doesn't seem great to expose a physical address. But we could > > > have an 'iommu instance id' that was a unique small integer? > > > > Ok. But how the user space can map that to the device? > > Why does it need to? > > libvirt picks some label for the vsmmu instance, it doesn't matter > what the string is. > > qemu validates that all of the vsmmu instances are only linked to PCI > device that have the same iommu ID. This is already happening in the > kernel, it will fail attaches to mismatched instances. > > Nothing further is needed? -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \ -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \ -device arm-smmuv3-accel,bus=pcie.1,id=smmuv1 \ -device vfio-pci,host=0000:7d:02.1,bus=pcie.port1,iommufd=iommufd0 \ -device pxb-pcie,id=pcie.2,bus_nr=16,bus=pcie.0 \ -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2 \ -device arm-smmuv3-accel,pci-bus=pcie.2,id=smmuv2 \ -device vfio-pci,host=0000:75:00.1,bus=pcie.port2,iommufd=iommufd0 \ I think it works from a functionality point of view. A particular instance of arm-smmuv3-accel(say id=smmuv1) can only have devices attached to the same phys smmuv3 "iommu instance id" But not sure from a libvirt/Qemu interface point of view[0] the concerns are addressed. Daniel/Nathan? Thanks, Shameer https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/message/X6R52JRBYDFZ5PSJFR534A655UZ3RHKN/
Hi, On 1/31/25 4:23 PM, Shameerali Kolothum Thodi wrote: > >> -----Original Message----- >> From: Jason Gunthorpe <jgg@nvidia.com> >> Sent: Friday, January 31, 2025 2:54 PM >> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> >> Cc: Daniel P. Berrangé <berrange@redhat.com>; qemu-arm@nongnu.org; >> qemu-devel@nongnu.org; eric.auger@redhat.com; >> peter.maydell@linaro.org; nicolinc@nvidia.com; ddutile@redhat.com; >> Linuxarm <linuxarm@huawei.com>; Wangzhou (B) >> <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>; >> Jonathan Cameron <jonathan.cameron@huawei.com>; >> zhangfei.gao@linaro.org; Nathan Chen <nathanc@nvidia.com> >> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable >> nested SMMUv3 >> >> On Fri, Jan 31, 2025 at 02:39:53PM +0000, Shameerali Kolothum Thodi >> wrote: >> >>>>> And Qemu does some checking to make sure that the device is indeed >>>> associated >>>>> with the specified phys-smmuv3. This can be done going through the >>>> sysfs path checking >>>>> which is what I guess libvirt is currently doing to populate the >> topology. >>>> So basically >>>>> Qemu is just replicating that to validate again. >>>> I would prefer that iommufd users not have to go out to sysfs.. >>>> >>>>> Or another option is extending the IOMMU_GET_HW_INFO IOCTL to >>>> return the phys >>>>> smmuv3 base address which can avoid going through the sysfs. >>>> It also doesn't seem great to expose a physical address. But we could >>>> have an 'iommu instance id' that was a unique small integer? >>> Ok. But how the user space can map that to the device? >> Why does it need to? >> >> libvirt picks some label for the vsmmu instance, it doesn't matter >> what the string is. >> >> qemu validates that all of the vsmmu instances are only linked to PCI >> device that have the same iommu ID. This is already happening in the >> kernel, it will fail attaches to mismatched instances. >> >> Nothing further is needed? > -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \ > -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \ > -device arm-smmuv3-accel,bus=pcie.1,id=smmuv1 \ I don't get what is the point of adding such an id if it is not referenced anywhere? Eric > -device vfio-pci,host=0000:7d:02.1,bus=pcie.port1,iommufd=iommufd0 \ > > -device pxb-pcie,id=pcie.2,bus_nr=16,bus=pcie.0 \ > -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2 \ > -device arm-smmuv3-accel,pci-bus=pcie.2,id=smmuv2 \ > -device vfio-pci,host=0000:75:00.1,bus=pcie.port2,iommufd=iommufd0 \ > > I think it works from a functionality point of view. A particular > instance of arm-smmuv3-accel(say id=smmuv1) can only have devices attached > to the same phys smmuv3 "iommu instance id" > > But not sure from a libvirt/Qemu interface point of view[0] the concerns > are addressed. Daniel/Nathan? > > Thanks, > Shameer > https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/message/X6R52JRBYDFZ5PSJFR534A655UZ3RHKN/ >
On Fri, Jan 31, 2025 at 05:08:28PM +0100, Eric Auger wrote: > Hi, > > > On 1/31/25 4:23 PM, Shameerali Kolothum Thodi wrote: > > > >> -----Original Message----- > >> From: Jason Gunthorpe <jgg@nvidia.com> > >> Sent: Friday, January 31, 2025 2:54 PM > >> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> > >> Cc: Daniel P. Berrangé <berrange@redhat.com>; qemu-arm@nongnu.org; > >> qemu-devel@nongnu.org; eric.auger@redhat.com; > >> peter.maydell@linaro.org; nicolinc@nvidia.com; ddutile@redhat.com; > >> Linuxarm <linuxarm@huawei.com>; Wangzhou (B) > >> <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>; > >> Jonathan Cameron <jonathan.cameron@huawei.com>; > >> zhangfei.gao@linaro.org; Nathan Chen <nathanc@nvidia.com> > >> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable > >> nested SMMUv3 > >> > >> On Fri, Jan 31, 2025 at 02:39:53PM +0000, Shameerali Kolothum Thodi > >> wrote: > >> > >>>>> And Qemu does some checking to make sure that the device is indeed > >>>> associated > >>>>> with the specified phys-smmuv3. This can be done going through the > >>>> sysfs path checking > >>>>> which is what I guess libvirt is currently doing to populate the > >> topology. > >>>> So basically > >>>>> Qemu is just replicating that to validate again. > >>>> I would prefer that iommufd users not have to go out to sysfs.. > >>>> > >>>>> Or another option is extending the IOMMU_GET_HW_INFO IOCTL to > >>>> return the phys > >>>>> smmuv3 base address which can avoid going through the sysfs. > >>>> It also doesn't seem great to expose a physical address. But we could > >>>> have an 'iommu instance id' that was a unique small integer? > >>> Ok. But how the user space can map that to the device? > >> Why does it need to? > >> > >> libvirt picks some label for the vsmmu instance, it doesn't matter > >> what the string is. > >> > >> qemu validates that all of the vsmmu instances are only linked to PCI > >> device that have the same iommu ID. This is already happening in the > >> kernel, it will fail attaches to mismatched instances. > >> > >> Nothing further is needed? > > -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \ > > -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \ > > -device arm-smmuv3-accel,bus=pcie.1,id=smmuv1 \ > I don't get what is the point of adding such an id if it is not > referenced anywhere? Every QDev device instance has an 'id' property - if you don't set one explicitly, QEMU will generate one internally. Libvirt will always set the 'id' property to avoid the internal auto- generated IDs, as it wants full knowledge of naming. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On 2/6/25 9:53 AM, Daniel P. Berrangé wrote: > On Fri, Jan 31, 2025 at 05:08:28PM +0100, Eric Auger wrote: >> Hi, >> >> >> On 1/31/25 4:23 PM, Shameerali Kolothum Thodi wrote: >>>> -----Original Message----- >>>> From: Jason Gunthorpe <jgg@nvidia.com> >>>> Sent: Friday, January 31, 2025 2:54 PM >>>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> >>>> Cc: Daniel P. Berrangé <berrange@redhat.com>; qemu-arm@nongnu.org; >>>> qemu-devel@nongnu.org; eric.auger@redhat.com; >>>> peter.maydell@linaro.org; nicolinc@nvidia.com; ddutile@redhat.com; >>>> Linuxarm <linuxarm@huawei.com>; Wangzhou (B) >>>> <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>; >>>> Jonathan Cameron <jonathan.cameron@huawei.com>; >>>> zhangfei.gao@linaro.org; Nathan Chen <nathanc@nvidia.com> >>>> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable >>>> nested SMMUv3 >>>> >>>> On Fri, Jan 31, 2025 at 02:39:53PM +0000, Shameerali Kolothum Thodi >>>> wrote: >>>> >>>>>>> And Qemu does some checking to make sure that the device is indeed >>>>>> associated >>>>>>> with the specified phys-smmuv3. This can be done going through the >>>>>> sysfs path checking >>>>>>> which is what I guess libvirt is currently doing to populate the >>>> topology. >>>>>> So basically >>>>>>> Qemu is just replicating that to validate again. >>>>>> I would prefer that iommufd users not have to go out to sysfs.. >>>>>> >>>>>>> Or another option is extending the IOMMU_GET_HW_INFO IOCTL to >>>>>> return the phys >>>>>>> smmuv3 base address which can avoid going through the sysfs. >>>>>> It also doesn't seem great to expose a physical address. But we could >>>>>> have an 'iommu instance id' that was a unique small integer? >>>>> Ok. But how the user space can map that to the device? >>>> Why does it need to? >>>> >>>> libvirt picks some label for the vsmmu instance, it doesn't matter >>>> what the string is. >>>> >>>> qemu validates that all of the vsmmu instances are only linked to PCI >>>> device that have the same iommu ID. This is already happening in the >>>> kernel, it will fail attaches to mismatched instances. >>>> >>>> Nothing further is needed? >>> -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \ >>> -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \ >>> -device arm-smmuv3-accel,bus=pcie.1,id=smmuv1 \ >> I don't get what is the point of adding such an id if it is not >> referenced anywhere? > Every QDev device instance has an 'id' property - if you don't > set one explicitly, QEMU will generate one internally. Libvirt > will always set the 'id' property to avoid the internal auto- > generated IDs, as it wants full knowledge of naming. OK thank you for the explanation Eric > > With regards, > Daniel
On 1/31/2025 8:08 AM, Eric Auger wrote: >>>>>> And Qemu does some checking to make sure that the device is indeed >>>>> associated >>>>>> with the specified phys-smmuv3. This can be done going through the >>>>> sysfs path checking >>>>>> which is what I guess libvirt is currently doing to populate the >>> topology. >>>>> So basically >>>>>> Qemu is just replicating that to validate again. >>>>> I would prefer that iommufd users not have to go out to sysfs.. >>>>> >>>>>> Or another option is extending the IOMMU_GET_HW_INFO IOCTL to >>>>> return the phys >>>>>> smmuv3 base address which can avoid going through the sysfs. >>>>> It also doesn't seem great to expose a physical address. But we could >>>>> have an 'iommu instance id' that was a unique small integer? >>>> Ok. But how the user space can map that to the device? >>> Why does it need to? >>> >>> libvirt picks some label for the vsmmu instance, it doesn't matter >>> what the string is. >>> >>> qemu validates that all of the vsmmu instances are only linked to PCI >>> device that have the same iommu ID. This is already happening in the >>> kernel, it will fail attaches to mismatched instances. >>> >>> Nothing further is needed? >> -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \ >> -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \ >> -device arm-smmuv3-accel,bus=pcie.1,id=smmuv1 \ > I don't get what is the point of adding such an id if it is not > referenced anywhere? > > Eric Daniel mentions that the host-to-guest SMMU pairing must be chosen such that it makes conceptual sense w.r.t. the guest NUMA to host NUMA pairing [0]. The current implementation allows for incorrect host to guest numa node pairings, e.g. pSMMU has affinity to host numa node 0, but it’s paired with a vSMMU paired with a guest numa node pinned to host numa node 1. By specifying the host SMMU id, we can explicitly pair a host SMMU with a guest SMMU associated with the correct PXB NUMA node, vs. implying the host-to-guest SMMU pairing based on what devices are attached to the PXB. While it would not completely prevent the incorrect pSMMU/vSMMU pairing w.r.t. host to guest numa node pairings, specifying the pSMMU id would make the implications of host to guest numa node pairings more clear when specifying a vSMMU instance. From the libvirt discussion with Daniel [1], he also states "libvirt's goal has always been to make everything that's functionally impacting a guest device be 100% explicit. So I don't think we should be implying mappings to the host SMMU in QEMU at all, QEMU must be told what to map to." Specifying the id would be a means of explicitly specifying host to guest SMMU mapping instead of implying the mapping. [0] https://lore.kernel.org/qemu-devel/Z51DmtP83741RAsb@redhat.com/ [1] https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/7GDT6RX5LPAJMPP4ZSC4ACME6GVMG236/#X6R52JRBYDFZ5PSJFR534A655UZ3RHKN Thanks, Nathan
On Wed, Feb 05, 2025 at 12:53:42PM -0800, Nathan Chen wrote: > > > On 1/31/2025 8:08 AM, Eric Auger wrote: > > > > > > > And Qemu does some checking to make sure that the device is indeed > > > > > > associated > > > > > > > with the specified phys-smmuv3. This can be done going through the > > > > > > sysfs path checking > > > > > > > which is what I guess libvirt is currently doing to populate the > > > > topology. > > > > > > So basically > > > > > > > Qemu is just replicating that to validate again. > > > > > > I would prefer that iommufd users not have to go out to sysfs.. > > > > > > > > > > > > > Or another option is extending the IOMMU_GET_HW_INFO IOCTL to > > > > > > return the phys > > > > > > > smmuv3 base address which can avoid going through the sysfs. > > > > > > It also doesn't seem great to expose a physical address. But we could > > > > > > have an 'iommu instance id' that was a unique small integer? > > > > > Ok. But how the user space can map that to the device? > > > > Why does it need to? > > > > > > > > libvirt picks some label for the vsmmu instance, it doesn't matter > > > > what the string is. > > > > > > > > qemu validates that all of the vsmmu instances are only linked to PCI > > > > device that have the same iommu ID. This is already happening in the > > > > kernel, it will fail attaches to mismatched instances. > > > > > > > > Nothing further is needed? > > > -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \ > > > -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \ > > > -device arm-smmuv3-accel,bus=pcie.1,id=smmuv1 \ > > I don't get what is the point of adding such an id if it is not > > referenced anywhere? > > > > Eric > > Daniel mentions that the host-to-guest SMMU pairing must be chosen such that > it makes conceptual sense w.r.t. the guest NUMA to host NUMA pairing [0]. > The current implementation allows for incorrect host to guest numa node > pairings, e.g. pSMMU has affinity to host numa node 0, but it’s paired with > a vSMMU paired with a guest numa node pinned to host numa node 1. > > By specifying the host SMMU id, we can explicitly pair a host SMMU with a > guest SMMU associated with the correct PXB NUMA node, vs. implying the > host-to-guest SMMU pairing based on what devices are attached to the PXB. > While it would not completely prevent the incorrect pSMMU/vSMMU pairing > w.r.t. host to guest numa node pairings, specifying the pSMMU id would make > the implications of host to guest numa node pairings more clear when > specifying a vSMMU instance. You've not specified any host SMMU id in the above CLI args though, only the PXB association. It needs something like -device arm-smmuv3-accel,bus=pcie.1,id=smmuv1,host-smmu=XXXXX where 'XXXX' is some value to identify the host SMMU With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
Hi Shameer, On 1/31/25 10:33 AM, Shameerali Kolothum Thodi wrote: > >> -----Original Message----- >> From: Shameerali Kolothum Thodi >> Sent: Thursday, January 30, 2025 6:09 PM >> To: 'Daniel P. Berrangé' <berrange@redhat.com> >> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org; >> eric.auger@redhat.com; peter.maydell@linaro.org; jgg@nvidia.com; >> nicolinc@nvidia.com; ddutile@redhat.com; Linuxarm >> <linuxarm@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>; >> jiangkunkun <jiangkunkun@huawei.com>; Jonathan Cameron >> <jonathan.cameron@huawei.com>; zhangfei.gao@linaro.org >> Subject: RE: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable >> nested SMMUv3 >> >> Hi Daniel, >> >>> -----Original Message----- >>> From: Daniel P. Berrangé <berrange@redhat.com> >>> Sent: Thursday, January 30, 2025 4:00 PM >>> To: Shameerali Kolothum Thodi >> <shameerali.kolothum.thodi@huawei.com> >>> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org; >>> eric.auger@redhat.com; peter.maydell@linaro.org; jgg@nvidia.com; >>> nicolinc@nvidia.com; ddutile@redhat.com; Linuxarm >>> <linuxarm@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>; >>> jiangkunkun <jiangkunkun@huawei.com>; Jonathan Cameron >>> <jonathan.cameron@huawei.com>; zhangfei.gao@linaro.org >>> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable >>> nested SMMUv3 >>> >>> On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum via wrote: >>>> How to use it(Eg:): >>>> >>>> On a HiSilicon platform that has multiple physical SMMUv3s, the ACC >> ZIP >>> VF >>>> devices and HNS VF devices are behind different SMMUv3s. So for a >>> Guest, >>>> specify two smmuv3-nested devices each behind a pxb-pcie as below, >>>> >>>> ./qemu-system-aarch64 -machine virt,gic-version=3,default-bus-bypass- >>> iommu=on \ >>>> -enable-kvm -cpu host -m 4G -smp cpus=8,maxcpus=8 \ >>>> -object iommufd,id=iommufd0 \ >>>> -bios QEMU_EFI.fd \ >>>> -kernel Image \ >>>> -device virtio-blk-device,drive=fs \ >>>> -drive if=none,file=rootfs.qcow2,id=fs \ >>>> -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \ >>>> -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \ >>>> -device arm-smmuv3-nested,id=smmuv1,pci-bus=pcie.1 \ >>>> -device vfio-pci,host=0000:7d:02.1,bus=pcie.port1,iommufd=iommufd0 \ >>>> -device pxb-pcie,id=pcie.2,bus_nr=16,bus=pcie.0 \ >>>> -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2 \ >>>> -device arm-smmuv3-nested,id=smmuv2,pci-bus=pcie.2 \ >>>> -device vfio-pci,host=0000:75:00.1,bus=pcie.port2,iommufd=iommufd0 \ >>>> -append "rdinit=init console=ttyAMA0 root=/dev/vda2 rw >>> earlycon=pl011,0x9000000" \ >>>> -device virtio-9p-pci,fsdev=p9fs2,mount_tag=p9,bus=pcie.0 \ >>>> -fsdev local,id=p9fs2,path=p9root,security_model=mapped \ >>>> -net none \ >>>> -nographic >>> Above you say the host has 2 SMMUv3 devices, and you've created 2 >>> SMMUv3 >>> guest devices to match. >>> >>> The various emails in this thread & libvirt thread, indicate that each >>> guest SMMUv3 is associated with a host SMMUv3, but I don't see any >>> property on the command line for 'arm-ssmv3-nested' that tells it which >>> host eSMMUv3 it is to be associated with. >>> >>> How does this association work ? >> You are right. The association is not very obvious in Qemu. The association >> and checking is done implicitly by kernel at the moment. I will try to >> explain >> it here. >> >> Each "arm-smmuv3-nested" instance, when the first device gets attached >> to it, will create a S2 HWPT and a corresponding SMMUv3 domain in kernel >> SMMUv3 driver. 
This domain will have a pointer representing the physical >> SMMUv3 that the device belongs. And any other device which belongs to >> the same physical SMMUv3 can share this S2 domain. >> >> If a device that belongs to a different physical SMMUv3 gets attached to >> the above domain, the HWPT attach will eventually fail as the physical >> smmuv3 in the domains will have a mismatch, >> https://elixir.bootlin.com/linux/v6.13/source/drivers/iommu/arm/arm- >> smmu-v3/arm-smmu-v3.c#L2860 >> >> And as I mentioned in cover letter, Qemu will report, >> >> " >> Attempt to add the HNS VF to a different SMMUv3 will result in, >> >> -device vfio-pci,host=0000:7d:02.2,bus=pcie.port3,iommufd=iommufd0: >> Unable to attach viommu >> -device vfio-pci,host=0000:7d:02.2,bus=pcie.port3,iommufd=iommufd0: vfio >> 0000:7d:02.2: >> Failed to set iommu_device: [iommufd=29] error attach 0000:7d:02.2 (38) >> to id=11: Invalid argument >> >> At present Qemu is not doing any extra validation other than the above >> failure to make sure the user configuration is correct or not. The >> assumption is libvirt will take care of this. >> " >> So in summary, if the libvirt gets it wrong, Qemu will fail with error. >> >> If a more explicit association is required, some help from kernel is required >> to identify the physical SMMUv3 associated with the device. > Again thinking about this, to have an explicit association in the Qemu command > line between the vSMMUv3 and the phys smmuv3, > > We can possibly add something like, > > -device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \ > -device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \ > -device arm-smmuv3-accel,bus=pcie.1,phys-smmuv3= smmu3.0x0000000100000000 \ > -device vfio-pci,host=0000:7d:02.1,bus=pcie.port1,iommufd=iommufd0 \ > > -device pxb-pcie,id=pcie.2,bus_nr=16,bus=pcie.0 \ > -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2 \ > -device arm-smmuv3-nested,id=smmuv2,pci-bus=pcie.2, phys-smmuv3= smmu3.0x0000000200000000 \ > -device vfio-pci,host=0000:75:00.1,bus=pcie.port2,iommufd=iommufd0 \ > > etc. > > And Qemu does some checking to make sure that the device is indeed associated > with the specified phys-smmuv3. This can be done going through the sysfs path checking > which is what I guess libvirt is currently doing to populate the topology. So basically > Qemu is just replicating that to validate again. > > Or another option is extending the IOMMU_GET_HW_INFO IOCTL to return the phys > smmuv3 base address which can avoid going through the sysfs. > > The only difference between the current approach(kernel failing the attach implicitly) > and the above is, Qemu can provide a validation of inputs and may be report a better > error message than just saying " Unable to attach viommu/: Invalid argument". > > If the command line looks Ok, I will go with the sysfs path validation method first in my > next respin. The command line looks sensible to me. on vfio we use host=6810000.ethernet. Maybe reuse this instead of phys-smmuv3? Thanks Eric > > Please let me know. > > Thanks, > Shameer > > > >
On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum via wrote: > Hi, > > This series adds initial support for a user-creatable "arm-smmuv3-nested" > device to Qemu. At present the Qemu ARM SMMUv3 emulation is per machine > and cannot support multiple SMMUv3s. > > In order to support vfio-pci dev assignment with vSMMUv3, the physical > SMMUv3 has to be configured in nested mode. Having a pluggable > "arm-smmuv3-nested" device enables us to have multiple vSMMUv3 for Guests > running on a host with multiple physical SMMUv3s. A few benefits of doing > this are, I'm not very familiar with arm, but from this description I'm not really seeing how "nesting" is involved here. You're only talking about the host and 1 L1 guest, no L2 guest. Also what is the relation between the physical SMMUv3 and the guest SMMUv3 that's referenced ? Is this in fact some form of host device passthrough rather than nesting ? With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Fri, Dec 13, 2024 at 12:00:43PM +0000, Daniel P. Berrangé wrote: > On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum via wrote: > > Hi, > > > > This series adds initial support for a user-creatable "arm-smmuv3-nested" > > device to Qemu. At present the Qemu ARM SMMUv3 emulation is per machine > > and cannot support multiple SMMUv3s. > > > > In order to support vfio-pci dev assignment with vSMMUv3, the physical > > SMMUv3 has to be configured in nested mode. Having a pluggable > > "arm-smmuv3-nested" device enables us to have multiple vSMMUv3 for Guests > > running on a host with multiple physical SMMUv3s. A few benefits of doing > > this are, > > I'm not very familiar with arm, but from this description I'm not > really seeing how "nesting" is involved here. You're only talking > about the host and 1 L1 guest, no L2 guest. nesting is the term the iommu side is using to refer to the 2 dimensional paging, ie a guest page table on top of a hypervisor page table. Nothing to do with vm nesting. > Also what is the relation between the physical SMMUv3 and the guest > SMMUv3 that's referenced ? Is this in fact some form of host device > passthrough rather than nesting ? It is an acceeleration feature, the iommu HW does more work instead of the software emulating things. Similar to how the 2d paging option in KVM is an acceleration feature. All of the iommu series on vfio are creating paravirtualized iommu models inside the VM. They access various levels of HW acceleration to speed up the paravirtualization. Jason
On Fri, 13 Dec 2024 at 12:46, Jason Gunthorpe <jgg@nvidia.com> wrote: > > On Fri, Dec 13, 2024 at 12:00:43PM +0000, Daniel P. Berrangé wrote: > > On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum via wrote: > > > Hi, > > > > > > This series adds initial support for a user-creatable "arm-smmuv3-nested" > > > device to Qemu. At present the Qemu ARM SMMUv3 emulation is per machine > > > and cannot support multiple SMMUv3s. > > > > > > In order to support vfio-pci dev assignment with vSMMUv3, the physical > > > SMMUv3 has to be configured in nested mode. Having a pluggable > > > "arm-smmuv3-nested" device enables us to have multiple vSMMUv3 for Guests > > > running on a host with multiple physical SMMUv3s. A few benefits of doing > > > this are, > > > > I'm not very familiar with arm, but from this description I'm not > > really seeing how "nesting" is involved here. You're only talking > > about the host and 1 L1 guest, no L2 guest. > > nesting is the term the iommu side is using to refer to the 2 > dimensional paging, ie a guest page table on top of a hypervisor page > table. Isn't that more usually called "two stage" paging? Calling that "nesting" seems like it is going to be massively confusing... Also, how does it relate to what this series seems to be doing, where we provide the guest with two separate SMMUs? (Are those two SMMUs "nested" in the sense that one is sitting behind the other?) thanks -- PMM
> -----Original Message-----
> From: Peter Maydell <peter.maydell@linaro.org>
> Sent: Friday, December 13, 2024 1:33 PM
> To: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Daniel P. Berrangé <berrange@redhat.com>; Shameerali Kolothum
> Thodi <shameerali.kolothum.thodi@huawei.com>; qemu-arm@nongnu.org;
> qemu-devel@nongnu.org; eric.auger@redhat.com; nicolinc@nvidia.com;
> ddutile@redhat.com; Linuxarm <linuxarm@huawei.com>; Wangzhou (B)
> <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>;
> zhangfei.gao@linaro.org
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> On Fri, 13 Dec 2024 at 12:46, Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Fri, Dec 13, 2024 at 12:00:43PM +0000, Daniel P. Berrangé wrote:
> > > On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum via wrote:
> > > > Hi,
> > > >
> > > > This series adds initial support for a user-creatable "arm-smmuv3-
> nested"
> > > > device to Qemu. At present the Qemu ARM SMMUv3 emulation is per
> machine
> > > > and cannot support multiple SMMUv3s.
> > > >
> > > > In order to support vfio-pci dev assignment with vSMMUv3, the
> physical
> > > > SMMUv3 has to be configured in nested mode. Having a pluggable
> > > > "arm-smmuv3-nested" device enables us to have multiple vSMMUv3
> for Guests
> > > > running on a host with multiple physical SMMUv3s. A few benefits of
> doing
> > > > this are,
> > >
> > > I'm not very familiar with arm, but from this description I'm not
> > > really seeing how "nesting" is involved here. You're only talking
> > > about the host and 1 L1 guest, no L2 guest.
> >
> > nesting is the term the iommu side is using to refer to the 2
> > dimensional paging, ie a guest page table on top of a hypervisor page
> > table.
>
> Isn't that more usually called "two stage" paging? Calling
> that "nesting" seems like it is going to be massively confusing...
Yes. This will be renamed to "arm-smmuv3-accel" in future revisions.
>
> Also, how does it relate to what this series seems to be
> doing, where we provide the guest with two separate SMMUs?
> (Are those two SMMUs "nested" in the sense that one is sitting
> behind the other?)
I don't think it requires two SMMUs in the Guest. The nested or "two
stage" mode means the stage 1 page table is owned by the Guest and stage 2
by the host. And this is achieved by IOMMUFD-provided IOCTLs (a rough
sketch follows below).
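For readers less familiar with the iommufd side, the split described above looks roughly like the sketch below. This is only an illustrative sequence against the Linux iommufd uAPI (include/uapi/linux/iommufd.h); the function name and parameters are invented for the example, it is not the code from the branches referenced in this thread, and error handling, the vIOMMU object and the final attach step are omitted.

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>  /* struct iommu_hwpt_alloc and friends */

/* Illustrative only: allocate a host-owned stage-2 "nesting parent" HWPT and
 * a guest-owned stage-1 HWPT nested on top of it, for one assigned device. */
static int alloc_nested_hwpts(int iommufd, uint32_t dev_id, uint32_t ioas_id,
                              uint32_t *out_s1_hwpt)
{
    struct iommu_hwpt_alloc s2 = {
        .size   = sizeof(s2),
        .flags  = IOMMU_HWPT_ALLOC_NEST_PARENT,  /* stage 2, owned by the host */
        .dev_id = dev_id,
        .pt_id  = ioas_id,                       /* backed by the guest-memory IOAS */
    };
    if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &s2))
        return -1;

    struct iommu_hwpt_arm_smmuv3 ste = { .ste = { 0, 0 } }; /* STE words mirrored from the guest */
    struct iommu_hwpt_alloc s1 = {
        .size      = sizeof(s1),
        .dev_id    = dev_id,
        .pt_id     = s2.out_hwpt_id,             /* nested on the stage-2 parent */
        .data_type = IOMMU_HWPT_DATA_ARM_SMMUV3,
        .data_len  = sizeof(ste),
        .data_uptr = (uintptr_t)&ste,            /* stage 1, configured by the guest */
    };
    if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &s1))
        return -1;

    *out_s1_hwpt = s1.out_hwpt_id;
    return 0;
}

The assigned device is then attached to the stage-1 HWPT (e.g. via VFIO_DEVICE_ATTACH_IOMMUFD_PT), so the guest programs its stage-1 tables directly while the host keeps ownership of stage 2.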
There is a precursor to this series where support for hw-accelerated
two-stage translation is added to the Qemu SMMUv3 code.
Please see the complete branch here,
https://github.com/hisilicon/qemu/commits/private-smmuv3-nested-dev-rfc-v1/
And patches prior to this commit adds that support:
4ccdbe3: ("cover-letter: Add HW accelerated nesting support for arm
SMMUv3")
Nicolin is soon going to send out those for review. Or I can include
those in this series so that it gives a complete picture. Nicolin?
Hope this clarifies any confusion.
Thanks,
Shameer
On Mon, Dec 16, 2024 at 10:01:29AM +0000, Shameerali Kolothum Thodi wrote:
> And patches prior to this commit adds that support:
> 4ccdbe3: ("cover-letter: Add HW accelerated nesting support for arm
> SMMUv3")
>
> Nicolin is soon going to send out those for review. Or I can include
> those in this series so that it gives a complete picture. Nicolin?
Just found that I forgot to reply this one...sorry
I asked Don/Eric to take over that vSMMU series:
https://lore.kernel.org/qemu-devel/Zy0jiPItu8A3wNTL@Asurada-Nvidia/
(The majority of my effort has been still on the kernel side:
previously vIOMMU/vDEVICE, and now vEVENTQ/MSI/vCMDQ..)
Don/Eric, is there any update from your side?
I think it's also a good time to align with each other so we
can take our next step in the new year :)
Thanks
Nicolin
Hi Nicolin,
On 1/9/25 5:45 AM, Nicolin Chen wrote:
> On Mon, Dec 16, 2024 at 10:01:29AM +0000, Shameerali Kolothum Thodi wrote:
>> And patches prior to this commit adds that support:
>> 4ccdbe3: ("cover-letter: Add HW accelerated nesting support for arm
>> SMMUv3")
>>
>> Nicolin is soon going to send out those for review. Or I can include
>> those in this series so that it gives a complete picture. Nicolin?
> Just found that I forgot to reply this one...sorry
>
> I asked Don/Eric to take over that vSMMU series:
> https://lore.kernel.org/qemu-devel/Zy0jiPItu8A3wNTL@Asurada-Nvidia/
> (The majority of my effort has been still on the kernel side:
> previously vIOMMU/vDEVICE, and now vEVENTQ/MSI/vCMDQ..)
>
> Don/Eric, is there any update from your side?
To be honest we have not much progressed so far. On my end I can
dedicate some cycles now. I currently try to understand how and what
subset I can respin and which test setup can be used. I will come back
to you next week.
Eric
>
> I think it's also a good time to align with each other so we
> can take our next step in the new year :)
>
> Thanks
> Nicolin
>
On Fri, Jan 31, 2025 at 05:54:56PM +0100, Eric Auger wrote:
> On 1/9/25 5:45 AM, Nicolin Chen wrote:
> > On Mon, Dec 16, 2024 at 10:01:29AM +0000, Shameerali Kolothum Thodi wrote:
> >> And patches prior to this commit adds that support:
> >> 4ccdbe3: ("cover-letter: Add HW accelerated nesting support for arm
> >> SMMUv3")
> >>
> >> Nicolin is soon going to send out those for review. Or I can include
> >> those in this series so that it gives a complete picture. Nicolin?
> > Just found that I forgot to reply this one...sorry
> >
> > I asked Don/Eric to take over that vSMMU series:
> > https://lore.kernel.org/qemu-devel/Zy0jiPItu8A3wNTL@Asurada-Nvidia/
> > (The majority of my effort has been still on the kernel side:
> > previously vIOMMU/vDEVICE, and now vEVENTQ/MSI/vCMDQ..)
> >
> > Don/Eric, is there any update from your side?
> To be honest we have not much progressed so far. On my end I can
> dedicate some cycles now. I currently try to understand how and what
> subset I can respin and which test setup can be used. I will come back
> to you next week.
In summary, we will have the following series:
1) HWPT uAPI patches in backends/iommufd.c (Zhenzhong or Shameer)
https://lore.kernel.org/qemu-devel/SJ0PR11MB6744943702EB5798EC9B3B9992E02@SJ0PR11MB6744.namprd11.prod.outlook.com/
2) vIOMMU uAPI patches in backends/iommufd.c (I will rebase/send)
3) vSMMUv3 patches for HW-acc/nesting (Hoping Don/you could take over)
4) Shameer's work on "-device" in ARM virt.c
5) vEVENTQ for fault injection (if time is right, squash into 2/3)
Perhaps, 3/4 would come in a different order, or maybe 4 could split
into a few patches changing "-device" (sending before 3) and then a
few other patches adding multi-vSMMU support (sending after 3).
My latest QEMU branch for reference:
https://github.com/nicolinc/qemu/commits/wip/for_iommufd_veventq-v6
It hasn't integrated Shameer's and Nathan's work though..
For testing, use this kernel branch:
https://github.com/nicolinc/iommufd/commits/iommufd_veventq-v6-with-rmr
I think we'd need to build a shared branch by integrating the latest
series in the list above.
Thanks
Nicolin
Hi Nicolin, Shameer,
On 2/3/25 7:50 PM, Nicolin Chen wrote:
> On Fri, Jan 31, 2025 at 05:54:56PM +0100, Eric Auger wrote:
>> On 1/9/25 5:45 AM, Nicolin Chen wrote:
>>> On Mon, Dec 16, 2024 at 10:01:29AM +0000, Shameerali Kolothum Thodi wrote:
>>>> And patches prior to this commit adds that support:
>>>> 4ccdbe3: ("cover-letter: Add HW accelerated nesting support for arm
>>>> SMMUv3")
>>>>
>>>> Nicolin is soon going to send out those for review. Or I can include
>>>> those in this series so that it gives a complete picture. Nicolin?
>>> Just found that I forgot to reply this one...sorry
>>>
>>> I asked Don/Eric to take over that vSMMU series:
>>> https://lore.kernel.org/qemu-devel/Zy0jiPItu8A3wNTL@Asurada-Nvidia/
>>> (The majority of my effort has been still on the kernel side:
>>> previously vIOMMU/vDEVICE, and now vEVENTQ/MSI/vCMDQ..)
>>>
>>> Don/Eric, is there any update from your side?
>> To be honest we have not much progressed so far. On my end I can
>> dedicate some cycles now. I currently try to understand how and what
>> subset I can respin and which test setup can be used. I will come back
>> to you next week.
> In summary, we will have the following series:
> 1) HWPT uAPI patches in backends/iommufd.c (Zhenzhong or Shameer)
> https://lore.kernel.org/qemu-devel/SJ0PR11MB6744943702EB5798EC9B3B9992E02@SJ0PR11MB6744.namprd11.prod.outlook.com/
> 2) vIOMMU uAPI patches in backends/iommufd.c (I will rebase/send)
for 1 and 2, are you talking about the "Add VIOMMU infrastructure support"
series in Shameer's branch: private-smmuv3-nested-dev-rfc-v1?
Sorry, I may instead be referring to NVidia's or Intel's branch, but I am
not sure about the latter ones.
> 3) vSMMUv3 patches for HW-acc/nesting (Hoping Don/you could take over)
We can start sending it upstream assuming we have a decent test environment.
However in
https://lore.kernel.org/all/329445b2f68a47269292aefb34584375@huawei.com/
Shameer suggested he may include it in his SMMU multi instance series.
What do you both prefer?
Eric
> 4) Shameer's work on "-device" in ARM virt.c
> 5) vEVENTQ for fault injection (if time is right, squash into 2/3)
>
> Perhaps, 3/4 would come in a different order, or maybe 4 could split
> into a few patches changing "-device" (sending before 3) and then a
> few other patches adding multi-vSMMU support (sending after 3).
>
> My latest QEMU branch for reference:
> https://github.com/nicolinc/qemu/commits/wip/for_iommufd_veventq-v6
> It hasn't integrated Shameer's and Nathan's work though..
> For testing, use this kernel branch:
> https://github.com/nicolinc/iommufd/commits/iommufd_veventq-v6-with-rmr
>
> I think we'd need to build a shared branch by integrating the latest
> series in the list above.
>
> Thanks
> Nicolin
>
On Tue, Feb 04, 2025 at 06:49:15PM +0100, Eric Auger wrote:
> > In summary, we will have the following series:
> > 1) HWPT uAPI patches in backends/iommufd.c (Zhenzhong or Shameer)
> > https://lore.kernel.org/qemu-devel/SJ0PR11MB6744943702EB5798EC9B3B9992E02@SJ0PR11MB6744.namprd11.prod.outlook.com/
> > 2) vIOMMU uAPI patches in backends/iommufd.c (I will rebase/send)
> for 1 and 2, are you talking about the "Add VIOMMU infrastructure support"
> series in Shameer's branch: private-smmuv3-nested-dev-rfc-v1.
> Sorry I may instead refer to NVidia or Intel's branch but I am not sure
> about the last ones.

That "vIOMMU infrastructure" is for 2, yes.

For 1, it's inside Intel's series:
"cover-letter: intel_iommu: Enable stage-1 translation for passthrough device"

So, we need to extract them out and send them separately.

> > 3) vSMMUv3 patches for HW-acc/nesting (Hoping Don/you could take over)
> We can start sending it upstream assuming we have a decent test environment.
>
> However in
> https://lore.kernel.org/all/329445b2f68a47269292aefb34584375@huawei.com/
>
> Shameer suggested he may include it in his SMMU multi instance series.
> What do you both prefer?

Sure, I think it's good to include those patches, though I believe
we need to build a new shared branch as Shameer's branch might not
reflect the latest kernel uAPI header.

Here is a new branch on top of latest master tree (v9.2.50):
https://github.com/nicolinc/qemu/commits/wip/for_shameer_02042025

I took HWPT patches from Zhenzhong's series and rebased all related
changes from my tree. I did some sanity testing and it should work with RMR.

Shameer, would you please try this branch and then integrate your
series on top of the following series?
  cover-letter: Add HW accelerated nesting support for arm SMMUv3
  cover-letter: Add vIOMMU-based nesting infrastructure support
  cover-letter: Add HWPT-based nesting infrastructure support
Basically, just replace my old multi-instance series with yours, to
create a shared branch for all of us.

Eric, perhaps you can start to look at these series. Even the
first two iommufd series are a bit of a rough integration :)

Thanks
Nicolin
Hi Nicolin, > -----Original Message----- > From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Wednesday, February 5, 2025 12:09 AM > To: Shameerali Kolothum Thodi > <shameerali.kolothum.thodi@huawei.com>; Eric Auger > <eric.auger@redhat.com> > Cc: ddutile@redhat.com; Peter Maydell <peter.maydell@linaro.org>; Jason > Gunthorpe <jgg@nvidia.com>; Daniel P. Berrangé <berrange@redhat.com>; > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Linuxarm > <linuxarm@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>; > jiangkunkun <jiangkunkun@huawei.com>; Jonathan Cameron > <jonathan.cameron@huawei.com>; zhangfei.gao@linaro.org > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable > nested SMMUv3 > > On Tue, Feb 04, 2025 at 06:49:15PM +0100, Eric Auger wrote: > > > In summary, we will have the following series: > > > 1) HWPT uAPI patches in backends/iommufd.c (Zhenzhong or Shameer) > > > https://lore.kernel.org/qemu- > devel/SJ0PR11MB6744943702EB5798EC9B3B9992E02@SJ0PR11MB6744.nam > prd11.prod.outlook.com/ > > > 2) vIOMMU uAPI patches in backends/iommufd.c (I will rebase/send) > > > for 1 and 2, are you taking about the "Add VIOMMU infrastructure > support > > " series in Shameer's branch: private-smmuv3-nested-dev-rfc-v1. > > Sorry I may instead refer to NVidia or Intel's branch but I am not sure > > about the last ones. > > That "vIOMMU infrastructure" is for 2, yes. > > For 1, it's inside the Intel's series: > "cover-letter: intel_iommu: Enable stage-1 translation for passthrough > device" > > So, we need to extract them out and make it separately.. > > > > 3) vSMMUv3 patches for HW-acc/nesting (Hoping Don/you could take > over) > > We can start sending it upstream assuming we have a decent test > environment. > > > > However in > > > https://lore.kernel.org/all/329445b2f68a47269292aefb34584375@huawei.c > om/ > > > > Shameer suggested he may include it in his SMMU multi instance series. > > What do you both prefer? > > Sure, I think it's good to include those patches, One of the feedback I received on my series was to rename "arm-smmuv3-nested" to "arm-smmuv3-accel" and possibly rename function names to include "accel' as well and move those functions to a separate "smmuv3-accel.c" file. I suppose that applies to the " Add HW accelerated nesting support for arm SMMUv3" series as well. Is that fine with you? Thanks, Shameer
On Thu, Feb 06, 2025 at 10:34:15AM +0000, Shameerali Kolothum Thodi wrote:
> > -----Original Message-----
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > On Tue, Feb 04, 2025 at 06:49:15PM +0100, Eric Auger wrote:
> > > However in
> > >
> > > Shameer suggested he may include it in his SMMU multi instance series.
> > > What do you both prefer?
> >
> > Sure, I think it's good to include those patches,
>
> One of the feedback I received on my series was to rename "arm-smmuv3-nested"
> to "arm-smmuv3-accel" and possibly rename function names to include "accel" as
> well and move those functions to a separate "smmuv3-accel.c" file. I suppose
> that applies to the "Add HW accelerated nesting support for arm SMMUv3" series
> as well.
>
> Is that fine with you?

Oh, no problem. If you want to rename the whole thing, please feel
free. I do see the naming conflict between the "nested" stage and
the "nested" HW feature, which are both supported by the vSMMU now.

Thanks
Nicolin
Hi Nicolin,

> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, February 6, 2025 6:58 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: Eric Auger <eric.auger@redhat.com>; ddutile@redhat.com;
> Peter Maydell <peter.maydell@linaro.org>; Jason Gunthorpe <jgg@nvidia.com>;
> Daniel P. Berrangé <berrange@redhat.com>; qemu-arm@nongnu.org;
> qemu-devel@nongnu.org; Linuxarm <linuxarm@huawei.com>;
> Wangzhou (B) <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>; zhangfei.gao@linaro.org
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> [..]
> > One of the feedback I received on my series was to rename "arm-smmuv3-nested"
> > to "arm-smmuv3-accel" and possibly rename function names to include "accel" as
> > well and move those functions to a separate "smmuv3-accel.c" file. I suppose
> > that applies to the "Add HW accelerated nesting support for arm SMMUv3" series
> > as well.
> >
> > Is that fine with you?
>
> Oh, no problem. If you want to rename the whole thing, please feel
> free. I do see the naming conflict between the "nested" stage and
> the "nested" HW feature, which are both supported by the vSMMU now.

I am working on the above now and have a quick question for you 😊.

Looking at the smmu_dev_attach_viommu() fn here[0], it appears to do
the following:

1. Alloc a s2_hwpt if not allocated already and attach it.
2. Allocate abort and bypass hwpt.
3. Attach bypass hwpt.

I didn't get why we are doing step 3 here. To me it looks like, when we
attach the s2_hwpt (ie, the nested parent domain attach), the kernel
will do,

arm_smmu_attach_dev()
  arm_smmu_make_s2_domain_ste()

It appears that through step 3 we achieve the same thing again.

Or it is possible I missed something obvious here. Please let me know.

Thanks,
Shameer

[0] https://github.com/nicolinc/qemu/blob/wip/for_shameer_02042025/hw/arm/smmu-common.c#L910C13-L910C35
On Mon, Mar 03, 2025 at 03:21:57PM +0000, Shameerali Kolothum Thodi wrote:
> I am working on the above now and have a quick question for you 😊.
>
> Looking at the smmu_dev_attach_viommu() fn here[0], it appears to do
> the following:
>
> 1. Alloc a s2_hwpt if not allocated already and attach it.
> 2. Allocate abort and bypass hwpt.
> 3. Attach bypass hwpt.
>
> I didn't get why we are doing step 3 here. To me it looks like, when we
> attach the s2_hwpt (ie, the nested parent domain attach), the kernel
> will do,
>
> arm_smmu_attach_dev()
>   arm_smmu_make_s2_domain_ste()
>
> It appears that through step 3 we achieve the same thing again.
>
> Or it is possible I missed something obvious here.

Because a device cannot attach to a vIOMMU object directly, but
only via a proxy hwpt_nested. So, this bypass hwpt gives us the
port to associate the device to the vIOMMU, before a vDEVICE or
a "translate" hwpt_nested is allocated.

Currently it's the same because an S2 parent hwpt holds a VMID,
so we could just attach the device to the S2 hwpt for the same
STE configuration as attaching the device to the proxy bypass
hwpt. Yet, this will change in the future after letting vIOMMU
objects hold their own VMIDs to share a common S2 parent hwpt
that won't have a VMID, i.e. arm_smmu_make_s2_domain_ste() will
need the vIOMMU object to get the VMID for STE.

I should have added a few lines of comments there :)

Thanks
Nicolin
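Putting Shameer's three steps and Nicolin's explanation together, the flow is roughly the pseudocode below. The types and helper names are placeholders invented for this sketch, not the actual functions in the referenced smmu-common.c:

/* Placeholder types/helpers, just to make the sketch self-contained. */
typedef struct { int id; } HWPT;
enum { CFG_ABORT, CFG_BYPASS };

static HWPT *hwpt_alloc_s2_parent(void) { static HWPT h = { 1 }; return &h; }
static HWPT *hwpt_alloc_nested(HWPT *parent, int cfg)
{
    static HWPT h[2];
    (void)parent;           /* nested under the S2 parent / vIOMMU object */
    h[cfg].id = 10 + cfg;
    return &h[cfg];
}
static int dev_attach_hwpt(int devid, HWPT *hwpt) { (void)devid; (void)hwpt; return 0; }

static HWPT *s2_hwpt, *abort_hwpt, *bypass_hwpt;

static int smmu_dev_attach_viommu_sketch(int devid)
{
    /* 1. Allocate the stage-2 "nesting parent" hwpt once per vSMMU and
     *    attach the device to it. */
    if (!s2_hwpt)
        s2_hwpt = hwpt_alloc_s2_parent();
    dev_attach_hwpt(devid, s2_hwpt);

    /* 2. Allocate the abort and bypass proxy hwpt_nesteds: a device cannot
     *    attach to the vIOMMU object directly, only via a nested hwpt. */
    abort_hwpt  = hwpt_alloc_nested(s2_hwpt, CFG_ABORT);
    bypass_hwpt = hwpt_alloc_nested(s2_hwpt, CFG_BYPASS);

    /* 3. Park the device on the bypass proxy until a vDEVICE / "translate"
     *    hwpt_nested is allocated. Today the resulting STE matches a plain
     *    S2 attach, but it won't once vIOMMU objects own the VMID. */
    return dev_attach_hwpt(devid, bypass_hwpt);
}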
> -----Original Message----- > From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Monday, March 3, 2025 5:05 PM > To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> > Cc: Eric Auger <eric.auger@redhat.com>; ddutile@redhat.com; Peter > Maydell <peter.maydell@linaro.org>; Jason Gunthorpe <jgg@nvidia.com>; > Daniel P. Berrangé <berrange@redhat.com>; qemu-arm@nongnu.org; > qemu-devel@nongnu.org; Linuxarm <linuxarm@huawei.com>; Wangzhou > (B) <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>; > Jonathan Cameron <jonathan.cameron@huawei.com>; > zhangfei.gao@linaro.org > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable > nested SMMUv3 > > On Mon, Mar 03, 2025 at 03:21:57PM +0000, Shameerali Kolothum Thodi > wrote: > > I am working on the above now and have quick question to you😊. > > > > Looking at the smmu_dev_attach_viommu() fn here[0], > > it appears to do the following: > > > > 1. Alloc a s2_hwpt if not allocated already and attach it. > > 2. Allocate abort and bypass hwpt > > 3. Attach bypass hwpt. > > > > I didn't get why we are doing the step 3 here. To me it looks like, > > when we attach the s2_hwpt(ie, the nested parent domain attach), > > the kernel will do, > > > > arm_smmu_attach_dev() > > arm_smmu_make_s2_domain_ste() > > > > It appears through step 3, we achieve the same thing again. > > > > Or it is possible I missed something obvious here. > > Because a device cannot attach to a vIOMMU object directly, but > only via a proxy hwpt_nested. So, this bypass hwpt gives us the > port to associate the device to the vIOMMU, before a vDEVICE or > a "translate" hwpt_nested is allocated. > > Currently it's the same because an S2 parent hwpt holds a VMID, > so we could just attach the device to the S2 hwpt for the same > STE configuration as attaching the device to the proxy bypass > hwpt. Yet, this will change in the future after letting vIOMMU > objects hold their own VMIDs to share a common S2 parent hwpt > that won't have a VMID, i.e. arm_smmu_make_s2_domain_ste() will > need the vIOMMU object to get the VMID for STE. > > I should have added a few lines of comments there :) Ok. Thanks for the explanation. I will keep it then and add few comments to make it clear. Do you have an initial implementation of the above with vIOMMU object holding the VMIDs to share? Actually I do have a dependency on that for my KVM pinned VMID series[0] where it was suggested that the VMID should associated with a vIOMMU object rather than the IOMMUFD context I used in there. And Jason mentioned about the work involved to do that here[1]. Appreciate if you could share if any progress is made on that so that I can try to rebase that KVM Pinned series on top of that and give it a try. Thanks, Shameer [0] https://lore.kernel.org/linux-iommu/20240208151837.35068-1-shameerali.kolothum.thodi@huawei.com/ [1] https://lore.kernel.org/linux-arm-kernel/20241129150628.GG1253388@nvidia.com/
Hi Nicolin, On 2/5/25 1:08 AM, Nicolin Chen wrote: > On Tue, Feb 04, 2025 at 06:49:15PM +0100, Eric Auger wrote: >>> In summary, we will have the following series: >>> 1) HWPT uAPI patches in backends/iommufd.c (Zhenzhong or Shameer) >>> https://lore.kernel.org/qemu-devel/SJ0PR11MB6744943702EB5798EC9B3B9992E02@SJ0PR11MB6744.namprd11.prod.outlook.com/ >>> 2) vIOMMU uAPI patches in backends/iommufd.c (I will rebase/send) >> for 1 and 2, are you taking about the "Add VIOMMU infrastructure support >> " series in Shameer's branch: private-smmuv3-nested-dev-rfc-v1. >> Sorry I may instead refer to NVidia or Intel's branch but I am not sure >> about the last ones. > That "vIOMMU infrastructure" is for 2, yes. > > For 1, it's inside the Intel's series: > "cover-letter: intel_iommu: Enable stage-1 translation for passthrough device" > > So, we need to extract them out and make it separately.. OK > >>> 3) vSMMUv3 patches for HW-acc/nesting (Hoping Don/you could take over) >> We can start sending it upstream assuming we have a decent test environment. >> >> However in >> https://lore.kernel.org/all/329445b2f68a47269292aefb34584375@huawei.com/ >> >> Shameer suggested he may include it in his SMMU multi instance series. >> What do you both prefer? > Sure, I think it's good to include those patches, though I believe > we need to build a new shared branch as Shameer's branch might not > reflect the latest kernel uAPI header. > > Here is a new branch on top of latest master tree (v9.2.50): > https://github.com/nicolinc/qemu/commits/wip/for_shameer_02042025 > > I took HWPT patches from Zhenzhong's series and rebased all related > changes from my tree. I did some sanity and it should work with RMR. > > Shameer, would you please try this branch and then integrate your > series on top of the following series? > cover-letter: Add HW accelerated nesting support for arm SMMUv3 > cover-letter: Add vIOMMU-based nesting infrastructure support > cover-letter: Add HWPT-based nesting infrastructure support > Basically, just replace my old multi-instance series with yours, to > create a shared branch for all of us. > > Eric, perhaps you can start to look at the these series. Even the > first two iommufd series are a bit of rough integrations :) OK I am starting this week Eric > > Thanks > Nicolin >
> -----Original Message----- > From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Wednesday, February 5, 2025 12:09 AM > To: Shameerali Kolothum Thodi > <shameerali.kolothum.thodi@huawei.com>; Eric Auger > <eric.auger@redhat.com> > Cc: ddutile@redhat.com; Peter Maydell <peter.maydell@linaro.org>; Jason > Gunthorpe <jgg@nvidia.com>; Daniel P. Berrangé <berrange@redhat.com>; > qemu-arm@nongnu.org; qemu-devel@nongnu.org; Linuxarm > <linuxarm@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>; > jiangkunkun <jiangkunkun@huawei.com>; Jonathan Cameron > <jonathan.cameron@huawei.com>; zhangfei.gao@linaro.org > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable > nested SMMUv3 > > On Tue, Feb 04, 2025 at 06:49:15PM +0100, Eric Auger wrote: > > > In summary, we will have the following series: > > > 1) HWPT uAPI patches in backends/iommufd.c (Zhenzhong or Shameer) > > > https://lore.kernel.org/qemu- > devel/SJ0PR11MB6744943702EB5798EC9B3B9992E02@SJ0PR11MB6744.nam > prd11.prod.outlook.com/ > > > 2) vIOMMU uAPI patches in backends/iommufd.c (I will rebase/send) > > > for 1 and 2, are you taking about the "Add VIOMMU infrastructure > support > > " series in Shameer's branch: private-smmuv3-nested-dev-rfc-v1. > > Sorry I may instead refer to NVidia or Intel's branch but I am not sure > > about the last ones. > > That "vIOMMU infrastructure" is for 2, yes. > > For 1, it's inside the Intel's series: > "cover-letter: intel_iommu: Enable stage-1 translation for passthrough > device" > > So, we need to extract them out and make it separately.. > > > > 3) vSMMUv3 patches for HW-acc/nesting (Hoping Don/you could take > over) > > We can start sending it upstream assuming we have a decent test > environment. > > > > However in > > > https://lore.kernel.org/all/329445b2f68a47269292aefb34584375@huawei.c > om/ > > > > Shameer suggested he may include it in his SMMU multi instance series. > > What do you both prefer? > > Sure, I think it's good to include those patches, though I believe > we need to build a new shared branch as Shameer's branch might not > reflect the latest kernel uAPI header. > > Here is a new branch on top of latest master tree (v9.2.50): > https://github.com/nicolinc/qemu/commits/wip/for_shameer_02042025 > > I took HWPT patches from Zhenzhong's series and rebased all related > changes from my tree. I did some sanity and it should work with RMR. > > Shameer, would you please try this branch and then integrate your > series on top of the following series? > cover-letter: Add HW accelerated nesting support for arm SMMUv3 > cover-letter: Add vIOMMU-based nesting infrastructure support > cover-letter: Add HWPT-based nesting infrastructure support > Basically, just replace my old multi-instance series with yours, to > create a shared branch for all of us. Ok. I will take a look at that and rebase. Thanks, Shameer
Nicolin,
Hi!
On 1/8/25 11:45 PM, Nicolin Chen wrote:
> On Mon, Dec 16, 2024 at 10:01:29AM +0000, Shameerali Kolothum Thodi wrote:
>> And patches prior to this commit adds that support:
>> 4ccdbe3: ("cover-letter: Add HW accelerated nesting support for arm
>> SMMUv3")
>>
>> Nicolin is soon going to send out those for review. Or I can include
>> those in this series so that it gives a complete picture. Nicolin?
>
> Just found that I forgot to reply this one...sorry
>
> I asked Don/Eric to take over that vSMMU series:
> https://lore.kernel.org/qemu-devel/Zy0jiPItu8A3wNTL@Asurada-Nvidia/
> (The majority of my effort has been still on the kernel side:
> previously vIOMMU/vDEVICE, and now vEVENTQ/MSI/vCMDQ..)
>
> Don/Eric, is there any update from your side?
>
Apologies for delayed response, been at customer site, and haven't been keeping up w/biz email.
Eric is probably waiting for me to get back and chat as well.
Will look to reply early next week.
- Don
> I think it's also a good time to align with each other so we
> can take our next step in the new year :)
>
> Thanks
> Nicolin
>
Hi Don,
On Fri, Jan 10, 2025 at 11:05:24PM -0500, Donald Dutile wrote:
> On 1/8/25 11:45 PM, Nicolin Chen wrote:
> > On Mon, Dec 16, 2024 at 10:01:29AM +0000, Shameerali Kolothum Thodi wrote:
> > > And patches prior to this commit adds that support:
> > > 4ccdbe3: ("cover-letter: Add HW accelerated nesting support for arm
> > > SMMUv3")
> > >
> > > Nicolin is soon going to send out those for review. Or I can include
> > > those in this series so that it gives a complete picture. Nicolin?
> >
> > Just found that I forgot to reply this one...sorry
> >
> > I asked Don/Eric to take over that vSMMU series:
> > https://lore.kernel.org/qemu-devel/Zy0jiPItu8A3wNTL@Asurada-Nvidia/
> > (The majority of my effort has been still on the kernel side:
> > previously vIOMMU/vDEVICE, and now vEVENTQ/MSI/vCMDQ..)
> >
> > Don/Eric, is there any update from your side?
> >
> Apologies for delayed response, been at customer site, and haven't been keeping up w/biz email.
> Eric is probably waiting for me to get back and chat as well.
> Will look to reply early next week.
I wonder if we can make some progress in Feb? If so, we can start
to wrap up the iommufd uAPI patches for HWPT, which was a part of
intel's series but never got sent since their emulated series is
seemingly still pending?
One detail for the uAPI patches is to decide how vIOMMU code will
interact with those backend APIs.. Hopefully, you and Eric should
have something in mind :)
Thanks
Nicolin
> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, January 23, 2025 4:10 AM
> To: Donald Dutile <ddutile@redhat.com>
> Cc: Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>; eric.auger@redhat.com; Peter
> Maydell <peter.maydell@linaro.org>; Jason Gunthorpe <jgg@nvidia.com>;
> Daniel P. Berrangé <berrange@redhat.com>; qemu-arm@nongnu.org;
> qemu-devel@nongnu.org; Linuxarm <linuxarm@huawei.com>; Wangzhou
> (B) <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>;
> zhangfei.gao@linaro.org
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> Hi Don,
>
> On Fri, Jan 10, 2025 at 11:05:24PM -0500, Donald Dutile wrote:
> > On 1/8/25 11:45 PM, Nicolin Chen wrote:
> > > On Mon, Dec 16, 2024 at 10:01:29AM +0000, Shameerali Kolothum Thodi
> wrote:
> > > > And patches prior to this commit adds that support:
> > > > 4ccdbe3: ("cover-letter: Add HW accelerated nesting support for arm
> > > > SMMUv3")
> > > >
> > > > Nicolin is soon going to send out those for review. Or I can include
> > > > those in this series so that it gives a complete picture. Nicolin?
> > >
> > > Just found that I forgot to reply this one...sorry
> > >
> > > I asked Don/Eric to take over that vSMMU series:
> > > https://lore.kernel.org/qemu-devel/Zy0jiPItu8A3wNTL@Asurada-Nvidia/
> > > (The majority of my effort has been still on the kernel side:
> > > previously vIOMMU/vDEVICE, and now vEVENTQ/MSI/vCMDQ..)
> > >
> > > Don/Eric, is there any update from your side?
> > >
> > Apologies for delayed response, been at customer site, and haven't been
> keeping up w/biz email.
> > Eric is probably waiting for me to get back and chat as well.
> > Will look to reply early next week.
>
> I wonder if we can make some progress in Feb? If so, we can start
> to wrap up the iommufd uAPI patches for HWPT, which was a part of
> intel's series but never got sent since their emulated series is
> seemingly still pending?
I think these are the 5 patches that we require from Intel pass-through series,
vfio/iommufd: Implement [at|de]tach_hwpt handlers
vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
HostIOMMUDevice: Introduce realize_late callback
vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD
backends/iommufd: Add helpers for invalidating user-managed HWPT
See the commits from here,
https://github.com/hisilicon/qemu/commit/bbdc65af38fa5723f1bd9b026e292730901f57b5
[CC Zhenzhong]
Hi Zhenzhong,
Just wondering what your plans are for the above patches. If it makes sense and you
are fine with it, I think it is a good idea for one of us to pick those up from that
series and send them out separately so that they can get some review and we can take
them forward.
Thanks,
Shameer
Hi Shameer,
>-----Original Message-----
>From: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
>Subject: RE: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable nested
>SMMUv3
>
>
>
>> -----Original Message-----
>> From: Nicolin Chen <nicolinc@nvidia.com>
>> Sent: Thursday, January 23, 2025 4:10 AM
>> To: Donald Dutile <ddutile@redhat.com>
>> Cc: Shameerali Kolothum Thodi
>> <shameerali.kolothum.thodi@huawei.com>; eric.auger@redhat.com; Peter
>> Maydell <peter.maydell@linaro.org>; Jason Gunthorpe <jgg@nvidia.com>;
>> Daniel P. Berrangé <berrange@redhat.com>; qemu-arm@nongnu.org;
>> qemu-devel@nongnu.org; Linuxarm <linuxarm@huawei.com>; Wangzhou
>> (B) <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>;
>> Jonathan Cameron <jonathan.cameron@huawei.com>;
>> zhangfei.gao@linaro.org
>> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
>> nested SMMUv3
>>
>> Hi Don,
>>
>> On Fri, Jan 10, 2025 at 11:05:24PM -0500, Donald Dutile wrote:
>> > On 1/8/25 11:45 PM, Nicolin Chen wrote:
>> > > On Mon, Dec 16, 2024 at 10:01:29AM +0000, Shameerali Kolothum Thodi
>> wrote:
>> > > > And patches prior to this commit adds that support:
>> > > > 4ccdbe3: ("cover-letter: Add HW accelerated nesting support for arm
>> > > > SMMUv3")
>> > > >
>> > > > Nicolin is soon going to send out those for review. Or I can include
>> > > > those in this series so that it gives a complete picture. Nicolin?
>> > >
>> > > Just found that I forgot to reply this one...sorry
>> > >
>> > > I asked Don/Eric to take over that vSMMU series:
>> > > https://lore.kernel.org/qemu-devel/Zy0jiPItu8A3wNTL@Asurada-Nvidia/
>> > > (The majority of my effort has been still on the kernel side:
>> > > previously vIOMMU/vDEVICE, and now vEVENTQ/MSI/vCMDQ..)
>> > >
>> > > Don/Eric, is there any update from your side?
>> > >
>> > Apologies for delayed response, been at customer site, and haven't been
>> keeping up w/biz email.
>> > Eric is probably waiting for me to get back and chat as well.
>> > Will look to reply early next week.
>>
>> I wonder if we can make some progress in Feb? If so, we can start
>> to wrap up the iommufd uAPI patches for HWPT, which was a part of
>> intel's series but never got sent since their emulated series is
>> seemingly still pending?
>
>I think these are the 5 patches that we require from Intel pass-through series,
>
>vfio/iommufd: Implement [at|de]tach_hwpt handlers
>vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
>HostIOMMUDevice: Introduce realize_late callback
>vfio/iommufd: Add properties and handlers to
>TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>backends/iommufd: Add helpers for invalidating user-managed HWPT
>
>See the commits from here,
>https://github.com/hisilicon/qemu/commit/bbdc65af38fa5723f1bd9b026e29273
>0901f57b5
>
>[CC Zhenzhong]
>
>Hi Zhenzhong,
>
>Just wondering what your plans are for the above patches. If it make sense and
>you
>are fine with it, I think it is a good idea one of us can pick up those from that
>series
>and sent out separately so that it can get some review and take it forward.
The emulated series is merged. I plan to send the Intel pass-through series after
the Chinese festival vacation, but that is at least half a month away. So feel free
to pick the patches you need and send them out for comments.
Thanks
Zhenzhong
Hi Shameer, Nicolin,
>-----Original Message-----
>From: Duan, Zhenzhong
>Subject: RE: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable nested
>SMMUv3
>
>Hi Shameer,
>
>>-----Original Message-----
>>From: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
>>Subject: RE: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
>nested
>>SMMUv3
>>
>>
>>
>>> -----Original Message-----
>>> From: Nicolin Chen <nicolinc@nvidia.com>
>>> Sent: Thursday, January 23, 2025 4:10 AM
>>> To: Donald Dutile <ddutile@redhat.com>
>>> Cc: Shameerali Kolothum Thodi
>>> <shameerali.kolothum.thodi@huawei.com>; eric.auger@redhat.com; Peter
>>> Maydell <peter.maydell@linaro.org>; Jason Gunthorpe <jgg@nvidia.com>;
>>> Daniel P. Berrangé <berrange@redhat.com>; qemu-arm@nongnu.org;
>>> qemu-devel@nongnu.org; Linuxarm <linuxarm@huawei.com>; Wangzhou
>>> (B) <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>;
>>> Jonathan Cameron <jonathan.cameron@huawei.com>;
>>> zhangfei.gao@linaro.org
>>> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
>>> nested SMMUv3
>>>
>>> Hi Don,
>>>
>>> On Fri, Jan 10, 2025 at 11:05:24PM -0500, Donald Dutile wrote:
>>> > On 1/8/25 11:45 PM, Nicolin Chen wrote:
>>> > > On Mon, Dec 16, 2024 at 10:01:29AM +0000, Shameerali Kolothum Thodi
>>> wrote:
>>> > > > And patches prior to this commit adds that support:
>>> > > > 4ccdbe3: ("cover-letter: Add HW accelerated nesting support for arm
>>> > > > SMMUv3")
>>> > > >
>>> > > > Nicolin is soon going to send out those for review. Or I can include
>>> > > > those in this series so that it gives a complete picture. Nicolin?
>>> > >
>>> > > Just found that I forgot to reply this one...sorry
>>> > >
>>> > > I asked Don/Eric to take over that vSMMU series:
>>> > > https://lore.kernel.org/qemu-devel/Zy0jiPItu8A3wNTL@Asurada-Nvidia/
>>> > > (The majority of my effort has been still on the kernel side:
>>> > > previously vIOMMU/vDEVICE, and now vEVENTQ/MSI/vCMDQ..)
>>> > >
>>> > > Don/Eric, is there any update from your side?
>>> > >
>>> > Apologies for delayed response, been at customer site, and haven't been
>>> keeping up w/biz email.
>>> > Eric is probably waiting for me to get back and chat as well.
>>> > Will look to reply early next week.
>>>
>>> I wonder if we can make some progress in Feb? If so, we can start
>>> to wrap up the iommufd uAPI patches for HWPT, which was a part of
>>> intel's series but never got sent since their emulated series is
>>> seemingly still pending?
>>
>>I think these are the 5 patches that we require from Intel pass-through series,
>>
>>vfio/iommufd: Implement [at|de]tach_hwpt handlers
>>vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler
>>HostIOMMUDevice: Introduce realize_late callback
>>vfio/iommufd: Add properties and handlers to
>>TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>>backends/iommufd: Add helpers for invalidating user-managed HWPT
>>
>>See the commits from here,
>>https://github.com/hisilicon/qemu/commit/bbdc65af38fa5723f1bd9b026e2927
>3
>>0901f57b5
>>
>>[CC Zhenzhong]
>>
>>Hi Zhenzhong,
>>
>>Just wondering what your plans are for the above patches. If it make sense and
>>you
>>are fine with it, I think it is a good idea one of us can pick up those from that
>>series
>>and sent out separately so that it can get some review and take it forward.
>
>Emulated series is merged, I plan to send Intel pass-through series after
>Chinese festival vacation, but at least half a month later. So feel free to
>pick those patches you need and send for comments.
I plan to send the vtd nesting series out this week and want to ask about the status
of the "1) HWPT uAPI patches in backends/iommufd.c" series.
If you have already sent them out, I will rebase and skip them to avoid duplicate
review effort in the community. Otherwise I can send them as part of the vtd nesting series.
Thanks
Zhenzhong
Hi Zhenzhong, > -----Original Message----- > From: Duan, Zhenzhong <zhenzhong.duan@intel.com> > Sent: Monday, February 17, 2025 9:17 AM > To: Shameerali Kolothum Thodi > <shameerali.kolothum.thodi@huawei.com>; Nicolin Chen > <nicolinc@nvidia.com>; Donald Dutile <ddutile@redhat.com> > Cc: eric.auger@redhat.com; Peter Maydell <peter.maydell@linaro.org>; > Jason Gunthorpe <jgg@nvidia.com>; Daniel P. Berrangé > <berrange@redhat.com>; qemu-arm@nongnu.org; qemu- > devel@nongnu.org; Linuxarm <linuxarm@huawei.com>; Wangzhou (B) > <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>; > Jonathan Cameron <jonathan.cameron@huawei.com>; > zhangfei.gao@linaro.org; Peng, Chao P <chao.p.peng@intel.com> > Subject: RE: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable > nested SMMUv3 > > Hi Shameer, Nicolin, > [...] > >>Hi Zhenzhong, > >> > >>Just wondering what your plans are for the above patches. If it make > sense and > >>you > >>are fine with it, I think it is a good idea one of us can pick up those from > that > >>series > >>and sent out separately so that it can get some review and take it > forward. > > > >Emulated series is merged, I plan to send Intel pass-through series after > >Chinese festival vacation, but at least half a month later. So feel free to > >pick those patches you need and send for comments. > > I plan to send vtd nesting series out this week and want to ask about status > of "1) HWPT uAPI patches in backends/iommufd.c" series. > > If you had sent it out, I will do a rebase and bypass them to avoid duplicate > review effort in community. Or I can send them in vtd nesting series if you > not yet. No. It is not send out yet. Please include it in your vtd nesting series. Thanks. I am currently working on refactoring the SMMUv3 accel series and the "Add HW accelerated nesting support for arm SMMUv3" series from Nicolin. Thanks, Shameer.
Hi Shammeer, On 2/18/25 7:52 AM, Shameerali Kolothum Thodi wrote: > Hi Zhenzhong, > >> -----Original Message----- >> From: Duan, Zhenzhong <zhenzhong.duan@intel.com> >> Sent: Monday, February 17, 2025 9:17 AM >> To: Shameerali Kolothum Thodi >> <shameerali.kolothum.thodi@huawei.com>; Nicolin Chen >> <nicolinc@nvidia.com>; Donald Dutile <ddutile@redhat.com> >> Cc: eric.auger@redhat.com; Peter Maydell <peter.maydell@linaro.org>; >> Jason Gunthorpe <jgg@nvidia.com>; Daniel P. Berrangé >> <berrange@redhat.com>; qemu-arm@nongnu.org; qemu- >> devel@nongnu.org; Linuxarm <linuxarm@huawei.com>; Wangzhou (B) >> <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>; >> Jonathan Cameron <jonathan.cameron@huawei.com>; >> zhangfei.gao@linaro.org; Peng, Chao P <chao.p.peng@intel.com> >> Subject: RE: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable >> nested SMMUv3 >> >> Hi Shameer, Nicolin, >> > [...] > >>>> Hi Zhenzhong, >>>> >>>> Just wondering what your plans are for the above patches. If it make >> sense and >>>> you >>>> are fine with it, I think it is a good idea one of us can pick up those from >> that >>>> series >>>> and sent out separately so that it can get some review and take it >> forward. >>> Emulated series is merged, I plan to send Intel pass-through series after >>> Chinese festival vacation, but at least half a month later. So feel free to >>> pick those patches you need and send for comments. >> I plan to send vtd nesting series out this week and want to ask about status >> of "1) HWPT uAPI patches in backends/iommufd.c" series. >> >> If you had sent it out, I will do a rebase and bypass them to avoid duplicate >> review effort in community. Or I can send them in vtd nesting series if you >> not yet. > No. It is not send out yet. Please include it in your vtd nesting series. Thanks. > > I am currently working on refactoring the SMMUv3 accel series and the > "Add HW accelerated nesting support for arm SMMUv3" series so will you send "Add HW accelerated nesting support for arm SMMUv3" or do you want me to do it? Thanks Eric > from Nicolin. > > Thanks, > Shameer. > >
> -----Original Message----- > From: Eric Auger <eric.auger@redhat.com> > Sent: Thursday, March 6, 2025 6:00 PM > To: Shameerali Kolothum Thodi > <shameerali.kolothum.thodi@huawei.com>; Duan, Zhenzhong > <zhenzhong.duan@intel.com>; Nicolin Chen <nicolinc@nvidia.com>; > Donald Dutile <ddutile@redhat.com> > Cc: Peter Maydell <peter.maydell@linaro.org>; Jason Gunthorpe > <jgg@nvidia.com>; Daniel P. Berrangé <berrange@redhat.com>; qemu- > arm@nongnu.org; qemu-devel@nongnu.org; Linuxarm > <linuxarm@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>; > jiangkunkun <jiangkunkun@huawei.com>; Jonathan Cameron > <jonathan.cameron@huawei.com>; zhangfei.gao@linaro.org; Peng, Chao P > <chao.p.peng@intel.com> > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable > nested SMMUv3 > > Hi Shammeer, > Hi Eric, > > > > I am currently working on refactoring the SMMUv3 accel series and the > > "Add HW accelerated nesting support for arm SMMUv3" series > so will you send "Add HW accelerated nesting support for arm SMMUv3" or > do you want me to do it? Thanks Eric Yes. I am on it. Hopefully I will be able to send out everything next week. Thanks, Shameer
Hi Shameer, On 3/6/25 7:27 PM, Shameerali Kolothum Thodi wrote: > >> -----Original Message----- >> From: Eric Auger <eric.auger@redhat.com> >> Sent: Thursday, March 6, 2025 6:00 PM >> To: Shameerali Kolothum Thodi >> <shameerali.kolothum.thodi@huawei.com>; Duan, Zhenzhong >> <zhenzhong.duan@intel.com>; Nicolin Chen <nicolinc@nvidia.com>; >> Donald Dutile <ddutile@redhat.com> >> Cc: Peter Maydell <peter.maydell@linaro.org>; Jason Gunthorpe >> <jgg@nvidia.com>; Daniel P. Berrangé <berrange@redhat.com>; qemu- >> arm@nongnu.org; qemu-devel@nongnu.org; Linuxarm >> <linuxarm@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>; >> jiangkunkun <jiangkunkun@huawei.com>; Jonathan Cameron >> <jonathan.cameron@huawei.com>; zhangfei.gao@linaro.org; Peng, Chao P >> <chao.p.peng@intel.com> >> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable >> nested SMMUv3 >> >> Hi Shammeer, >> > Hi Eric, > >>> I am currently working on refactoring the SMMUv3 accel series and the >>> "Add HW accelerated nesting support for arm SMMUv3" series >> so will you send "Add HW accelerated nesting support for arm SMMUv3" or >> do you want me to do it? Thanks Eric > Yes. I am on it. Hopefully I will be able to send out everything next week. Sure. No pressure. I will continue reviewing Zhenzhong's series then. Looking forward to seeing your respin. Eric > > Thanks, > Shameer
On Thu, Jan 23, 2025 at 08:28:34AM +0000, Shameerali Kolothum Thodi wrote: > > -----Original Message----- > > From: Nicolin Chen <nicolinc@nvidia.com> > > I wonder if we can make some progress in Feb? If so, we can start > > to wrap up the iommufd uAPI patches for HWPT, which was a part of > > intel's series but never got sent since their emulated series is > > seemingly still pending? > > I think these are the 5 patches that we require from Intel pass-through series, > > vfio/iommufd: Implement [at|de]tach_hwpt handlers > vfio/iommufd: Implement HostIOMMUDeviceClass::realize_late() handler > HostIOMMUDevice: Introduce realize_late callback > vfio/iommufd: Add properties and handlers to TYPE_HOST_IOMMU_DEVICE_IOMMUFD > backends/iommufd: Add helpers for invalidating user-managed HWPT > Hi Zhenzhong, > > Just wondering what your plans are for the above patches. If it make sense and you > are fine with it, I think it is a good idea one of us can pick up those from that series > and sent out separately so that it can get some review and take it forward. +1 These uAPI/backend patches can be sent in a smaller series to get reviewed prior to the intel/arm series. It can merge with either of the intel/arm series that runs faster at the end of the day :) Nicolin
On Fri, Dec 13, 2024 at 08:46:42AM -0400, Jason Gunthorpe wrote: > On Fri, Dec 13, 2024 at 12:00:43PM +0000, Daniel P. Berrangé wrote: > > On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum via wrote: > > > Hi, > > > > > > This series adds initial support for a user-creatable "arm-smmuv3-nested" > > > device to Qemu. At present the Qemu ARM SMMUv3 emulation is per machine > > > and cannot support multiple SMMUv3s. > > > > > > In order to support vfio-pci dev assignment with vSMMUv3, the physical > > > SMMUv3 has to be configured in nested mode. Having a pluggable > > > "arm-smmuv3-nested" device enables us to have multiple vSMMUv3 for Guests > > > running on a host with multiple physical SMMUv3s. A few benefits of doing > > > this are, > > > > I'm not very familiar with arm, but from this description I'm not > > really seeing how "nesting" is involved here. You're only talking > > about the host and 1 L1 guest, no L2 guest. > > nesting is the term the iommu side is using to refer to the 2 > dimensional paging, ie a guest page table on top of a hypervisor page > table. > > Nothing to do with vm nesting. Ok, that naming is destined to cause confusion for many, given the commonly understood use of 'nesting' in the context of VMs... > > > Also what is the relation between the physical SMMUv3 and the guest > > SMMUv3 that's referenced ? Is this in fact some form of host device > > passthrough rather than nesting ? > > It is an acceeleration feature, the iommu HW does more work instead of > the software emulating things. Similar to how the 2d paging option in > KVM is an acceleration feature. > > All of the iommu series on vfio are creating paravirtualized iommu > models inside the VM. They access various levels of HW acceleration to > speed up the paravirtualization. ... describing it as a HW accelerated iommu makes it significantly clearer to me what this proposal is about. Perhaps the device is better named as "arm-smmuv3-accel" ? With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On 12/13/24 8:19 AM, Daniel P. Berrangé wrote: > On Fri, Dec 13, 2024 at 08:46:42AM -0400, Jason Gunthorpe wrote: >> On Fri, Dec 13, 2024 at 12:00:43PM +0000, Daniel P. Berrangé wrote: >>> On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum via wrote: >>>> Hi, >>>> >>>> This series adds initial support for a user-creatable "arm-smmuv3-nested" >>>> device to Qemu. At present the Qemu ARM SMMUv3 emulation is per machine >>>> and cannot support multiple SMMUv3s. >>>> >>>> In order to support vfio-pci dev assignment with vSMMUv3, the physical >>>> SMMUv3 has to be configured in nested mode. Having a pluggable >>>> "arm-smmuv3-nested" device enables us to have multiple vSMMUv3 for Guests >>>> running on a host with multiple physical SMMUv3s. A few benefits of doing >>>> this are, >>> >>> I'm not very familiar with arm, but from this description I'm not >>> really seeing how "nesting" is involved here. You're only talking >>> about the host and 1 L1 guest, no L2 guest. >> >> nesting is the term the iommu side is using to refer to the 2 >> dimensional paging, ie a guest page table on top of a hypervisor page >> table. >> >> Nothing to do with vm nesting. > > Ok, that naming is destined to cause confusion for many, given the > commonly understood use of 'nesting' in the context of VMs... > >> >>> Also what is the relation between the physical SMMUv3 and the guest >>> SMMUv3 that's referenced ? Is this in fact some form of host device >>> passthrough rather than nesting ? >> >> It is an acceeleration feature, the iommu HW does more work instead of >> the software emulating things. Similar to how the 2d paging option in >> KVM is an acceleration feature. >> >> All of the iommu series on vfio are creating paravirtualized iommu >> models inside the VM. They access various levels of HW acceleration to >> speed up the paravirtualization. > > ... describing it as a HW accelerated iommu makes it significantly clearer > to me what this proposal is about. Perhaps the device is better named as > "arm-smmuv3-accel" ? > I'm having deja-vu! ;-) Thanks for echo-ing my earlier statements in this patch series about the use of 'nested'. and the better use of 'accel' in these circumstances. Even 'accel' on an 'arm-smmuv3' is a bit of a hammer, as there can be multiple accel's features &/or implementations... I would like to see the 'accel' as a parameter to 'arm-smmuv3', and not a complete name-space onto itself, so we can do things like 'accel=cmdvq', accel='2-level', ... and for libvirt's sanity, a way to get those hw features from sysfs for (possible) migration-compatibility testing. > > With regards, > Daniel
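Purely to illustrate the property-style syntax Don is suggesting above — hypothetical, since neither this series nor upstream QEMU implements such an option, and the value names are simply taken from his examples — the invocation would look something like:

-device arm-smmuv3,id=smmu1,pci-bus=pcie.1,accel=on \
-device arm-smmuv3,id=smmu2,pci-bus=pcie.2,accel=cmdvq \

rather than a separate "arm-smmuv3-accel" device type.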
> -----Original Message----- > From: Daniel P. Berrangé <berrange@redhat.com> > Sent: Friday, December 13, 2024 1:20 PM > To: Jason Gunthorpe <jgg@nvidia.com> > Cc: Shameerali Kolothum Thodi > <shameerali.kolothum.thodi@huawei.com>; qemu-arm@nongnu.org; > qemu-devel@nongnu.org; eric.auger@redhat.com; > peter.maydell@linaro.org; nicolinc@nvidia.com; ddutile@redhat.com; > Linuxarm <linuxarm@huawei.com>; Wangzhou (B) > <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>; > Jonathan Cameron <jonathan.cameron@huawei.com>; > zhangfei.gao@linaro.org > Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable > nested SMMUv3 > > On Fri, Dec 13, 2024 at 08:46:42AM -0400, Jason Gunthorpe wrote: > > On Fri, Dec 13, 2024 at 12:00:43PM +0000, Daniel P. Berrangé wrote: > > > On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum via wrote: > > > > Hi, > > > > > > > > This series adds initial support for a user-creatable "arm-smmuv3- > nested" > > > > device to Qemu. At present the Qemu ARM SMMUv3 emulation is per > machine > > > > and cannot support multiple SMMUv3s. > > > > > > > > In order to support vfio-pci dev assignment with vSMMUv3, the > physical > > > > SMMUv3 has to be configured in nested mode. Having a pluggable > > > > "arm-smmuv3-nested" device enables us to have multiple vSMMUv3 > for Guests > > > > running on a host with multiple physical SMMUv3s. A few benefits of > doing > > > > this are, > > > > > > I'm not very familiar with arm, but from this description I'm not > > > really seeing how "nesting" is involved here. You're only talking > > > about the host and 1 L1 guest, no L2 guest. > > > > nesting is the term the iommu side is using to refer to the 2 > > dimensional paging, ie a guest page table on top of a hypervisor page > > table. > > > > Nothing to do with vm nesting. > > Ok, that naming is destined to cause confusion for many, given the > commonly understood use of 'nesting' in the context of VMs... > > > > > > Also what is the relation between the physical SMMUv3 and the guest > > > SMMUv3 that's referenced ? Is this in fact some form of host device > > > passthrough rather than nesting ? > > > > It is an acceeleration feature, the iommu HW does more work instead of > > the software emulating things. Similar to how the 2d paging option in > > KVM is an acceleration feature. > > > > All of the iommu series on vfio are creating paravirtualized iommu > > models inside the VM. They access various levels of HW acceleration to > > speed up the paravirtualization. > > ... describing it as a HW accelerated iommu makes it significantly clearer > to me what this proposal is about. Perhaps the device is better named as > "arm-smmuv3-accel" ? Agree. There were similar previous comments from reviewers that current smmuv3 already has emulated stage 1 and stage 2 support and refers to that as "nested" in code. So this will be renamed as above. Thanks, Shameer
Hi Shameer,

On 11/8/24 13:52, Shameer Kolothum wrote:
> [... cover letter and example command line snipped ...]

This kind of instantiation matches what I had in mind. It is questionable
whether the legacy SMMU shouldn't be migrated to that mode too (instead of
using a machine option setting), depending on Peter's feedback and also
comments from the libvirt guys. Adding Andrea in the loop.

Thanks
Eric
On Mon, Nov 18, 2024 at 11:50:46AM +0100, Eric Auger wrote:
> [...]
> This kind of instantiation matches what I had in mind. It is questionable
> whether the legacy SMMU shouldn't be migrated to that mode too (instead of
> using a machine option setting), depending on Peter's feedback and also
> comments from the libvirt guys. Adding Andrea in the loop.

Yeah, looking at the current config I'm pretty surprised to see it
configured with '-machine virt,iommu=smmuv3', where 'smmuv3' is a type
name. This is effectively a back-door reinvention of the '-device' arg.

I think it'd make more sense to deprecate the 'iommu' property on the
machine, and allow '-device smmuv3,pci-bus=pcie.0' to associate the IOMMU
with the PCI root bus, so we have consistent approaches for all SMMU impls.

With regards,
Daniel
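[Editor's note: for illustration only, a minimal sketch of the two styles
being contrasted here. The '-device' line reuses this RFC's 'pci-bus'
property as Daniel suggests; whether the RFC device can actually attach to
the root bus today is an open question, and the final device/property names
may differ.]

# today: one machine-wide vSMMU, selected via a machine property
-machine virt,gic-version=3,iommu=smmuv3

# suggested: instantiate the SMMU like any other device and tie it to a bus
-machine virt,gic-version=3 \
-device arm-smmuv3-nested,pci-bus=pcie.0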
On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum wrote:
> This RFC is for initial discussion/test purposes only and includes patches
> that are only relevant for adding the "arm-smmuv3-nested" support. For the
> complete branch please find,
> https://github.com/hisilicon/qemu/commits/private-smmuv3-nested-dev-rfc-v1/

I guess the QEMU branch above pairs with this (vIOMMU v6)?
https://github.com/nicolinc/iommufd/commits/smmuv3_nesting-with-rmr

Thanks
Nicolin
> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, November 13, 2024 9:43 PM
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum wrote:
> > [...]
>
> I guess the QEMU branch above pairs with this (vIOMMU v6)?
> https://github.com/nicolinc/iommufd/commits/smmuv3_nesting-with-rmr

I actually based it on top of a kernel branch that Zhangfei is keeping for
his verification tests,
https://github.com/Linaro/linux-kernel-uadk/commits/6.12-wip-10.26/

But yes, it does indeed look to be based on the branch you mentioned above.

Thanks,
Shameer.
Hi Shameer,

On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum via wrote:
> Hi,
>
> This series adds initial support for a user-creatable "arm-smmuv3-nested"
> device to Qemu. At present the Qemu ARM SMMUv3 emulation is per machine
> and cannot support multiple SMMUv3s.
> [...]

I had a quick look at the SMMUv3 files: now that SMMUv3 supports nested
translation emulation, would it make sense to rename this? AFAIU, this is
about a virt (stage-1) SMMUv3 that is emulated to the guest. Including
vSMMU or virt in the name would help distinguish the code, as new functions
such as smmu_nested_realize() otherwise look confusing.

Thanks,
Mostafa
Hi Mostafa,

> -----Original Message-----
> From: Mostafa Saleh <smostafa@google.com>
> Sent: Wednesday, November 13, 2024 4:17 PM
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> [...]
> I had a quick look at the SMMUv3 files: now that SMMUv3 supports nested
> translation emulation, would it make sense to rename this? AFAIU, this is
> about a virt (stage-1) SMMUv3 that is emulated to the guest. Including
> vSMMU or virt in the name would help distinguish the code, as new functions
> such as smmu_nested_realize() otherwise look confusing.

Yes, I have noticed that. We need to call it something else to avoid the
confusion. I am not sure including "virt" is a good idea, as it may suggest
the virt machine. Probably "acc", as Nicolin suggested, to indicate hw
accelerated. I will think about a better one. Open to suggestions.

Thanks,
Shameer
Hi Shameer,

On Thu, Nov 14, 2024 at 08:01:28AM +0000, Shameerali Kolothum Thodi wrote:
> [...]
> Yes, I have noticed that. We need to call it something else to avoid the
> confusion. I am not sure including "virt" is a good idea, as it may suggest
> the virt machine. Probably "acc", as Nicolin suggested, to indicate hw
> accelerated. I will think about a better one. Open to suggestions.

"acc" sounds good to me. Also, if possible, we could have a smmuv3-acc.c
that holds all the accelerator-specific logic, with the main file just
calling into it.

Thanks,
Mostafa
On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum wrote:
> Few ToDos to note,
> 1. At present default-bus-bypass-iommu=on should be set when
>    arm-smmuv3-nested dev is specified. Otherwise you may get an IORT
>    related boot error. Requires fixing.
> 2. Hot adding a device is not working at the moment. Looks like pcihp irq issue.
>    Could be a bug in IORT id mappings.

Do we have enough bus number space for each pxb bus in IORT?

The bus range is defined by min_/max_bus in iort_host_bridges(), where the
pci_bus_range() function call might not leave enough space in the range for
hotplugs IIRC.

> [... example command line and guest PCI topology snipped ...]
> At present Qemu is not doing any extra validation other than the above
> failure to make sure the user configuration is correct or not. The
> assumption is libvirt will take care of this.

Nathan from the NVIDIA side is working on the libvirt part, and he has
already done some prototype coding in libvirt that can generate the
required PCI topology. I think he can take these patches for a combined
test.

Thanks
Nicolin
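[Editor's note: on the bus-number question, one way to experiment is simply
to space the pxb-pcie bus_nr values further apart so that each expander's
bus range has headroom for hot-plugged root ports. The values below are
illustrative only and not part of the original series.]

-device pxb-pcie,id=pcie.1,bus_nr=32,bus=pcie.0 \
-device arm-smmuv3-nested,id=smmuv1,pci-bus=pcie.1 \
-device pxb-pcie,id=pcie.2,bus_nr=64,bus=pcie.0 \
-device arm-smmuv3-nested,id=smmuv2,pci-bus=pcie.2 \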
Hi Shameer,

> Attempt to add the HNS VF to a different SMMUv3 will result in,
>
> -device vfio-pci,host=0000:7d:02.2,bus=pcie.port3,iommufd=iommufd0: Unable to attach viommu
> -device vfio-pci,host=0000:7d:02.2,bus=pcie.port3,iommufd=iommufd0: vfio 0000:7d:02.2:
>  Failed to set iommu_device: [iommufd=29] error attach 0000:7d:02.2 (38) to id=11: Invalid argument
>
> At present Qemu is not doing any extra validation other than the above
> failure to make sure the user configuration is correct or not. The
> assumption is libvirt will take care of this.

Would you be able to elaborate on what Qemu is validating with this error
message? I'm not seeing these errors when assigning a GPU's pcie-root-port
to different PXBs (with different associated SMMU nodes).

I launched a VM using my libvirt prototype code + your qemu branch and
noted a few small things:

1. Are there plans to support "-device addr" for arm-smmuv3-nested's
   PCIe slot and function like any other device? If not, I'll exclude it
   from my libvirt prototype.

2. Is "id" for "-device arm-smmuv3-nested" necessary for any sort of
   functionality? If so, I'll make a change to my libvirt prototype to
   support this. I was able to boot a VM and see a similar VM PCI topology
   as your example without specifying "id".

Otherwise, the VM topology looks OK with your qemu branch + my libvirt
prototype.

Also, as a heads up, I've added support for auto-inserting a PCIe switch
between the PXB and GPUs in libvirt to attach multiple devices to a SMMU
node, per libvirt's documentation: "If you intend to plug multiple devices
into a pcie-expander-bus, you must connect a pcie-switch-upstream-port to
the pcie-root-port that is plugged into the pcie-expander-bus, and multiple
pcie-switch-downstream-ports to the pcie-switch-upstream-port". Future
unit-tests should follow this topology configuration.

Thanks,
Nathan
Hi Shameer,

Could you share the branch/version of the boot firmware file "QEMU_EFI.fd"
from your example, and where you retrieved it from? I've been encountering
PCI host bridge resource conflicts whenever assigning more than one
passthrough device to a multi-vSMMU VM, booting with the boot firmware
provided by qemu-efi-aarch64 version 2024.02-2. This prevents the VM from
booting, eventually dropping into the UEFI shell with an error message
indicating DMA mapping failed for the passthrough devices.

Thanks,
Nathan
> with an error message indicating DMA mapping failed for the
> passthrough devices.
A correction - the message indicates UEFI failed to find a mapping for
the boot partition ("map: no mapping found"), not that DMA mapping
failed. But earlier EDK debug logs still show PCI host bridge resource
conflicts for the passthrough devices that seem related to the VM boot
failure.
Hi Nathan,
> -----Original Message-----
> From: Nathan Chen <nathanc@nvidia.com>
> Sent: Friday, December 13, 2024 1:02 AM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; jgg@nvidia.com;
> ddutile@redhat.com; Linuxarm <linuxarm@huawei.com>; Wangzhou (B)
> <wangzhou1@hisilicon.com>; jiangkunkun <jiangkunkun@huawei.com>;
> Jonathan Cameron <jonathan.cameron@huawei.com>;
> zhangfei.gao@linaro.org; Nicolin Chen <nicolinc@nvidia.com>
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
>
> >with an error message indicating DMA mapping failed for the
> passthrough >devices.
>
> A correction - the message indicates UEFI failed to find a mapping for
> the boot partition ("map: no mapping found"), not that DMA mapping
> failed. But earlier EDK debug logs still show PCI host bridge resource
> conflicts for the passthrough devices that seem related to the VM boot
> failure.
I have tried a 2023 version of the EFI, which works. For more recent tests I am
using one built directly from,
https://github.com/tianocore/edk2.git master
Commit: 0f3867fa6ef0 ("UefiPayloadPkg/UefiPayloadEntry: Fix PT protection
in 5 level paging")
With both, I don't remember seeing any boot failure or the above UEFI-related
"map: no mapping found" error. But the Guest kernel at times
complains about pci bridge window memory assignment failures.
...
pci 0000:10:01.0: bridge window [mem size 0x00200000 64bit pref]: can't assign; no space
pci 0000:10:01.0: bridge window [mem size 0x00200000 64bit pref]: failed to assign
pci 0000:10:00.0: bridge window [io size 0x1000]:can't assign; no space
...
But Guest still boots and worked fine so far.
Thanks,
Shameer
>> >with an error message indicating DMA mapping failed for the
>> passthrough >devices.
>>
>> A correction - the message indicates UEFI failed to find a mapping for
>> the boot partition ("map: no mapping found"), not that DMA mapping
>> failed. But earlier EDK debug logs still show PCI host bridge resource
>> conflicts for the passthrough devices that seem related to the VM boot
>> failure.
>
> I have tried a 2023 version EFI which works. And for more recent tests I am
> using a one built directly from,
> https://github.com/tianocore/edk2.git master
>
> Commit: 0f3867fa6ef0("UefiPayloadPkg/UefiPayloadEntry: Fix PT protection
> in 5 level paging"
>
> With both, I don’t remember seeing any boot failure and the above UEFI
> related "map: no mapping found" error. But the Guest kernel at times
> complaints about pci bridge window memory assignment failures.
> ...
> pci 0000:10:01.0: bridge window [mem size 0x00200000 64bit pref]: can't assign; no space
> pci 0000:10:01.0: bridge window [mem size 0x00200000 64bit pref]: failed to assign
> pci 0000:10:00.0: bridge window [io size 0x1000]:can't assign; no space
> ...
>
> But Guest still boots and worked fine so far.
Hi Shameer,
Just letting you know I resolved this by increasing the MMIO region size
in hw/arm/virt.c to support passing through GPUs with large BAR regions
(VIRT_HIGH_PCIE_MMIO). Thanks for taking a look.
Thanks,
Nathan
> -----Original Message-----
> From: Nathan Chen <nathanc@nvidia.com>
> Sent: Saturday, January 25, 2025 2:44 AM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: ddutile@redhat.com; eric.auger@redhat.com; jgg@nvidia.com;
> jiangkunkun <jiangkunkun@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>;
> nathanc@nvidia.com; nicolinc@nvidia.com; peter.maydell@linaro.org;
> qemu-arm@nongnu.org; Wangzhou (B) <wangzhou1@hisilicon.com>;
> zhangfei.gao@linaro.org; qemu-devel@nongnu.org
> Subject: RE: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> >> >with an error message indicating DMA mapping failed for the
> >> passthrough >devices.
> >>
> >> A correction - the message indicates UEFI failed to find a mapping for
> >> the boot partition ("map: no mapping found"), not that DMA mapping
> >> failed. But earlier EDK debug logs still show PCI host bridge resource
> >> conflicts for the passthrough devices that seem related to the VM boot
> >> failure.
> >
> > I have tried a 2023 version EFI which works. And for more recent tests I
> am
> > using a one built directly from,
> > https://github.com/tianocore/edk2.git master
> >
> > Commit: 0f3867fa6ef0("UefiPayloadPkg/UefiPayloadEntry: Fix PT
> protection
> > in 5 level paging"
> >
> > With both, I don’t remember seeing any boot failure and the above UEFI
> > related "map: no mapping found" error. But the Guest kernel at times
> > complaints about pci bridge window memory assignment failures.
> > ...
> > pci 0000:10:01.0: bridge window [mem size 0x00200000 64bit pref]: can't
> assign; no space
> > pci 0000:10:01.0: bridge window [mem size 0x00200000 64bit pref]: failed
> to assign
> > pci 0000:10:00.0: bridge window [io size 0x1000]:can't assign; no space
> > ...
> >
> > But Guest still boots and worked fine so far.
>
> Hi Shameer,
>
> Just letting you know I resolved this by increasing the MMIO region size
> in hw/arm/virt.c to support passing through GPUs with large BAR regions
> (VIRT_HIGH_PCIE_MMIO). Thanks for taking a look.
>
Ok. Thanks for that. Does that mean an optional property to specify
the size of VIRT_HIGH_PCIE_MMIO may be worth adding?
And for the PCI bridge window specific errors that I mentioned above,
>>pci 0000:10:01.0: bridge window [mem size 0x00200000 64bit pref]: can't assign; no space
adding ""mem-reserve=X" and "io-reserve=X" to pcie-root-port helps.
Thanks,
Shameer
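[Editor's note: as a quick illustration of the workaround above,
pcie-root-port exposes resource-reservation properties that can be set on
the command line. The sizes shown are only examples and would need tuning
to the devices actually being passed through.]

-device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1,io-reserve=4k,mem-reserve=2M,pref64-reserve=256M \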
>>>> >with an error message indicating DMA mapping failed for the
>>>> passthrough >devices.
>>>>
>>>> A correction - the message indicates UEFI failed to find a mapping for
>>>> the boot partition ("map: no mapping found"), not that DMA mapping
>>>> failed. But earlier EDK debug logs still show PCI host bridge resource
>>>> conflicts for the passthrough devices that seem related to the VM boot
>>>> failure.
>>>
>>> I have tried a 2023 version EFI which works. And for more recent tests I
>> am
>>> using a one built directly from,
>>> https://github.com/tianocore/edk2.git master
>>>
>>> Commit: 0f3867fa6ef0("UefiPayloadPkg/UefiPayloadEntry: Fix PT
>> protection
>>> in 5 level paging"
>>>
>>> With both, I don’t remember seeing any boot failure and the above UEFI
>>> related "map: no mapping found" error. But the Guest kernel at times
>>> complaints about pci bridge window memory assignment failures.
>>> ...
>>> pci 0000:10:01.0: bridge window [mem size 0x00200000 64bit pref]: can't
>> assign; no space
>>> pci 0000:10:01.0: bridge window [mem size 0x00200000 64bit pref]: failed
>> to assign
>>> pci 0000:10:00.0: bridge window [io size 0x1000]:can't assign; no space
>>> ...
>>>
>>> But Guest still boots and worked fine so far.
>>
>> Hi Shameer,
>>
>> Just letting you know I resolved this by increasing the MMIO region size
>> in hw/arm/virt.c to support passing through GPUs with large BAR regions
>> (VIRT_HIGH_PCIE_MMIO). Thanks for taking a look.
>>
>
> Ok. Thanks for that. Does that mean may be an optional property to specify
> the size for VIRT_HIGH_PCIE_MMIO is worth adding?
Yes, and actually we have a patch ready for the configurable highmem
region size. Matt Ochs will send it out in the next day or so and CC you
on the submission.
> adding ""mem-reserve=X" and "io-reserve=X" to pcie-root-port helps
Ok, good to know - I'll keep that in mind for future testing.
Thanks,
Nathan
Hi Nathan,

> -----Original Message-----
> From: Nathan Chen <nathanc@nvidia.com>
> Sent: Wednesday, November 20, 2024 11:59 PM
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> > Attempt to add the HNS VF to a different SMMUv3 will result in,
> > [...]
> > At present Qemu is not doing any extra validation other than the above
> > failure to make sure the user configuration is correct or not. The
> > assumption is libvirt will take care of this.
>
> Would you be able to elaborate on what Qemu is validating with this error
> message? I'm not seeing these errors when assigning a GPU's
> pcie-root-port to different PXBs (with different associated SMMU nodes).

You should see that error when two devices that belong to two different
physical SMMUv3s in the host kernel are assigned to a single PXB/SMMUv3 for
the Guest. Something like,

-device pxb-pcie,id=pcie.1,bus_nr=8,bus=pcie.0 \
-device arm-smmuv3-nested,id=smmuv1,pci-bus=pcie.1 \
-device pcie-root-port,id=pcie.port1,bus=pcie.1,chassis=1 \
-device pcie-root-port,id=pcie.port2,bus=pcie.1,chassis=1 \
-device vfio-pci,host=0000:7d:02.1,bus=pcie.port1,iommufd=iommufd0 \ --> This device belongs to phys SMMUv3_0
-device vfio-pci,host=0000:75:02.1,bus=pcie.port2,iommufd=iommufd0 \ --> This device belongs to phys SMMUv3_1

So the assumption above is that libvirt will be able to detect which devices
belong to the same physical SMMUv3 and do the assignment for Guests
correctly.

> I launched a VM using my libvirt prototype code + your qemu branch and
> noted a few small things:

Thanks for giving this a spin with libvirt.

> 1. Are there plans to support "-device addr" for arm-smmuv3-nested's
>    PCIe slot and function like any other device? If not, I'll exclude it
>    from my libvirt prototype.

Not at the moment. arm-smmuv3-nested is not making any use of PCI slot and
func info specifically. I am not sure how that would be useful for this,
though.

> 2. Is "id" for "-device arm-smmuv3-nested" necessary for any sort of
>    functionality? If so, I'll make a change to my libvirt prototype to
>    support this. I was able to boot a VM and see a similar VM PCI topology
>    as your example without specifying "id".

Yes, "id" is not used, and it will work without it.

> Otherwise, the VM topology looks OK with your qemu branch + my libvirt
> prototype.

That is good to know.
> Also as a heads up, I've added support for auto-inserting a PCIe switch
> between the PXB and GPUs in libvirt to attach multiple devices to a SMMU
> node, per libvirt's documentation: "If you intend to plug multiple
> devices into a pcie-expander-bus, you must connect a
> pcie-switch-upstream-port to the pcie-root-port that is plugged into the
> pcie-expander-bus, and multiple pcie-switch-downstream-ports to the
> pcie-switch-upstream-port". Future unit-tests should follow this
> topology configuration.

Ok. Could you please give me an example Qemu equivalent command option, if
possible, for the above case? I am not that familiar with libvirt and I
would also like to test the above scenario.

Thanks,
Shameer
>> Also as a heads up, I've added support for auto-inserting a PCIe switch
>> between the PXB and GPUs in libvirt to attach multiple devices to a SMMU
>> node, per libvirt's documentation [...]
>
> Ok. Could you please give me an example Qemu equivalent command option,
> if possible, for the above case? I am not that familiar with libvirt and I
> would also like to test the above scenario.

You can use "-device x3130-upstream" for the upstream switch port, and
"-device xio3130-downstream" for the downstream port:

-device pxb-pcie,bus_nr=250,id=pci.1,bus=pcie.0,addr=0x1 \
-device pcie-root-port,id=pci.2,bus=pci.1,addr=0x0 \
-device x3130-upstream,id=pci.3,bus=pci.2,addr=0x0 \
-device xio3130-downstream,id=pci.4,bus=pci.3,addr=0x0,chassis=17,port=1 \
-device vfio-pci,host=0009:01:00.0,id=hostdev0,bus=pci.4,addr=0x0 \
-device arm-smmuv3-nested,pci-bus=pci.1

-Nathan
> -----Original Message-----
> From: Nathan Chen <nathanc@nvidia.com>
> Sent: Friday, November 22, 2024 1:42 AM
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> [...]
> You can use "-device x3130-upstream" for the upstream switch port, and
> "-device xio3130-downstream" for the downstream port:
>
> -device pxb-pcie,bus_nr=250,id=pci.1,bus=pcie.0,addr=0x1 \
> -device pcie-root-port,id=pci.2,bus=pci.1,addr=0x0 \
> -device x3130-upstream,id=pci.3,bus=pci.2,addr=0x0 \
> -device xio3130-downstream,id=pci.4,bus=pci.3,addr=0x0,chassis=17,port=1 \
> -device vfio-pci,host=0009:01:00.0,id=hostdev0,bus=pci.4,addr=0x0 \
> -device arm-smmuv3-nested,pci-bus=pci.1

Thanks. Just wondering why libvirt mandates usage of a pcie-switch for
plugging multiple devices rather than just using pcie-root-ports?

Please let me know if there is any advantage in doing so that you are
aware of.

Thanks,
Shameer
On Fri, Nov 22, 2024 at 05:38:54PM +0000, Shameerali Kolothum Thodi via wrote:
> [...]
> Thanks. Just wondering why libvirt mandates usage of a pcie-switch for
> plugging multiple devices rather than just using pcie-root-ports?

Libvirt does not require use of pcie-switch. It supports them, but in the
absence of app-requested configs, libvirt will always just populate
pcie-root-port devices. Switches are something that has to be explicitly
asked for, and I don't see much need to do that.

With regards,
Daniel
On Fri, Dec 13, 2024 at 11:58:02AM +0000, Daniel P. Berrangé wrote:
> Libvirt does not require use of pcie-switch. It supports them, but in the
> absence of app-requested configs, libvirt will always just populate
> pcie-root-port devices. Switches are something that has to be explicitly
> asked for, and I don't see much need to do that.

If you are assigning all VFIO devices within a multi-device iommu group
there are good reasons to show the switch, and the switch has to reflect
certain ACS properties. We have some systems like this..

Jason
>> Also as a heads up, I've added support for auto-inserting a PCIe switch
>> between the PXB and GPUs in libvirt to attach multiple devices to a SMMU
>> node, per libvirt's documentation [...]
>
> Thanks. Just wondering why libvirt mandates usage of a pcie-switch for
> plugging multiple devices rather than just using pcie-root-ports?
>
> Please let me know if there is any advantage in doing so that you are
> aware of.

Actually it seems like that documentation I quoted is out of date. That
section of the documentation for pcie-expander-bus was written before a
patch that revised libvirt's pxb to have 32 slots instead of just 1 slot,
and it wasn't updated afterwards.

With your branch and my libvirt prototype, I was still able to attach a
passthrough device behind a PCIe switch and see it attached to a vSMMU in
the VM, so I'm not sure if you need to make additional changes to your
solution to support this. But I think we should still support/test the case
where VFIO devices are behind a switch, otherwise we're placing a
limitation on end users who have a use case for it.

-Nathan
Hi Nathan,

On 11/22/24 7:53 PM, Nathan Chen wrote:
> [...]
> Actually it seems like that documentation I quoted is out of date. That
> section of the documentation for pcie-expander-bus was written before a
> patch that revised libvirt's pxb to have 32 slots instead of just 1 slot,
> and it wasn't updated afterwards.

You mean the QEMU documentation in qemu/docs/pcie.txt (esp. the PCI Express
only hierarchy)?

Thanks

Eric
> -----Original Message-----
> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Tuesday, November 12, 2024 11:00 PM
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> On Fri, Nov 08, 2024 at 12:52:37PM +0000, Shameer Kolothum wrote:
> > Few ToDos to note,
> > 1. At present default-bus-bypass-iommu=on should be set when
> >    arm-smmuv3-nested dev is specified. Otherwise you may get an IORT
> >    related boot error. Requires fixing.
> > 2. Hot adding a device is not working at the moment. Looks like pcihp irq issue.
> >    Could be a bug in IORT id mappings.
>
> Do we have enough bus number space for each pxb bus in IORT?
>
> The bus range is defined by min_/max_bus in iort_host_bridges(), where
> the pci_bus_range() function call might not leave enough space in the
> range for hotplugs IIRC.

Ok. Thanks for the pointer. I will debug that.

> [...]
> Nathan from the NVIDIA side is working on the libvirt part, and he has
> already done some prototype coding in libvirt that can generate the
> required PCI topology. I think he can take these patches for a combined
> test.

Cool. That's good to know.

Thanks,
Shameer