> -----Original Message-----
> From: Nathan Chen <nathanc@nvidia.com>
> Sent: Saturday, January 25, 2025 2:44 AM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: ddutile@redhat.com; eric.auger@redhat.com; jgg@nvidia.com;
> jiangkunkun <jiangkunkun@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>;
> nathanc@nvidia.com; nicolinc@nvidia.com; peter.maydell@linaro.org;
> qemu-arm@nongnu.org; Wangzhou (B) <wangzhou1@hisilicon.com>;
> zhangfei.gao@linaro.org; qemu-devel@nongnu.org
> Subject: RE: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> >> > with an error message indicating DMA mapping failed for the
> >> > passthrough devices.
> >>
> >> A correction - the message indicates UEFI failed to find a mapping for
> >> the boot partition ("map: no mapping found"), not that DMA mapping
> >> failed. But earlier EDK debug logs still show PCI host bridge resource
> >> conflicts for the passthrough devices that seem related to the VM boot
> >> failure.
> >
> > I have tried a 2023 version of EFI, which works. And for more recent
> > tests I am using one built directly from
> > https://github.com/tianocore/edk2.git master,
> >
> > Commit: 0f3867fa6ef0 ("UefiPayloadPkg/UefiPayloadEntry: Fix PT
> > protection in 5 level paging")
> >
> > With both, I don't remember seeing any boot failure or the above
> > UEFI-related "map: no mapping found" error. But the Guest kernel at
> > times complains about PCI bridge window memory assignment failures:
> > ...
> > pci 0000:10:01.0: bridge window [mem size 0x00200000 64bit pref]: can't assign; no space
> > pci 0000:10:01.0: bridge window [mem size 0x00200000 64bit pref]: failed to assign
> > pci 0000:10:00.0: bridge window [io size 0x1000]: can't assign; no space
> > ...
> >
> > But the Guest still boots and has worked fine so far.
>
> Hi Shameer,
>
> Just letting you know I resolved this by increasing the MMIO region size
> (VIRT_HIGH_PCIE_MMIO) in hw/arm/virt.c to support passing through GPUs
> with large BAR regions. Thanks for taking a look.

Ok. Thanks for that. Does that mean an optional property to specify the
size of VIRT_HIGH_PCIE_MMIO is worth adding?
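At the moment that window is a build-time constant in the extended
memory map in hw/arm/virt.c. Roughly (a simplified excerpt; the base
addresses are filled in at runtime when the map is laid out, and the
exact size may differ between QEMU versions):

  static MemMapEntry extended_memmap[] = {
      /* Additional 64 MB redist region (can contain up to 512 redistributors) */
      [VIRT_HIGH_GIC_REDIST2] = { 0x0, 64 * MiB },
      [VIRT_HIGH_PCIE_ECAM]   = { 0x0, 256 * MiB },
      /* Second PCIe window */
      [VIRT_HIGH_PCIE_MMIO]   = { 0x0, 512 * GiB },
  };

A machine property to scale that last entry (a name like
"highmem-mmio-size" comes to mind, just as a suggestion) would avoid
everyone carrying a local patch for large-BAR GPUs.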

And for the PCI bridge window specific errors that I mentioned above,

>> pci 0000:10:01.0: bridge window [mem size 0x00200000 64bit pref]: can't assign; no space

adding "mem-reserve=X" and "io-reserve=X" to pcie-root-port helps.
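For example, something like the below (the IDs, sizes and host BDF are
only placeholders; pick reserve values large enough to cover the BARs
behind each port):

  -device pcie-root-port,id=port1,bus=pcie.0,chassis=1,slot=1,io-reserve=4K,mem-reserve=512M,pref64-reserve=8G
  -device vfio-pci,host=0000:75:00.0,bus=port1

The reserve properties tell the firmware how much window space to leave
on the root port, so the guest kernel doesn't run out of space when it
assigns the bridge windows later.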

Thanks,
Shameer


> -----Original Message-----
> From: Daniel P. Berrangé <berrange@redhat.com>
> Sent: Thursday, February 6, 2025 2:47 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: qemu-arm@nongnu.org; qemu-devel@nongnu.org;
> eric.auger@redhat.com; peter.maydell@linaro.org; jgg@nvidia.com;
> nicolinc@nvidia.com; ddutile@redhat.com; Linuxarm
> <linuxarm@huawei.com>; Wangzhou (B) <wangzhou1@hisilicon.com>;
> jiangkunkun <jiangkunkun@huawei.com>; Jonathan Cameron
> <jonathan.cameron@huawei.com>; zhangfei.gao@linaro.org;
> nathanc@nvidia.com
> Subject: Re: [RFC PATCH 0/5] hw/arm/virt: Add support for user-creatable
> nested SMMUv3
>
> On Thu, Feb 06, 2025 at 01:51:15PM +0000, Shameerali Kolothum Thodi wrote:
> > Hmm.. I don't think just swapping the order will change the association
> > with the Guest SMMU here. Because we have:
> >
> >   -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2
> >
> > During smmuv3-accel realize time, this will result in:
> >
> >   pci_setup_iommu(primary_bus, ops, smmu_state);
> >
> > And when the vfio dev realization happens:
> >
> >   set_iommu_device()
> >     smmu_dev_set_iommu_device(bus, smmu_state, ...)
> >
> > --> this is where the guest smmuv3 --> host smmuv3 association is first
> > established. And any further vfio dev added to this Guest SMMU will
> > only succeed if it belongs to the same phys SMMU.
> >
> > i.e., the Guest SMMU to PCI bus association actually makes sure you
> > have the same Guest SMMU for the device.
>
> Ok, so at the time of VFIO device realize, QEMU is telling the kernel
> to associate a physical SMMU, and it's doing this with the virtual
> SMMU attached to the PXB parenting the VFIO device.
>
> > smmuv2 --> pcie.2 --> (pxb-pcie, numa_id = 1)
> > 0000:dev2 --> pcie.port2 --> pcie.2 --> smmuv2 (pxb-pcie, numa_id = 1)
> >
> > Hence the association of 0000:dev2 to Guest SMMUv2 remains the same.
>
> Yes, I concur the SMMU physical <-> virtual association should
> be fixed, as long as the same VFIO device is always added to
> the same virtual SMMU.
>
> > I hope this is clear. And I am not sure the association will be broken
> > in any other way unless the QEMU CLI specifies the dev on a different
> > PXB.
>
> Although the ordering is at least predictable, I remain uncomfortable
> about the idea of the virtual SMMU association with the physical SMMU
> being a side effect of the VFIO device placement.
>
> There is still the open door for admin mis-configuration that will not
> be diagnosed. e.g. consider we attached VFIO device 1 from host NUMA
> node 1 to a PXB associated with host NUMA node 0. As long as that's
> the first VFIO device, the kernel will happily associate the physical
> and guest SMMUs.

Yes. A mis-configuration can place it on a wrong one.
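For reference, the kind of topology we are discussing looks roughly
like this on the QEMU command line (bus numbers, IDs and the host BDF
below are placeholders from a test setup, not a recommendation):

  -object iommufd,id=iommufd0
  -device pxb-pcie,id=pcie.2,bus_nr=32,bus=pcie.0,numa_node=1
  -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2
  -device pcie-root-port,id=pcie.port2,bus=pcie.2,chassis=2
  -device vfio-pci,host=0000:75:00.1,bus=pcie.port2,iommufd=iommufd0

With the current series it is the first vfio-pci realized behind pcie.2
that pins smmuv2 to whatever physical SMMU sits above that host device,
which is exactly the side effect being questioned here.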

> If we set the physical/guest SMMU relationship directly, then at the
> time the VFIO device is plugged, we can diagnose the incorrectly
> placed VFIO device, and better reason about behaviour.

Agree.

> I've another question about unplug behaviour..
>
> 1. Plug a VFIO device for host SMMU 1 into a PXB with guest SMMU 1.
>    => Kernel associates host SMMU 1 and guest SMMU 1 together
> 2. Unplug this VFIO device
> 3. Plug a VFIO device for host SMMU 2 into a PXB with guest SMMU 1.
>
> Does the host/guest SMMU 1 <-> 1 association remain set after step 2,
> implying step 3 will fail? Or does it get unset, allowing step 3 to
> succeed and establish a new mapping of host SMMU 2 to guest SMMU 1?

At the moment the first association is not persistent, so a new mapping
is possible.
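In rough pseudo-C, the lifecycle today behaves something like this (a
simplified sketch of my understanding, NOT the actual series code; all
names here are made up):

  #include <stdbool.h>

  /* First VFIO device pins the physical SMMU, later ones must match,
   * and unplugging the last one drops the binding (not persistent). */
  typedef struct SMMUv3AccelState {
      void *phys_smmu;      /* host SMMU this vSMMU is bound to, or NULL */
      int   vfio_dev_count; /* VFIO devices currently attached */
  } SMMUv3AccelState;

  static bool smmu_accel_attach_dev(SMMUv3AccelState *s, void *dev_phys_smmu)
  {
      if (s->phys_smmu && s->phys_smmu != dev_phys_smmu) {
          return false;             /* device is under a different host SMMU */
      }
      s->phys_smmu = dev_phys_smmu; /* first device establishes the binding */
      s->vfio_dev_count++;
      return true;
  }

  static void smmu_accel_detach_dev(SMMUv3AccelState *s)
  {
      if (s->vfio_dev_count > 0 && --s->vfio_dev_count == 0) {
          s->phys_smmu = NULL;      /* last unplug clears the binding */
      }
  }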

> If step 2 does NOT break the association, do we preserve that
> across a savevm+loadvm sequence of QEMU? If we don't, then step
> 3 would fail before the savevm, but succeed after the loadvm.

Right. I haven't attempted migration tests yet. But I agree that an
explicit association is better for migration compatibility. Also,
I am not sure how we handle it if the target has a different phys
SMMUv3 <--> dev mapping.

> Explicitly representing the host SMMU association on the guest SMMU
> config makes this behaviour unambiguous. The host / guest SMMU
> relationship is fixed for the lifetime of the VM and invariant of
> whatever VFIO device is (or was previously) plugged.
>
> So I still go back to my general principle that automatic side effects
> are an undesirable idea in QEMU configuration. We have a long tradition
> of making everything entirely explicit to produce easily predictable
> behaviour.

Ok. Convinced 😊. Thanks for explaining.
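For the record, what I would picture for the explicit variant is
something along these lines (the "host-smmu" property name and its
value are purely placeholders, not a syntax proposal):

  -device arm-smmuv3-accel,id=smmuv2,bus=pcie.2,host-smmu=smmu3.0x0000000012000000

That way a VFIO device plugged behind pcie.2 that sits under a
different physical SMMU on the host can be rejected at realize time
with a clear error, and the association survives unplug and migration
unambiguously.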

Shameer