In [1] we attempted to fix a case where a VFIO-PCI device protected
with a virtio-iommu was assigned to an x86 guest. On x86 the physical
IOMMU may have an address width (gaw) of 39 or 48 bits whereas the
virtio-iommu used to expose a 64b address space by default. Hence the
guest was trying to use the full 64b space and we hit DMA MAP failures.
To work around this issue we managed to pass usable IOVA regions
(excluding the out-of-range space) from VFIO to the virtio-iommu device.
This was made feasible by introducing a new IOMMU Memory Region callback
dubbed iommu_set_iova_regions(). The latter gets called when the IOMMU
MR is enabled, which causes vfio_listener_region_add() to be called.

However, with VFIO-PCI hotplug this technique fails due to the race
between the call to the callback in the add memory listener and the
virtio-iommu probe request. Indeed the probe request gets called before
the attach to the domain, so in that case the usable regions are
communicated after the probe request and fail to be conveyed to the
guest. To be honest the problem was hinted at by Jean-Philippe in [1]
and I should have been more careful in listening to him and testing
with hotplug :-(

For coldplugged devices the technique works because we make sure all
the IOMMU MRs are enabled once on machine init done, as done by
94df5b2180 ("virtio-iommu: Fix 64kB host page size VFIO device
assignment") for the granule freeze. But I would be keen to get rid of
this trick.

Using an IOMMU MR Ops is impractical because it relies on the IOMMU MR
having been enabled and the corresponding vfio_listener_region_add()
having been executed. Instead this series proposes to replace the usage
of this API by the recently introduced PCIIOMMUOps: ba7d12eb8c ("hw/pci:
modify pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback
can be called earlier, once the usable IOVA regions have been collected
by VFIO, without the need for the IOMMU MR to be enabled.

This looks cleaner. In the short term this may also be used for passing
the page size mask, which would allow getting rid of the hacky transient
IOMMU MR enablement mentioned above.

[1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
    https://lore.kernel.org/all/20231019134651.842175-1-eric.auger@redhat.com/

[2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/

Extra Notes:
With that series, the reserved memory regions are communicated in time
for the virtio-iommu probe request to grab them. However this is not
sufficient. In some cases (my case), I still see some DMA MAP failures
and the guest keeps on using IOVA ranges outside the geometry of the
physical IOMMU. This is due to the fact that the VFIO-PCI device is in
the same iommu group as the pcie root port. Normally the kernel's
iova_reserve_iommu_regions() (dma-iommu.c) is supposed to call
reserve_iova() for each reserved IOVA, which carves them out of the
allocator. When iommu_dma_init_domain() gets called for the hotplugged
vfio-pci device the iova domain is already allocated and set, and we
don't call iova_reserve_iommu_regions() again for the vfio-pci device.
So its corresponding reserved regions are not properly taken into
account.

This is not trivial to fix because theoretically the first attached
device could already have allocated IOVAs within the reserved regions
of the second device. Also we are somehow hijacking the reserved memory
regions to model the geometry of the physical IOMMU, so I am not sure
any attempt to fix that upstream would be accepted.
At the moment one solution is to make sure assigned devices end up in a
singleton group. Another solution is to work on a different approach
where the gaw can be passed as an option to the virtio-iommu device,
similarly to what is done with the intel iommu.

This series can be found at:
https://github.com/eauger/qemu/tree/hotplug-resv-rfc


Eric Auger (7):
  hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions
  hw/pci: Introduce pci_device_iommu_bus
  vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps
  virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
  virtio-iommu: Remove the implementation of iommu_set_iova_ranges
  hw/vfio: Remove memory_region_iommu_set_iova_ranges() call
  memory: Remove IOMMU MR iommu_set_iova_range API

 include/exec/memory.h    |  32 -------
 include/hw/pci/pci.h     |  16 ++++
 hw/pci/pci.c             |  16 ++++
 hw/vfio/common.c         |  10 --
 hw/vfio/pci.c            |  27 ++++++
 hw/virtio/virtio-iommu.c | 201 ++++++++++++++++++++-------------------
 system/memory.c          |  13 ---
 7 files changed, 160 insertions(+), 155 deletions(-)

-- 
2.41.0
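[For readers not familiar with the mechanism the series builds on, here is a
minimal sketch of the proposed direction. The get_address_space() callback
already exists since ba7d12eb8c; the set_host_iova_regions() prototype, the
Range list and the vfio-pci call site below are illustrative assumptions
derived from the patch titles, not the actual patches.]

/*
 * Illustrative sketch only -- not the actual patches.
 */
typedef struct PCIIOMMUOps {
    AddressSpace * (*get_address_space)(PCIBus *bus, void *opaque, int devfn);
    /* proposed addition: convey host usable IOVA ranges for one device */
    int (*set_host_iova_regions)(PCIBus *bus, void *opaque, int devfn,
                                 GList *iova_ranges, Error **errp);
} PCIIOMMUOps;

/*
 * Hypothetical call site in vfio-pci realize, after the container's usable
 * IOVA ranges have been read from the host. A real implementation has to
 * walk up to the bus that actually registered the ops (the point of the
 * pci_device_iommu_bus() helper in patch 2); the direct access below is a
 * simplification.
 */
static int vfio_pci_pass_host_iova_ranges(PCIDevice *pdev,
                                          GList *iova_ranges, Error **errp)
{
    PCIBus *bus = pci_get_bus(pdev);

    if (bus->iommu_ops && bus->iommu_ops->set_host_iova_regions) {
        return bus->iommu_ops->set_host_iova_regions(bus, bus->iommu_opaque,
                                                     pdev->devfn,
                                                     iova_ranges, errp);
    }
    return 0;
}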
Hi Eric, >-----Original Message----- >From: Eric Auger <eric.auger@redhat.com> >Cc: mst@redhat.com; clg@redhat.com >Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling >for hotplugged devices > >In [1] we attempted to fix a case where a VFIO-PCI device protected >with a virtio-iommu was assigned to an x86 guest. On x86 the physical >IOMMU may have an address width (gaw) of 39 or 48 bits whereas the >virtio-iommu used to expose a 64b address space by default. >Hence the guest was trying to use the full 64b space and we hit >DMA MAP failures. To work around this issue we managed to pass >usable IOVA regions (excluding the out of range space) from VFIO >to the virtio-iommu device. This was made feasible by introducing >a new IOMMU Memory Region callback dubbed iommu_set_iova_regions(). >This latter gets called when the IOMMU MR is enabled which >causes the vfio_listener_region_add() to be called. > >However with VFIO-PCI hotplug, this technique fails due to the >race between the call to the callback in the add memory listener >and the virtio-iommu probe request. Indeed the probe request gets >called before the attach to the domain. So in that case the usable >regions are communicated after the probe request and fail to be >conveyed to the guest. To be honest the problem was hinted by >Jean-Philippe in [1] and I should have been more careful at >listening to him and testing with hotplug :-( It looks the global virtio_iommu_config.bypass is never cleared in guest. When guest virtio_iommu driver enable IOMMU, should it clear this bypass attribute? If it could be cleared in viommu_probe(), then qemu will call virtio_iommu_set_config() then virtio_iommu_switch_address_space_all() to enable IOMMU MR. Then both coldplugged and hotplugged devices will work. Intel iommu has a similar bit in register GCMD_REG.TE, when guest intel_iommu driver probe set it, on qemu side, vtd_address_space_refresh_all() is called to enable IOMMU MRs. > >For coldplugged device the technique works because we make sure all >the IOMMU MR are enabled once on the machine init done: 94df5b2180 >("virtio-iommu: Fix 64kB host page size VFIO device assignment") >for granule freeze. But I would be keen to get rid of this trick. > >Using an IOMMU MR Ops is unpractical because this relies on the IOMMU >MR to have been enabled and the corresponding vfio_listener_region_add() >to be executed. Instead this series proposes to replace the usage of this >API by the recently introduced PCIIOMMUOps: ba7d12eb8c ("hw/pci: >modify >pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be >called earlier, once the usable IOVA regions have been collected by >VFIO, without the need for the IOMMU MR to be enabled. > >This looks cleaner. In the short term this may also be used for >passing the page size mask, which would allow to get rid of the >hacky transient IOMMU MR enablement mentionned above. > >[1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space > https://lore.kernel.org/all/20231019134651.842175-1- >eric.auger@redhat.com/ > >[2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/ > >Extra Notes: >With that series, the reserved memory regions are communicated on time >so that the virtio-iommu probe request grabs them. However this is not >sufficient. In some cases (my case), I still see some DMA MAP failures >and the guest keeps on using IOVA ranges outside the geometry of the >physical IOMMU. This is due to the fact the VFIO-PCI device is in the >same iommu group as the pcie root port. 
>Normally the kernel
>iova_reserve_iommu_regions (dma-iommu.c) is supposed to call
>reserve_iova()
>for each reserved IOVA, which carves them out of the allocator. When
>iommu_dma_init_domain() gets called for the hotplugged vfio-pci device
>the iova domain is already allocated and set and we don't call
>iova_reserve_iommu_regions() again for the vfio-pci device. So its
>corresponding reserved regions are not properly taken into account.

I suspect there is the same issue with coldplugged devices. If those
devices are in the same group, iova_reserve_iommu_regions() is only
called for the first device, and the other devices' reserved regions
are missed.

Curious how you get the passthrough device and the pcie root port into
the same group. When I start an x86 guest with a passthrough device, I
see the passthrough device and the pcie root port in different groups:

-[0000:00]-+-00.0
           +-01.0
           +-02.0
           +-03.0-[01]----00.0

/sys/kernel/iommu_groups/3/devices:
0000:00:03.0
/sys/kernel/iommu_groups/7/devices:
0000:01:00.0

My qemu cmdline:
-device pcie-root-port,id=root0,slot=0
-device vfio-pci,host=6f:01.0,id=vfio0,bus=root0

Thanks
Zhenzhong

>
>This is not trivial to fix because theoretically the 1st attached
>devices could already have allocated IOVAs within the reserved regions
>of the second device. Also we are somehow hijacking the reserved
>memory regions to model the geometry of the physical IOMMU so not sure
>any attempt to fix that upstream will be accepted. At the moment one
>solution is to make sure assigned devices end up in singleton group.
>Another solution is to work on a different approach where the gaw
>can be passed as an option to the virtio-iommu device, similarly at
>what is done with intel iommu.
>
>This series can be found at:
>https://github.com/eauger/qemu/tree/hotplug-resv-rfc
>
>
>Eric Auger (7):
> hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions
> hw/pci: Introduce pci_device_iommu_bus
> vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps
> virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions
> virtio-iommu: Remove the implementation of iommu_set_iova_ranges
> hw/vfio: Remove memory_region_iommu_set_iova_ranges() call
> memory: Remove IOMMU MR iommu_set_iova_range API
>
> include/exec/memory.h | 32 -------
> include/hw/pci/pci.h | 16 ++++
> hw/pci/pci.c | 16 ++++
> hw/vfio/common.c | 10 --
> hw/vfio/pci.c | 27 ++++++
> hw/virtio/virtio-iommu.c | 201 ++++++++++++++++++++-------------------
> system/memory.c | 13 ---
> 7 files changed, 160 insertions(+), 155 deletions(-)
>
>--
>2.41.0
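[For context on the global bypass attribute discussed earlier in this
message: it lives in the virtio-iommu config space. The layout below is
reproduced from memory after the virtio-iommu spec and the Linux uapi
header (include/uapi/linux/virtio_iommu.h); check the header for the
authoritative definition. The behaviour described in the comment on the
bypass field is the one stated above in this thread.]

struct virtio_iommu_range_64 {
	__le64		start;
	__le64		end;
};

struct virtio_iommu_range_32 {
	__le32		start;
	__le32		end;
};

struct virtio_iommu_config {
	/* Supported page sizes */
	__le64				page_size_mask;
	/* Supported IOVA range */
	struct virtio_iommu_range_64	input_range;
	/* Max domain ID size */
	struct virtio_iommu_range_32	domain_range;
	/* Probe buffer size */
	__le32				probe_size;
	/*
	 * Global bypass knob, writable by the driver once
	 * VIRTIO_IOMMU_F_BYPASS_CONFIG is negotiated. A guest write is
	 * handled by QEMU in virtio_iommu_set_config(), which toggles the
	 * IOMMU memory regions via virtio_iommu_switch_address_space_all().
	 */
	__u8				bypass;
	__u8				reserved[3];
};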
Hi Zhenzhong, On 1/18/24 08:10, Duan, Zhenzhong wrote: > Hi Eric, > >> -----Original Message----- >> From: Eric Auger <eric.auger@redhat.com> >> Cc: mst@redhat.com; clg@redhat.com >> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling >> for hotplugged devices >> >> In [1] we attempted to fix a case where a VFIO-PCI device protected >> with a virtio-iommu was assigned to an x86 guest. On x86 the physical >> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the >> virtio-iommu used to expose a 64b address space by default. >> Hence the guest was trying to use the full 64b space and we hit >> DMA MAP failures. To work around this issue we managed to pass >> usable IOVA regions (excluding the out of range space) from VFIO >> to the virtio-iommu device. This was made feasible by introducing >> a new IOMMU Memory Region callback dubbed iommu_set_iova_regions(). >> This latter gets called when the IOMMU MR is enabled which >> causes the vfio_listener_region_add() to be called. >> >> However with VFIO-PCI hotplug, this technique fails due to the >> race between the call to the callback in the add memory listener >> and the virtio-iommu probe request. Indeed the probe request gets >> called before the attach to the domain. So in that case the usable >> regions are communicated after the probe request and fail to be >> conveyed to the guest. To be honest the problem was hinted by >> Jean-Philippe in [1] and I should have been more careful at >> listening to him and testing with hotplug :-( > It looks the global virtio_iommu_config.bypass is never cleared in guest. > When guest virtio_iommu driver enable IOMMU, should it clear this > bypass attribute? > If it could be cleared in viommu_probe(), then qemu will call > virtio_iommu_set_config() then virtio_iommu_switch_address_space_all() > to enable IOMMU MR. Then both coldplugged and hotplugged devices will work. this field is iommu wide while the probe applies on a one device.In general I would prefer not to be dependent on the MR enablement. We know that the device is likely to be protected and we can collect its requirements beforehand. > > Intel iommu has a similar bit in register GCMD_REG.TE, when guest > intel_iommu driver probe set it, on qemu side, vtd_address_space_refresh_all() > is called to enable IOMMU MRs. interesting. Would be curious to get Jean Philippe's pov. > >> For coldplugged device the technique works because we make sure all >> the IOMMU MR are enabled once on the machine init done: 94df5b2180 >> ("virtio-iommu: Fix 64kB host page size VFIO device assignment") >> for granule freeze. But I would be keen to get rid of this trick. >> >> Using an IOMMU MR Ops is unpractical because this relies on the IOMMU >> MR to have been enabled and the corresponding vfio_listener_region_add() >> to be executed. Instead this series proposes to replace the usage of this >> API by the recently introduced PCIIOMMUOps: ba7d12eb8c ("hw/pci: >> modify >> pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be >> called earlier, once the usable IOVA regions have been collected by >> VFIO, without the need for the IOMMU MR to be enabled. >> >> This looks cleaner. In the short term this may also be used for >> passing the page size mask, which would allow to get rid of the >> hacky transient IOMMU MR enablement mentionned above. 
>> >> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space >> https://lore.kernel.org/all/20231019134651.842175-1- >> eric.auger@redhat.com/ >> >> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/ >> >> Extra Notes: >> With that series, the reserved memory regions are communicated on time >> so that the virtio-iommu probe request grabs them. However this is not >> sufficient. In some cases (my case), I still see some DMA MAP failures >> and the guest keeps on using IOVA ranges outside the geometry of the >> physical IOMMU. This is due to the fact the VFIO-PCI device is in the >> same iommu group as the pcie root port. Normally the kernel >> iova_reserve_iommu_regions (dma-iommu.c) is supposed to call >> reserve_iova() >> for each reserved IOVA, which carves them out of the allocator. When >> iommu_dma_init_domain() gets called for the hotplugged vfio-pci device >> the iova domain is already allocated and set and we don't call >> iova_reserve_iommu_regions() again for the vfio-pci device. So its >> corresponding reserved regions are not properly taken into account. > I suspect there is same issue with coldplugged devices. If those devices > are in same group, get iova_reserve_iommu_regions() is only called > for first device. But other devices's reserved regions are missed. Correct > > Curious how you make passthrough device and pcie root port under same > group. > When I start a x86 guest with passthrough device, I see passthrough > device and pcie root port are in different group. > > -[0000:00]-+-00.0 > +-01.0 > +-02.0 > +-03.0-[01]----00.0 > > /sys/kernel/iommu_groups/3/devices: > 0000:00:03.0 > /sys/kernel/iommu_groups/7/devices: > 0000:01:00.0 > > My qemu cmdline: > -device pcie-root-port,id=root0,slot=0 > -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0 I just replayed the scenario: - if you have a coldplugged vfio-pci device, the pci root port and the passthroughed device end up in different iommu groups. On my end I use ioh3420 but you confirmed that's the same for the generic pcie-root-port - however if you hotplug the vfio-pci device that's a different story: they end up in the same group. Don't ask me why. I tried with both virtio-iommu and intel iommu and I end up with the same topology. That looks really weird to me. I initially thought this was an ACS issue but I am now puzzled. Thanks! Eric > > Thanks > Zhenzhong > >> This is not trivial to fix because theoretically the 1st attached >> devices could already have allocated IOVAs within the reserved regions >> of the second device. Also we are somehow hijacking the reserved >> memory regions to model the geometry of the physical IOMMU so not sure >> any attempt to fix that upstream will be accepted. At the moment one >> solution is to make sure assigned devices end up in singleton group. >> Another solution is to work on a different approach where the gaw >> can be passed as an option to the virtio-iommu device, similarly at >> what is done with intel iommu. 
>> >> This series can be found at: >> https://github.com/eauger/qemu/tree/hotplug-resv-rfc >> >> >> Eric Auger (7): >> hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions >> hw/pci: Introduce pci_device_iommu_bus >> vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps >> virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions >> virtio-iommu: Remove the implementation of iommu_set_iova_ranges >> hw/vfio: Remove memory_region_iommu_set_iova_ranges() call >> memory: Remove IOMMU MR iommu_set_iova_range API >> >> include/exec/memory.h | 32 ------- >> include/hw/pci/pci.h | 16 ++++ >> hw/pci/pci.c | 16 ++++ >> hw/vfio/common.c | 10 -- >> hw/vfio/pci.c | 27 ++++++ >> hw/virtio/virtio-iommu.c | 201 ++++++++++++++++++++------------------- >> system/memory.c | 13 --- >> 7 files changed, 160 insertions(+), 155 deletions(-) >> >> -- >> 2.41.0
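[As a reference for where the "usable IOVA ranges" in the patch titles come
from: VFIO reports them in the VFIO_IOMMU_GET_INFO capability chain, which
QEMU reads at container setup time and which the series forwards to the
vIOMMU. The relevant type1 uapi structures, as defined in <linux/vfio.h>,
are reproduced below for reference.]

struct vfio_info_cap_header {
	__u16	id;		/* VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE */
	__u16	version;
	__u32	next;
};

/* One IOVA window the host IOMMU can actually translate, i.e. the
 * aperture minus the host reserved regions. */
struct vfio_iova_range {
	__u64	start;
	__u64	end;
};

struct vfio_iommu_type1_info_cap_iova_range {
	struct vfio_info_cap_header	header;
	__u32				nr_iovas;
	__u32				reserved;
	struct vfio_iova_range		iova_ranges[];
};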
Hi, On Thu, Jan 18, 2024 at 10:43:55AM +0100, Eric Auger wrote: > Hi Zhenzhong, > On 1/18/24 08:10, Duan, Zhenzhong wrote: > > Hi Eric, > > > >> -----Original Message----- > >> From: Eric Auger <eric.auger@redhat.com> > >> Cc: mst@redhat.com; clg@redhat.com > >> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling > >> for hotplugged devices > >> > >> In [1] we attempted to fix a case where a VFIO-PCI device protected > >> with a virtio-iommu was assigned to an x86 guest. On x86 the physical > >> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the > >> virtio-iommu used to expose a 64b address space by default. > >> Hence the guest was trying to use the full 64b space and we hit > >> DMA MAP failures. To work around this issue we managed to pass > >> usable IOVA regions (excluding the out of range space) from VFIO > >> to the virtio-iommu device. This was made feasible by introducing > >> a new IOMMU Memory Region callback dubbed iommu_set_iova_regions(). > >> This latter gets called when the IOMMU MR is enabled which > >> causes the vfio_listener_region_add() to be called. > >> > >> However with VFIO-PCI hotplug, this technique fails due to the > >> race between the call to the callback in the add memory listener > >> and the virtio-iommu probe request. Indeed the probe request gets > >> called before the attach to the domain. So in that case the usable > >> regions are communicated after the probe request and fail to be > >> conveyed to the guest. To be honest the problem was hinted by > >> Jean-Philippe in [1] and I should have been more careful at > >> listening to him and testing with hotplug :-( > > It looks the global virtio_iommu_config.bypass is never cleared in guest. > > When guest virtio_iommu driver enable IOMMU, should it clear this > > bypass attribute? > > If it could be cleared in viommu_probe(), then qemu will call > > virtio_iommu_set_config() then virtio_iommu_switch_address_space_all() > > to enable IOMMU MR. Then both coldplugged and hotplugged devices will work. > > this field is iommu wide while the probe applies on a one device.In > general I would prefer not to be dependent on the MR enablement. We know > that the device is likely to be protected and we can collect its > requirements beforehand. > > > > > Intel iommu has a similar bit in register GCMD_REG.TE, when guest > > intel_iommu driver probe set it, on qemu side, vtd_address_space_refresh_all() > > is called to enable IOMMU MRs. > interesting. > > Would be curious to get Jean Philippe's pov. I'd rather not rely on this, it's hard to justify a driver change based only on QEMU internals. And QEMU can't count on the driver always clearing bypass. There could be situations where the guest can't afford to do it, like if an endpoint is owned by the firmware and has to keep running. There may be a separate argument for clearing bypass. With a coldplugged VFIO device the flow is: 1. Map the whole guest address space in VFIO to implement boot-bypass. This allocates all guest pages, which takes a while and is wasteful. I've actually crashed a host that way, when spawning a guest with too much RAM. 2. Start the VM 3. When the virtio-iommu driver attaches a (non-identity) domain to the assigned endpoint, then unmap the whole address space in VFIO, and most pages are given back to the host. We can't disable boot-bypass because the BIOS needs it. But instead the flow could be: 1. Start the VM, with only the virtual endpoints. Nothing to pin. 2. 
The virtio-iommu driver disables bypass during boot 3. Hotplug the VFIO device. With bypass disabled there is no need to pin the whole guest address space, unless the guest explicitly asks for an identity domain. However, I don't know if this is a realistic scenario that will actually be used. By the way, do you have an easy way to reproduce the issue described here? I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux just allocates 32-bit IOVAs. > > > >> For coldplugged device the technique works because we make sure all > >> the IOMMU MR are enabled once on the machine init done: 94df5b2180 > >> ("virtio-iommu: Fix 64kB host page size VFIO device assignment") > >> for granule freeze. But I would be keen to get rid of this trick. > >> > >> Using an IOMMU MR Ops is unpractical because this relies on the IOMMU > >> MR to have been enabled and the corresponding vfio_listener_region_add() > >> to be executed. Instead this series proposes to replace the usage of this > >> API by the recently introduced PCIIOMMUOps: ba7d12eb8c ("hw/pci: > >> modify > >> pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be > >> called earlier, once the usable IOVA regions have been collected by > >> VFIO, without the need for the IOMMU MR to be enabled. > >> > >> This looks cleaner. In the short term this may also be used for > >> passing the page size mask, which would allow to get rid of the > >> hacky transient IOMMU MR enablement mentionned above. > >> > >> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space > >> https://lore.kernel.org/all/20231019134651.842175-1- > >> eric.auger@redhat.com/ > >> > >> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/ > >> > >> Extra Notes: > >> With that series, the reserved memory regions are communicated on time > >> so that the virtio-iommu probe request grabs them. However this is not > >> sufficient. In some cases (my case), I still see some DMA MAP failures > >> and the guest keeps on using IOVA ranges outside the geometry of the > >> physical IOMMU. This is due to the fact the VFIO-PCI device is in the > >> same iommu group as the pcie root port. Normally the kernel > >> iova_reserve_iommu_regions (dma-iommu.c) is supposed to call > >> reserve_iova() > >> for each reserved IOVA, which carves them out of the allocator. When > >> iommu_dma_init_domain() gets called for the hotplugged vfio-pci device > >> the iova domain is already allocated and set and we don't call > >> iova_reserve_iommu_regions() again for the vfio-pci device. So its > >> corresponding reserved regions are not properly taken into account. > > I suspect there is same issue with coldplugged devices. If those devices > > are in same group, get iova_reserve_iommu_regions() is only called > > for first device. But other devices's reserved regions are missed. > > Correct > > > > Curious how you make passthrough device and pcie root port under same > > group. > > When I start a x86 guest with passthrough device, I see passthrough > > device and pcie root port are in different group. 
> >
> > -[0000:00]-+-00.0
> >            +-01.0
> >            +-02.0
> >            +-03.0-[01]----00.0
> >
> > /sys/kernel/iommu_groups/3/devices:
> > 0000:00:03.0
> > /sys/kernel/iommu_groups/7/devices:
> > 0000:01:00.0
> >
> > My qemu cmdline:
> > -device pcie-root-port,id=root0,slot=0
> > -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0
> 
> I just replayed the scenario:
> - if you have a coldplugged vfio-pci device, the pci root port and the
> passthroughed device end up in different iommu groups. On my end I use
> ioh3420 but you confirmed that's the same for the generic pcie-root-port
> - however if you hotplug the vfio-pci device that's a different story:
> they end up in the same group. Don't ask me why. I tried with
> both virtio-iommu and intel iommu and I end up with the same topology.
> That looks really weird to me.

It also took me a while to get hotplug to work on x86:
pcie_cap_slot_plug_cb() didn't get called, instead it would call
ich9_pm_device_plug_cb(). Not sure what I'm doing wrong.
To work around that I instantiated a second pxb-pcie root bus and then a
pcie root port on there. So my command-line looks like:

-device virtio-iommu
-device pxb-pcie,id=pcie.1,bus_nr=1
-device pcie-root-port,chassis=2,id=pcie.2,bus=pcie.1

device_add vfio-pci,host=00:04.0,bus=pcie.2

And somehow pcieport and the assigned device do end up in separate IOMMU
groups.

Thanks,
Jean
Hi Jean, On 1/25/24 19:48, Jean-Philippe Brucker wrote: > Hi, > > On Thu, Jan 18, 2024 at 10:43:55AM +0100, Eric Auger wrote: >> Hi Zhenzhong, >> On 1/18/24 08:10, Duan, Zhenzhong wrote: >>> Hi Eric, >>> >>>> -----Original Message----- >>>> From: Eric Auger <eric.auger@redhat.com> >>>> Cc: mst@redhat.com; clg@redhat.com >>>> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry handling >>>> for hotplugged devices >>>> >>>> In [1] we attempted to fix a case where a VFIO-PCI device protected >>>> with a virtio-iommu was assigned to an x86 guest. On x86 the physical >>>> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the >>>> virtio-iommu used to expose a 64b address space by default. >>>> Hence the guest was trying to use the full 64b space and we hit >>>> DMA MAP failures. To work around this issue we managed to pass >>>> usable IOVA regions (excluding the out of range space) from VFIO >>>> to the virtio-iommu device. This was made feasible by introducing >>>> a new IOMMU Memory Region callback dubbed iommu_set_iova_regions(). >>>> This latter gets called when the IOMMU MR is enabled which >>>> causes the vfio_listener_region_add() to be called. >>>> >>>> However with VFIO-PCI hotplug, this technique fails due to the >>>> race between the call to the callback in the add memory listener >>>> and the virtio-iommu probe request. Indeed the probe request gets >>>> called before the attach to the domain. So in that case the usable >>>> regions are communicated after the probe request and fail to be >>>> conveyed to the guest. To be honest the problem was hinted by >>>> Jean-Philippe in [1] and I should have been more careful at >>>> listening to him and testing with hotplug :-( >>> It looks the global virtio_iommu_config.bypass is never cleared in guest. >>> When guest virtio_iommu driver enable IOMMU, should it clear this >>> bypass attribute? >>> If it could be cleared in viommu_probe(), then qemu will call >>> virtio_iommu_set_config() then virtio_iommu_switch_address_space_all() >>> to enable IOMMU MR. Then both coldplugged and hotplugged devices will work. >> this field is iommu wide while the probe applies on a one device.In >> general I would prefer not to be dependent on the MR enablement. We know >> that the device is likely to be protected and we can collect its >> requirements beforehand. >> >>> Intel iommu has a similar bit in register GCMD_REG.TE, when guest >>> intel_iommu driver probe set it, on qemu side, vtd_address_space_refresh_all() >>> is called to enable IOMMU MRs. >> interesting. >> >> Would be curious to get Jean Philippe's pov. > I'd rather not rely on this, it's hard to justify a driver change based > only on QEMU internals. And QEMU can't count on the driver always clearing > bypass. There could be situations where the guest can't afford to do it, > like if an endpoint is owned by the firmware and has to keep running. > > There may be a separate argument for clearing bypass. With a coldplugged > VFIO device the flow is: > > 1. Map the whole guest address space in VFIO to implement boot-bypass. > This allocates all guest pages, which takes a while and is wasteful. > I've actually crashed a host that way, when spawning a guest with too > much RAM. interesting > 2. Start the VM > 3. When the virtio-iommu driver attaches a (non-identity) domain to the > assigned endpoint, then unmap the whole address space in VFIO, and most > pages are given back to the host. > > We can't disable boot-bypass because the BIOS needs it. 
But instead the > flow could be: > > 1. Start the VM, with only the virtual endpoints. Nothing to pin. > 2. The virtio-iommu driver disables bypass during boot We needed this boot-bypass mode for booting with virtio-blk-scsi protected with virtio-iommu for instance. That was needed because we don't have any virtio-iommu driver in edk2 as opposed to intel iommu driver, right? > 3. Hotplug the VFIO device. With bypass disabled there is no need to pin > the whole guest address space, unless the guest explicitly asks for an > identity domain. > > However, I don't know if this is a realistic scenario that will actually > be used. > > By the way, do you have an easy way to reproduce the issue described here? > I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux > just allocates 32-bit IOVAs. I don't have a simple generic reproducer. It happens when assigning this device: Ethernet Controller E810-C for QSFP (Ethernet Network Adapter E810-C-Q2) I have not encountered that issue with another device yet. I see on guest side in dmesg: [ 6.849292] ice 0000:00:05.0: Using 64-bit DMA addresses That's emitted in dma-iommu.c iommu_dma_alloc_iova(). Looks like the guest first tries to allocate an iova in the 32-bit AS and if this fails use the whole dma_limit. Seems the 32b IOVA alloc failed here ;-) Thanks Eric > >>>> For coldplugged device the technique works because we make sure all >>>> the IOMMU MR are enabled once on the machine init done: 94df5b2180 >>>> ("virtio-iommu: Fix 64kB host page size VFIO device assignment") >>>> for granule freeze. But I would be keen to get rid of this trick. >>>> >>>> Using an IOMMU MR Ops is unpractical because this relies on the IOMMU >>>> MR to have been enabled and the corresponding vfio_listener_region_add() >>>> to be executed. Instead this series proposes to replace the usage of this >>>> API by the recently introduced PCIIOMMUOps: ba7d12eb8c ("hw/pci: >>>> modify >>>> pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be >>>> called earlier, once the usable IOVA regions have been collected by >>>> VFIO, without the need for the IOMMU MR to be enabled. >>>> >>>> This looks cleaner. In the short term this may also be used for >>>> passing the page size mask, which would allow to get rid of the >>>> hacky transient IOMMU MR enablement mentionned above. >>>> >>>> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space >>>> https://lore.kernel.org/all/20231019134651.842175-1- >>>> eric.auger@redhat.com/ >>>> >>>> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/ >>>> >>>> Extra Notes: >>>> With that series, the reserved memory regions are communicated on time >>>> so that the virtio-iommu probe request grabs them. However this is not >>>> sufficient. In some cases (my case), I still see some DMA MAP failures >>>> and the guest keeps on using IOVA ranges outside the geometry of the >>>> physical IOMMU. This is due to the fact the VFIO-PCI device is in the >>>> same iommu group as the pcie root port. Normally the kernel >>>> iova_reserve_iommu_regions (dma-iommu.c) is supposed to call >>>> reserve_iova() >>>> for each reserved IOVA, which carves them out of the allocator. When >>>> iommu_dma_init_domain() gets called for the hotplugged vfio-pci device >>>> the iova domain is already allocated and set and we don't call >>>> iova_reserve_iommu_regions() again for the vfio-pci device. So its >>>> corresponding reserved regions are not properly taken into account. 
>>> I suspect there is same issue with coldplugged devices. If those devices >>> are in same group, get iova_reserve_iommu_regions() is only called >>> for first device. But other devices's reserved regions are missed. >> Correct >>> Curious how you make passthrough device and pcie root port under same >>> group. >>> When I start a x86 guest with passthrough device, I see passthrough >>> device and pcie root port are in different group. >>> >>> -[0000:00]-+-00.0 >>> +-01.0 >>> +-02.0 >>> +-03.0-[01]----00.0 >>> >>> /sys/kernel/iommu_groups/3/devices: >>> 0000:00:03.0 >>> /sys/kernel/iommu_groups/7/devices: >>> 0000:01:00.0 >>> >>> My qemu cmdline: >>> -device pcie-root-port,id=root0,slot=0 >>> -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0 >> I just replayed the scenario: >> - if you have a coldplugged vfio-pci device, the pci root port and the >> passthroughed device end up in different iommu groups. On my end I use >> ioh3420 but you confirmed that's the same for the generic pcie-root-port >> - however if you hotplug the vfio-pci device that's a different story: >> they end up in the same group. Don't ask me why. I tried with >> both virtio-iommu and intel iommu and I end up with the same topology. >> That looks really weird to me. > It also took me a while to get hotplug to work on x86: > pcie_cap_slot_plug_cb() didn't get called, instead it would call > ich9_pm_device_plug_cb(). Not sure what I'm doing wrong. > To work around that I instantiated a second pxb-pcie root bus and then a > pcie root port on there. So my command-line looks like: > > -device virtio-iommu > -device pxb-pcie,id=pcie.1,bus_nr=1 > -device pcie-root-port,chassis=2,id=pcie.2,bus=pcie.1 > > device_add vfio-pci,host=00:04.0,bus=pcie.2 > > And somehow pcieport and the assigned device do end up in separate IOMMU > groups. > > Thanks, > Jean >
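[To make the allocation pattern Eric describes above concrete -- first try
the 32-bit space, then fall back to the full DMA limit, which is when the
"Using 64-bit DMA addresses" message shows up -- here is a small
self-contained sketch of that strategy. It only paraphrases the behaviour
of iommu_dma_alloc_iova() in dma-iommu.c; the names and the toy allocator
hook are illustrative, not the kernel code.]

#include <stdbool.h>
#include <stdint.h>

#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Hypothetical allocator hook: returns 0 on failure. */
typedef uint64_t (*alloc_fn)(uint64_t size, uint64_t limit);

/*
 * Unless forcedac is set, first try to place the IOVA below 4G; only if
 * that fails, retry with the full dma_limit. A guest that exhausts the
 * 32-bit window therefore starts handing out 64-bit IOVAs, which may sit
 * outside the physical IOMMU geometry if the reserved regions were not
 * carved out of the allocator.
 */
static uint64_t alloc_iova_32_first(alloc_fn alloc, uint64_t size,
                                    uint64_t dma_limit, bool forcedac)
{
    uint64_t iova = 0;

    if (dma_limit > DMA_BIT_MASK(32) && !forcedac) {
        iova = alloc(size, DMA_BIT_MASK(32));
    }
    if (!iova) {
        iova = alloc(size, dma_limit);  /* "Using 64-bit DMA addresses" */
    }
    return iova;
}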
On Mon, Jan 29, 2024 at 05:38:55PM +0100, Eric Auger wrote: > > There may be a separate argument for clearing bypass. With a coldplugged > > VFIO device the flow is: > > > > 1. Map the whole guest address space in VFIO to implement boot-bypass. > > This allocates all guest pages, which takes a while and is wasteful. > > I've actually crashed a host that way, when spawning a guest with too > > much RAM. > interesting > > 2. Start the VM > > 3. When the virtio-iommu driver attaches a (non-identity) domain to the > > assigned endpoint, then unmap the whole address space in VFIO, and most > > pages are given back to the host. > > > > We can't disable boot-bypass because the BIOS needs it. But instead the > > flow could be: > > > > 1. Start the VM, with only the virtual endpoints. Nothing to pin. > > 2. The virtio-iommu driver disables bypass during boot > We needed this boot-bypass mode for booting with virtio-blk-scsi > protected with virtio-iommu for instance. > That was needed because we don't have any virtio-iommu driver in edk2 as > opposed to intel iommu driver, right? Yes. What I had in mind is the x86 SeaBIOS which doesn't have any IOMMU driver and accesses the default SATA device: $ qemu-system-x86_64 -M q35 -device virtio-iommu,boot-bypass=off qemu: virtio_iommu_translate sid=250 is not known!! qemu: no buffer available in event queue to report event qemu: AHCI: Failed to start FIS receive engine: bad FIS receive buffer address But it's the same problem with edk2. Also a guest OS without a virtio-iommu driver needs boot-bypass. Once firmware boot is complete, the OS with a virtio-iommu driver normally can turn bypass off in the config space, it's not useful anymore. If it needs to put some endpoints in bypass, then it can attach them to a bypass domain. > > 3. Hotplug the VFIO device. With bypass disabled there is no need to pin > > the whole guest address space, unless the guest explicitly asks for an > > identity domain. > > > > However, I don't know if this is a realistic scenario that will actually > > be used. > > > > By the way, do you have an easy way to reproduce the issue described here? > > I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux > > just allocates 32-bit IOVAs. > I don't have a simple generic reproducer. It happens when assigning this > device: > Ethernet Controller E810-C for QSFP (Ethernet Network Adapter E810-C-Q2) > > I have not encountered that issue with another device yet. > I see on guest side in dmesg: > [ 6.849292] ice 0000:00:05.0: Using 64-bit DMA addresses > > That's emitted in dma-iommu.c iommu_dma_alloc_iova(). > Looks like the guest first tries to allocate an iova in the 32-bit AS > and if this fails use the whole dma_limit. > Seems the 32b IOVA alloc failed here ;-) Interesting, are you running some demanding workload and a lot of CPUs? That's a lot of IOVAs used up, I'm curious about what kind of DMA pattern does that. Thanks, Jean
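[Regarding "attach them to a bypass domain" above: the driver can request
per-endpoint bypass by setting a flag in the ATTACH request, subject to
feature negotiation. The layout below is reproduced from memory after the
virtio-iommu spec and include/uapi/linux/virtio_iommu.h, shown only to make
that option concrete; check the header for the authoritative definition.]

struct virtio_iommu_req_attach {
	struct virtio_iommu_req_head	head;
	__le32				domain;
	__le32				endpoint;
	__le32				flags;
	__u8				reserved[4];
	struct virtio_iommu_req_tail	tail;
};

/* Attach the endpoint to a bypass domain: its DMA is left untranslated
 * without relying on the global config-space bypass knob. */
#define VIRTIO_IOMMU_ATTACH_F_BYPASS	(1 << 0)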
Hi Jean, On 1/30/24 19:22, Jean-Philippe Brucker wrote: > On Mon, Jan 29, 2024 at 05:38:55PM +0100, Eric Auger wrote: >>> There may be a separate argument for clearing bypass. With a coldplugged >>> VFIO device the flow is: >>> >>> 1. Map the whole guest address space in VFIO to implement boot-bypass. >>> This allocates all guest pages, which takes a while and is wasteful. >>> I've actually crashed a host that way, when spawning a guest with too >>> much RAM. >> interesting >>> 2. Start the VM >>> 3. When the virtio-iommu driver attaches a (non-identity) domain to the >>> assigned endpoint, then unmap the whole address space in VFIO, and most >>> pages are given back to the host. >>> >>> We can't disable boot-bypass because the BIOS needs it. But instead the >>> flow could be: >>> >>> 1. Start the VM, with only the virtual endpoints. Nothing to pin. >>> 2. The virtio-iommu driver disables bypass during boot >> We needed this boot-bypass mode for booting with virtio-blk-scsi >> protected with virtio-iommu for instance. >> That was needed because we don't have any virtio-iommu driver in edk2 as >> opposed to intel iommu driver, right? > Yes. What I had in mind is the x86 SeaBIOS which doesn't have any IOMMU > driver and accesses the default SATA device: > > $ qemu-system-x86_64 -M q35 -device virtio-iommu,boot-bypass=off > qemu: virtio_iommu_translate sid=250 is not known!! > qemu: no buffer available in event queue to report event > qemu: AHCI: Failed to start FIS receive engine: bad FIS receive buffer address > > But it's the same problem with edk2. Also a guest OS without a > virtio-iommu driver needs boot-bypass. Once firmware boot is complete, the > OS with a virtio-iommu driver normally can turn bypass off in the config > space, it's not useful anymore. If it needs to put some endpoints in > bypass, then it can attach them to a bypass domain. yup > >>> 3. Hotplug the VFIO device. With bypass disabled there is no need to pin >>> the whole guest address space, unless the guest explicitly asks for an >>> identity domain. >>> >>> However, I don't know if this is a realistic scenario that will actually >>> be used. >>> >>> By the way, do you have an easy way to reproduce the issue described here? >>> I've had to enable iommu.forcedac=1 on the command-line, otherwise Linux >>> just allocates 32-bit IOVAs. >> I don't have a simple generic reproducer. It happens when assigning this >> device: >> Ethernet Controller E810-C for QSFP (Ethernet Network Adapter E810-C-Q2) >> >> I have not encountered that issue with another device yet. >> I see on guest side in dmesg: >> [ 6.849292] ice 0000:00:05.0: Using 64-bit DMA addresses >> >> That's emitted in dma-iommu.c iommu_dma_alloc_iova(). >> Looks like the guest first tries to allocate an iova in the 32-bit AS >> and if this fails use the whole dma_limit. >> Seems the 32b IOVA alloc failed here ;-) > Interesting, are you running some demanding workload and a lot of CPUs? > That's a lot of IOVAs used up, I'm curious about what kind of DMA pattern > does that. Well nothing smart, just booting the guest with the assigned NIC. 8 vcpus Thanks Eric > > Thanks, > Jean >
>-----Original Message----- >From: Eric Auger <eric.auger@redhat.com> >Subject: Re: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry >handling for hotplugged devices > >Hi Zhenzhong, >On 1/18/24 08:10, Duan, Zhenzhong wrote: >> Hi Eric, >> >>> -----Original Message----- >>> From: Eric Auger <eric.auger@redhat.com> >>> Cc: mst@redhat.com; clg@redhat.com >>> Subject: [RFC 0/7] VIRTIO-IOMMU/VFIO: Fix host iommu geometry >handling >>> for hotplugged devices >>> >>> In [1] we attempted to fix a case where a VFIO-PCI device protected >>> with a virtio-iommu was assigned to an x86 guest. On x86 the physical >>> IOMMU may have an address width (gaw) of 39 or 48 bits whereas the >>> virtio-iommu used to expose a 64b address space by default. >>> Hence the guest was trying to use the full 64b space and we hit >>> DMA MAP failures. To work around this issue we managed to pass >>> usable IOVA regions (excluding the out of range space) from VFIO >>> to the virtio-iommu device. This was made feasible by introducing >>> a new IOMMU Memory Region callback dubbed >iommu_set_iova_regions(). >>> This latter gets called when the IOMMU MR is enabled which >>> causes the vfio_listener_region_add() to be called. >>> >>> However with VFIO-PCI hotplug, this technique fails due to the >>> race between the call to the callback in the add memory listener >>> and the virtio-iommu probe request. Indeed the probe request gets >>> called before the attach to the domain. So in that case the usable >>> regions are communicated after the probe request and fail to be >>> conveyed to the guest. To be honest the problem was hinted by >>> Jean-Philippe in [1] and I should have been more careful at >>> listening to him and testing with hotplug :-( >> It looks the global virtio_iommu_config.bypass is never cleared in guest. >> When guest virtio_iommu driver enable IOMMU, should it clear this >> bypass attribute? >> If it could be cleared in viommu_probe(), then qemu will call >> virtio_iommu_set_config() then virtio_iommu_switch_address_space_all() >> to enable IOMMU MR. Then both coldplugged and hotplugged devices will >work. > >this field is iommu wide while the probe applies on a one device.In >general I would prefer not to be dependent on the MR enablement. We know >that the device is likely to be protected and we can collect its >requirements beforehand. Agree that your new patch is cleaner. >> >> Intel iommu has a similar bit in register GCMD_REG.TE, when guest >> intel_iommu driver probe set it, on qemu side, >vtd_address_space_refresh_all() >> is called to enable IOMMU MRs. >interesting. > >Would be curious to get Jean Philippe's pov. >> >>> For coldplugged device the technique works because we make sure all >>> the IOMMU MR are enabled once on the machine init done: 94df5b2180 >>> ("virtio-iommu: Fix 64kB host page size VFIO device assignment") >>> for granule freeze. But I would be keen to get rid of this trick. >>> >>> Using an IOMMU MR Ops is unpractical because this relies on the >IOMMU >>> MR to have been enabled and the corresponding >vfio_listener_region_add() >>> to be executed. Instead this series proposes to replace the usage of this >>> API by the recently introduced PCIIOMMUOps: ba7d12eb8c ("hw/pci: >>> modify >>> pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be >>> called earlier, once the usable IOVA regions have been collected by >>> VFIO, without the need for the IOMMU MR to be enabled. >>> >>> This looks cleaner. 
In the short term this may also be used for >>> passing the page size mask, which would allow to get rid of the >>> hacky transient IOMMU MR enablement mentionned above. >>> >>> [1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA >space >>> https://lore.kernel.org/all/20231019134651.842175-1- >>> eric.auger@redhat.com/ >>> >>> [2] https://lore.kernel.org/all/20230929161547.GB2957297@myrica/ >>> >>> Extra Notes: >>> With that series, the reserved memory regions are communicated on >time >>> so that the virtio-iommu probe request grabs them. However this is not >>> sufficient. In some cases (my case), I still see some DMA MAP failures >>> and the guest keeps on using IOVA ranges outside the geometry of the >>> physical IOMMU. This is due to the fact the VFIO-PCI device is in the >>> same iommu group as the pcie root port. Normally the kernel >>> iova_reserve_iommu_regions (dma-iommu.c) is supposed to call >>> reserve_iova() >>> for each reserved IOVA, which carves them out of the allocator. When >>> iommu_dma_init_domain() gets called for the hotplugged vfio-pci device >>> the iova domain is already allocated and set and we don't call >>> iova_reserve_iommu_regions() again for the vfio-pci device. So its >>> corresponding reserved regions are not properly taken into account. >> I suspect there is same issue with coldplugged devices. If those devices >> are in same group, get iova_reserve_iommu_regions() is only called >> for first device. But other devices's reserved regions are missed. > >Correct >> >> Curious how you make passthrough device and pcie root port under same >> group. >> When I start a x86 guest with passthrough device, I see passthrough >> device and pcie root port are in different group. >> >> -[0000:00]-+-00.0 >> +-01.0 >> +-02.0 >> +-03.0-[01]----00.0 >> >> /sys/kernel/iommu_groups/3/devices: >> 0000:00:03.0 >> /sys/kernel/iommu_groups/7/devices: >> 0000:01:00.0 >> >> My qemu cmdline: >> -device pcie-root-port,id=root0,slot=0 >> -device vfio-pci,host=6f:01.0,id=vfio0,bus=root0 > >I just replayed the scenario: >- if you have a coldplugged vfio-pci device, the pci root port and the >passthroughed device end up in different iommu groups. On my end I use >ioh3420 but you confirmed that's the same for the generic pcie-root-port >- however if you hotplug the vfio-pci device that's a different story: >they end up in the same group. Don't ask me why. I tried with >both virtio-iommu and intel iommu and I end up with the same topology. >That looks really weird to me. That's strange. I tested two vfio devices with ioh3420, one coldplug, the other hotplug. ioh3420 and vfio devices are in same group. -[0000:00]-+-00.0 +-01.0 +-02.0 +-03.0-[01]----00.0 +-04.0-[02]----00.0 /sys/kernel/iommu_groups/3/devices: 0000:00:03.0 0000:01:00.0 /sys/kernel/iommu_groups/4/devices: 0000:00:04.0 0000:02:00.0 Thanks Zhenzhong > >I initially thought this was an ACS issue but I am now puzzled. > >Thanks! > >Eric >> >> Thanks >> Zhenzhong >> >>> This is not trivial to fix because theoretically the 1st attached >>> devices could already have allocated IOVAs within the reserved regions >>> of the second device. Also we are somehow hijacking the reserved >>> memory regions to model the geometry of the physical IOMMU so not >sure >>> any attempt to fix that upstream will be accepted. At the moment one >>> solution is to make sure assigned devices end up in singleton group. 
>>> Another solution is to work on a different approach where the gaw >>> can be passed as an option to the virtio-iommu device, similarly at >>> what is done with intel iommu. >>> >>> This series can be found at: >>> https://github.com/eauger/qemu/tree/hotplug-resv-rfc >>> >>> >>> Eric Auger (7): >>> hw/pci: Introduce PCIIOMMUOps::set_host_iova_regions >>> hw/pci: Introduce pci_device_iommu_bus >>> vfio/pci: Pass the usable IOVA ranges through PCIIOMMUOps >>> virtio-iommu: Implement PCIIOMMUOps set_host_resv_regions >>> virtio-iommu: Remove the implementation of iommu_set_iova_ranges >>> hw/vfio: Remove memory_region_iommu_set_iova_ranges() call >>> memory: Remove IOMMU MR iommu_set_iova_range API >>> >>> include/exec/memory.h | 32 ------- >>> include/hw/pci/pci.h | 16 ++++ >>> hw/pci/pci.c | 16 ++++ >>> hw/vfio/common.c | 10 -- >>> hw/vfio/pci.c | 27 ++++++ >>> hw/virtio/virtio-iommu.c | 201 ++++++++++++++++++++------------------- >>> system/memory.c | 13 --- >>> 7 files changed, 160 insertions(+), 155 deletions(-) >>> >>> -- >>> 2.41.0