This series is based on Zhenzhong's HostIOMMUDevice series:

[PATCH v7 00/17] Add a host IOMMU device abstraction to check with vIOMMU
https://lore.kernel.org/all/20240605083043.317831-1-zhenzhong.duan@intel.com/

It allows conveying host IOVA reserved regions to the virtio-iommu and
uses the HostIOMMUDevice infrastructure. This replaces the usage of
IOMMU MR ops, which fail to satisfy this need for hotplugged devices.

See below for additional background.

In [1] we attempted to fix a case where a VFIO-PCI device protected by
a virtio-iommu was assigned to an x86 guest. On x86 the physical IOMMU
may have an address width (gaw) of 39 or 48 bits, whereas the
virtio-iommu used to expose a 64b address space by default. Hence the
guest was trying to use the full 64b space and we hit DMA MAP
failures. To work around this issue we managed to pass usable IOVA
regions (excluding the out-of-range space) from VFIO to the
virtio-iommu device. This was made feasible by introducing a new IOMMU
Memory Region callback dubbed iommu_set_iova_regions(). The latter
gets called when the IOMMU MR is enabled, which causes
vfio_listener_region_add() to be called.

For coldplugged devices the technique works because we make sure all
the IOMMU MRs are enabled once machine init is done: 94df5b2180
("virtio-iommu: Fix 64kB host page size VFIO device assignment") for
granule freeze. But I would be keen to get rid of this trick.

However, with VFIO-PCI hotplug this technique fails due to the race
between the call to the callback in the memory listener and the
virtio-iommu probe request. Indeed the probe request gets called
before the attach to the domain, so in that case the usable regions
are communicated after the probe request and fail to be conveyed to
the guest.

Using the IOMMU MR ops is impractical because it relies on the IOMMU
MR having been enabled and the corresponding
vfio_listener_region_add() having been executed.
Instead, this series proposes to replace the usage of this API with
the recently introduced PCIIOMMUOps: ba7d12eb8c ("hw/pci: modify
pci_setup_iommu() to set PCIIOMMUOps"). That way, the callback can be
called earlier, once the usable IOVA regions have been collected by
VFIO, without the need for the IOMMU MR to be enabled.

This series also removes the spurious message:

qemu-system-aarch64: warning: virtio-iommu-memory-region-7-0: Notified about new host reserved regions after probe

In the short term this may also be used for passing the page size
mask, which would allow getting rid of the hacky transient IOMMU MR
enablement mentioned above.

[1] [PATCH v4 00/12] VIRTIO-IOMMU/VFIO: Don't assume 64b IOVA space
    https://lore.kernel.org/all/20231019134651.842175-1-eric.auger@redhat.com/

Extra Notes:
With this series, the reserved memory regions are communicated in time
for the virtio-iommu probe request to grab them. However this is not
sufficient. In some cases (mine, for instance), I still see some DMA
MAP failures and the guest keeps on using IOVA ranges outside the
geometry of the physical IOMMU. This is due to the fact that the
VFIO-PCI device is in the same iommu group as the PCIe root port.
Normally the kernel's iova_reserve_iommu_regions() (dma-iommu.c) is
supposed to call reserve_iova() for each reserved IOVA, which carves
them out of the allocator. When iommu_dma_init_domain() gets called
for the hotplugged vfio-pci device, the iova domain is already
allocated and set, and we don't call iova_reserve_iommu_regions()
again for the vfio-pci device. So its corresponding reserved regions
are not properly taken into account.

This is not trivial to fix because theoretically the first attached
device could already have allocated IOVAs within the reserved regions
of the second device. Also we are somehow hijacking the reserved
memory regions to model the geometry of the physical IOMMU, so I am
not sure any attempt to fix that upstream will be accepted.
At the moment one solution is to make sure assigned devices end up in
a singleton group. Another solution is to work on a different approach
where the gaw can be passed as an option to the virtio-iommu device,
similarly to what is done with the intel-iommu.

This series can be found at:
https://github.com/eauger/qemu/tree/iommufd_nesting_preq_v7_resv_regions_v4

History:
v3 -> v4:
- add one patch to add the aliased pci bus and devfn in the HostIOMMUDevice
- use those for resv regions computation
- remove VirtioHostIOMMUDevice and simply use the base object

v2 -> v3:
- moved the series from RFC to patch
- collected Zhenzhong's R-bs and took into account most of his comments
  (see replies on v2)

Eric Auger (8):
  HostIOMMUDevice: Store the VFIO/VDPA agent
  virtio-iommu: Implement set|unset]_iommu_device() callbacks
  HostIOMMUDevice: Introduce get_iova_ranges callback
  HostIOMMUDevice: Store the aliased bus and devfn
  virtio-iommu: Compute host reserved regions
  virtio-iommu: Remove the implementation of iommu_set_iova_range
  hw/vfio: Remove memory_region_iommu_set_iova_ranges() call
  memory: Remove IOMMU MR iommu_set_iova_range API

 include/exec/memory.h              |  32 ----
 include/hw/virtio/virtio-iommu.h   |   2 +
 include/sysemu/host_iommu_device.h |  11 ++
 hw/pci/pci.c                       |   8 +-
 hw/vfio/common.c                   |  10 -
 hw/vfio/container.c                |  17 ++
 hw/vfio/iommufd.c                  |  18 ++
 hw/virtio/virtio-iommu.c           | 296 +++++++++++++++++++----------
 system/memory.c                    |  13 --
 9 files changed, 249 insertions(+), 158 deletions(-)

-- 
2.41.0
On Fri, Jun 14, 2024 at 11:52:50AM +0200, Eric Auger wrote:
> This series is based on Zhenzhong HostIOMMUDevice:
>
> [PATCH v7 00/17] Add a host IOMMU device abstraction to check with vIOMMU
> https://lore.kernel.org/all/20240605083043.317831-1-zhenzhong.duan@intel.com/
>
> It allows to convey host IOVA reserved regions to the virtio-iommu and
> uses the HostIOMMUDevice infrastructure. This replaces the usage of
> IOMMU MR ops which fail to satisfy this need for hotplugged devices.
>
> See below for additional background.

Reviewed-by: Michael S. Tsirkin <mst@redhat.com>

Should likely be merged together with the dependency.
I can either merge both this one and the dependency,
or Alex can do that because of the vfio changes.

[...]
>-----Original Message-----
>From: Eric Auger <eric.auger@redhat.com>
>Subject: [PATCH v4 0/8] VIRTIO-IOMMU/VFIO: Fix host iommu geometry
>handling for hotplugged devices
>
>This series is based on Zhenzhong HostIOMMUDevice:
>
>[PATCH v7 00/17] Add a host IOMMU device abstraction to check with
>vIOMMU
>https://lore.kernel.org/all/20240605083043.317831-1-zhenzhong.duan@intel.com/
>
>[...]
>
>This series can be found at:
>https://github.com/eauger/qemu/tree/iommufd_nesting_preq_v7_resv_regions_v4

For the whole series,

Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>

Thanks
Zhenzhong
On 6/14/24 11:52 AM, Eric Auger wrote:
> This series is based on Zhenzhong HostIOMMUDevice:
>
> [PATCH v7 00/17] Add a host IOMMU device abstraction to check with vIOMMU
> https://lore.kernel.org/all/20240605083043.317831-1-zhenzhong.duan@intel.com/
>
> [...]

Applied to vfio-next.

Thanks,

C.