intel_iommu: Reset vIOMMU after all the rest of devices

[PATCH 4/4] intel_iommu: Reset vIOMMU at the last stage of system reset

Posted by peterx@redhat.com 2 years ago

From: Peter Xu <peterx@redhat.com>

We got report from Yanghang Liu on an unexpected host DMA error when system
resets with VFIO attached to vIOMMU in the VM context.  Alex Williamson
quickly spot that there can be ordering issues on resets.  A further test
verified that the issue is indeed caused by such wrong ordering.

vIOMMU is a fundamentally infrustructural device, its reset is currently
problematic because no ordering is guaranteed against other PCI devices
which may DMA through the vIOMMU device.

The reset order is tricky, not only because it's current representation as
a normal "-device" (so it kind of follow the qdev tree depth-first reset,
but at a wrong place in the qtree; ideally it should be the parent
somewhere for all pci buses, or just part of pci host bridge), but also
because customized device reset hooks registered over the system reset
framework, so that the ordering of the vIOMMU reset is not guaranteed.

For example, VFIO can register its reset hook with vfio_reset_handler() if
some device does not support FLR.  That will not so far follow the
depth-first travelsal reset mechanism provided by QEMU reset framework.

To remedy both of the issues with limited code changes, leverage the newly
introduced reset stage framework to reset vIOMMUs at the last stage of the
rest devices.  More information can be found in the comments in the patch,
which I decided to persist even with the code to make the problem even
clearer (with potential TODOs for the future, if possible).

Buglink: https://issues.redhat.com/browse/RHEL-7188
Analyzed-by: Alex Williamson <alex.williamson@redhat.com>
Reported-by: YangHang Liu <yanghliu@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 54 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 52 insertions(+), 2 deletions(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 8b467cbbd2..5a8fbcad7a 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -35,6 +35,7 @@
 #include "sysemu/kvm.h"
 #include "sysemu/dma.h"
 #include "sysemu/sysemu.h"
+#include "sysemu/reset.h"
 #include "hw/i386/apic_internal.h"
 #include "kvm/kvm_i386.h"
 #include "migration/vmstate.h"
@@ -4086,7 +4087,7 @@ static void vtd_init(IntelIOMMUState *s)
 /* Should not reset address_spaces when reset because devices will still use
  * the address space they got at first (won't ask the bus again).
  */
-static void vtd_reset(DeviceState *dev)
+static void vtd_reset(void *dev)
 {
     IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
 
@@ -4242,7 +4243,6 @@ static void vtd_class_init(ObjectClass *klass, void *data)
     DeviceClass *dc = DEVICE_CLASS(klass);
     X86IOMMUClass *x86_class = X86_IOMMU_DEVICE_CLASS(klass);
 
-    dc->reset = vtd_reset;
     dc->vmsd = &vtd_vmstate;
     device_class_set_props(dc, vtd_properties);
     dc->hotpluggable = false;
@@ -4254,10 +4254,60 @@ static void vtd_class_init(ObjectClass *klass, void *data)
     dc->desc = "Intel IOMMU (VT-d) DMA Remapping device";
 }
 
+static void vtd_instance_init(Object *obj)
+{
+    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(obj);
+
+    /*
+     * vIOMMU reset may require proper ordering with other devices.  There
+     * are two complexities so that normal DeviceState.reset() may not
+     * work properly for vIOMMUs:
+     *
+     * (1) Device depth-first reset hierachy doesn't yet work for vIOMMUs
+     *     (reference: resettable_cold_reset_fn())
+     *
+     *     Currently, vIOMMU devices are created as normal '-device'
+     *     cmdlines.  It means in many ways it has the same attributes with
+     *     most of the rest devices, even if the rest devices should
+     *     logically be under control of the vIOMMU unit.
+     *
+     *     One side effect of it is vIOMMU devices will be currently put
+     *     randomly under qdev tree hierarchy, which is the source of
+     *     device reset ordering in current QEMU (depth-first traversal).
+     *     It means vIOMMU now can be reset before some devices.  For fully
+     *     emulated devices that's not a problem, because the traversal
+     *     holds BQL for the whole process.  However it is a problem if DMA
+     *     can happen without BQL, like VFIO, vDPA or remote device process.
+     *
+     *     TODO: one ideal solution can be that we make vIOMMU the parent
+     *     of the whole pci host bridge.  Hence vIOMMU can be reset after
+     *     all the devices are reset and quiesced.
+     *
+     * (2) Some devices register its own reset functions
+     *
+     *     Even if above issue solved, if devices register its own reset
+     *     functions for some reason via QEMU reset hooks, vIOMMU can still
+     *     be reset before the device. One example is vfio_reset_handler()
+     *     where FLR is not supported on the device.
+     *
+     *     TODO: merge relevant reset functions into the device tree reset
+     *     framework.
+     *
+     * Neither of the above TODO may be trivial.  To make it work for now,
+     * leverage reset stages and reset vIOMMU always at latter stage of the
+     * default.  It means it needs to be reset after at least:
+     *
+     *   - resettable_cold_reset_fn(): machine qdev tree reset
+     *   - vfio_reset_handler():       VFIO reset for !FLR
+     */
+    qemu_register_reset_one(vtd_reset, s, false, 1);
+}
+
 static const TypeInfo vtd_info = {
     .name          = TYPE_INTEL_IOMMU_DEVICE,
     .parent        = TYPE_X86_IOMMU_DEVICE,
     .instance_size = sizeof(IntelIOMMUState),
+    .instance_init = vtd_instance_init,
     .class_init    = vtd_class_init,
 };
 
-- 
2.43.0

Re: [PATCH 4/4] intel_iommu: Reset vIOMMU at the last stage of system reset

Posted by Eric Auger 2 years ago

Hi Peter,

On 1/17/24 10:15, peterx@redhat.com wrote:
> From: Peter Xu <peterx@redhat.com>
>
> We got report from Yanghang Liu on an unexpected host DMA error when system
> resets with VFIO attached to vIOMMU in the VM context.  Alex Williamson
> quickly spot that there can be ordering issues on resets.  A further test
> verified that the issue is indeed caused by such wrong ordering.
nit: not sure the commit msg should contain people info (cover letter
does it already + credits below)
>
> vIOMMU is a fundamentally infrustructural device, its reset is currently

infrastructural

> problematic because no ordering is guaranteed against other PCI devices
> which may DMA through the vIOMMU device.
>
> The reset order is tricky, not only because it's current representation as
s/it's/its
> a normal "-device" (so it kind of follow the qdev tree depth-first reset,
> but at a wrong place in the qtree; ideally it should be the parent
> somewhere for all pci buses, or just part of pci host bridge), but also
> because customized device reset hooks registered over the system reset
> framework, so that the ordering of the vIOMMU reset is not guaranteed.
>
> For example, VFIO can register its reset hook with vfio_reset_handler() if
> some device does not support FLR.  That will not so far follow the
> depth-first travelsal reset mechanism provided by QEMU reset framework.
traversal
>
> To remedy both of the issues with limited code changes, leverage the newly
> introduced reset stage framework to reset vIOMMUs at the last stage of the
> rest devices.  More information can be found in the comments in the patch,
> which I decided to persist even with the code to make the problem even
> clearer (with potential TODOs for the future, if possible).
>
> Buglink: https://issues.redhat.com/browse/RHEL-7188
> Analyzed-by: Alex Williamson <alex.williamson@redhat.com>
> Reported-by: YangHang Liu <yanghliu@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/i386/intel_iommu.c | 54 +++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 52 insertions(+), 2 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 8b467cbbd2..5a8fbcad7a 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -35,6 +35,7 @@
>  #include "sysemu/kvm.h"
>  #include "sysemu/dma.h"
>  #include "sysemu/sysemu.h"
> +#include "sysemu/reset.h"
>  #include "hw/i386/apic_internal.h"
>  #include "kvm/kvm_i386.h"
>  #include "migration/vmstate.h"
> @@ -4086,7 +4087,7 @@ static void vtd_init(IntelIOMMUState *s)
>  /* Should not reset address_spaces when reset because devices will still use
>   * the address space they got at first (won't ask the bus again).
>   */
> -static void vtd_reset(DeviceState *dev)
> +static void vtd_reset(void *dev)
>  {
>      IntelIOMMUState *s = INTEL_IOMMU_DEVICE(dev);
>  
> @@ -4242,7 +4243,6 @@ static void vtd_class_init(ObjectClass *klass, void *data)
>      DeviceClass *dc = DEVICE_CLASS(klass);
>      X86IOMMUClass *x86_class = X86_IOMMU_DEVICE_CLASS(klass);
>  
> -    dc->reset = vtd_reset;
>      dc->vmsd = &vtd_vmstate;
>      device_class_set_props(dc, vtd_properties);
>      dc->hotpluggable = false;
> @@ -4254,10 +4254,60 @@ static void vtd_class_init(ObjectClass *klass, void *data)
>      dc->desc = "Intel IOMMU (VT-d) DMA Remapping device";
>  }
>  
> +static void vtd_instance_init(Object *obj)
> +{
> +    IntelIOMMUState *s = INTEL_IOMMU_DEVICE(obj);
> +
> +    /*
> +     * vIOMMU reset may require proper ordering with other devices.  There
> +     * are two complexities so that normal DeviceState.reset() may not
> +     * work properly for vIOMMUs:
> +     *
> +     * (1) Device depth-first reset hierachy doesn't yet work for vIOMMUs
> +     *     (reference: resettable_cold_reset_fn())
> +     *
> +     *     Currently, vIOMMU devices are created as normal '-device'
> +     *     cmdlines.  It means in many ways it has the same attributes with
s/with/as
> +     *     most of the rest devices, even if the rest devices should
s/rest/other
> +     *     logically be under control of the vIOMMU unit.
> +     *
> +     *     One side effect of it is vIOMMU devices will be currently put
> +     *     randomly under qdev tree hierarchy, which is the source of
> +     *     device reset ordering in current QEMU (depth-first traversal).
> +     *     It means vIOMMU now can be reset before some devices.  For fully
> +     *     emulated devices that's not a problem, because the traversal
> +     *     holds BQL for the whole process.  However it is a problem if DMA
> +     *     can happen without BQL, like VFIO, vDPA or remote device process.
> +     *
> +     *     TODO: one ideal solution can be that we make vIOMMU the parent
> +     *     of the whole pci host bridge.  Hence vIOMMU can be reset after
> +     *     all the devices are reset and quiesced.
in hw/pci/pci.c there is also the info iommu_bus? When resetting a pci
device know whether it is attached to a viommu. I don't know if we could
plug the reset priority mechanism at this level.
> +     *
> +     * (2) Some devices register its own reset functions
> +     *
> +     *     Even if above issue solved, if devices register its own reset
s/its/their
> +     *     functions for some reason via QEMU reset hooks, vIOMMU can still
> +     *     be reset before the device. One example is vfio_reset_handler()
> +     *     where FLR is not supported on the device.
> +     *
> +     *     TODO: merge relevant reset functions into the device tree reset
> +     *     framework.
> +     *
> +     * Neither of the above TODO may be trivial.  To make it work for now,
> +     * leverage reset stages and reset vIOMMU always at latter stage of the
> +     * default.  It means it needs to be reset after at least:
> +     *
> +     *   - resettable_cold_reset_fn(): machine qdev tree reset
> +     *   - vfio_reset_handler():       VFIO reset for !FLR
> +     */
> +    qemu_register_reset_one(vtd_reset, s, false, 1);
introducing enum values may be better ( just as we have for migration prio)
> +}
> +
>  static const TypeInfo vtd_info = {
>      .name          = TYPE_INTEL_IOMMU_DEVICE,
>      .parent        = TYPE_X86_IOMMU_DEVICE,
>      .instance_size = sizeof(IntelIOMMUState),
> +    .instance_init = vtd_instance_init,
>      .class_init    = vtd_class_init,
>  };
>  
Thanks

Eric

[PATCH 1/4] reset: qemu_register_reset_one()
[PATCH 2/4] reset: Allow multiple stages of system resets
[PATCH 3/4] intel_iommu: Tear down address spaces before IOMMU reset
[PATCH 4/4] intel_iommu: Reset vIOMMU at the last stage of system reset