hotplug: fix premature rebinding of VFIO devices to host

[libvirt] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host

Posted by Michael Roth 8 years, 7 months ago

Hi everyone. Hoping to get some feedback on this approach, or some
alternatives proposed below, to the following issue:

Currently libvirt immediately attempts to rebind a managed device back to the
host driver when it receives a DEVICE_DELETED event from QEMU. This is
problematic for 2 reasons:

1) If multiple devices from a group are attached to a guest, this can move
   the group into a "non-viable" state where some devices are assigned to
   the host and some to the guest.

2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
   where additional cleanup occurs. In most cases libvirt can ignore this
   cleanup, but in the case of VFIO devices this is where closing of a VFIO
   group FD occurs, and failing to wait before rebinding the device to the
   host driver can result in unexpected behavior. In the case of powernv
   hosts at least, this can lead to a host driver crashing due to the default
   DMA windows not having been fully-restored yet. The window between this is
   and the initial DEVICE_DELETED seems to be ~6 seconds in practice. We've
   seen host dumps with Mellanox CX4 VFs being rebound to host driver during
   this period (on powernv hosts).

Patches 1-4 address 1) by deferring rebinding of a hostdev to the host driver
until all the devices in the group have been detached, at which point all
the hostdevs are rebound as a group. Until that point, the devices are traced
by the drvManager's inactiveList in a similar manner to hostdevs that are
assigned to VFIO via the nodedev-detach interface.

Patch 5 addresses 2) by adding an additional check that, when the last device
from a group is detached, polls /proc for open FDs referencing the VFIO group
path in /dev/vfio/<iommu_group> and waiting for the FD to be closed. If we
time out, we abandon rebinding the hostdevs back to the host.

There are a couple alternatives to Patch 5 that might be worth considering:

a) Add a DEVICE_FINALIZED event to QEMU and wait for that instead of
   DEVICE_DELETED. Paired with patches 1-4 this would let us drop patch 5 in
   favor of minimal changes to libvirt's event handlers.

   The downsides are:
    - that we'd incur some latency for all device-detach calls, but it's not
      apparent to me whether this delay is significant for anything outside
      of VFIO.
    - there may be cases where finalization after DEVICE_DELETE/unparent are
      is not guaranteed, and I'm not sure QOM would encourage such
      expectations even if that's currently the case.

b) Add a GROUP_DELETED event to VFIO's finalize callback. This is the most
   direct solution. With this we could completely separate out the handling
   of rebinding to host driver based on receival of this event.

   The downsides are:
    - this would only work for newer versions of QEMU, though we could use
      the poll-wait in patch 5 as a fallback.
    - synchronizing sync/async device-detach threads with sync/async
      handlers for this would be a bit hairy, but I have a WIP in progress
      that seems *fairly reasonable*

c) Take the approach in Patch 5, either as a precursor to implementing b) or
   something else, or just sticking with that for now.

d) ???

Personally I'm leaning toward c), but I'm wondering if that's "good enough"
for now, or if we should pursue something more robust from the start, or
perhaps something else entirely?

Any feedback is greatly appreciated!

 src/libvirt_private.syms |   5 ++
 src/qemu/qemu_hostdev.c  |  16 +++++
 src/qemu/qemu_hostdev.h  |   4 ++
 src/qemu/qemu_hotplug.c  |  56 ++++++++++++++----
 src/util/virfile.c       |  52 +++++++++++++++++
 src/util/virfile.h       |   1 +
 src/util/virhostdev.c    | 173 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
 src/util/virhostdev.h    |  16 +++++
 src/util/virpci.c        |  69 ++++++++++++++++++----
 src/util/virpci.h        |   4 ++
 10 files changed, 360 insertions(+), 36 deletions(-)

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Re: [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host

Posted by Daniel P. Berrange 8 years, 7 months ago

On Wed, Jun 28, 2017 at 07:24:55PM -0500, Michael Roth wrote:
> Hi everyone. Hoping to get some feedback on this approach, or some
> alternatives proposed below, to the following issue:
> 
> Currently libvirt immediately attempts to rebind a managed device back to the
> host driver when it receives a DEVICE_DELETED event from QEMU. This is
> problematic for 2 reasons:
> 
> 1) If multiple devices from a group are attached to a guest, this can move
>    the group into a "non-viable" state where some devices are assigned to
>    the host and some to the guest.
> 
> 2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
>    where additional cleanup occurs. In most cases libvirt can ignore this
>    cleanup, but in the case of VFIO devices this is where closing of a VFIO
>    group FD occurs, and failing to wait before rebinding the device to the
>    host driver can result in unexpected behavior. In the case of powernv
>    hosts at least, this can lead to a host driver crashing due to the default
>    DMA windows not having been fully-restored yet. The window between this is
>    and the initial DEVICE_DELETED seems to be ~6 seconds in practice. We've
>    seen host dumps with Mellanox CX4 VFs being rebound to host driver during
>    this period (on powernv hosts).

Why on earth does QEMU's device finalization take 6 seconds to complete.
That feels very broken to me, unless QEMU is not being schedled due to
host being overcomitted. If that's not the case, then we have a bug to
investigate in QEMU to find out why cleanup is delayed so long.


From libvirt's POV, we consider 'DEVICE_DELETED' as meaning both that the
frontend has gone *and* the corresponding backend has gone. Aside from
cleaning the VFIO group, we use this as a trigger for all other device
related cleanup like SELinux labelling, cgroup device ACLs, etc. If the
backend is not guaranteed to be closed in QEMU when this emit is emitted
then either QEMU needs to delay the event until it is really cleaned up,
or QEMU needs to add a further event to emit when the backend is clean.


> Patches 1-4 address 1) by deferring rebinding of a hostdev to the host driver
> until all the devices in the group have been detached, at which point all
> the hostdevs are rebound as a group. Until that point, the devices are traced
> by the drvManager's inactiveList in a similar manner to hostdevs that are
> assigned to VFIO via the nodedev-detach interface.
> 
> Patch 5 addresses 2) by adding an additional check that, when the last device
> from a group is detached, polls /proc for open FDs referencing the VFIO group
> path in /dev/vfio/<iommu_group> and waiting for the FD to be closed. If we
> time out, we abandon rebinding the hostdevs back to the host.

That is just gross - it is tieing libvirt to details of the QEMU internal
implementation. I really don't think we should be doing that. So NACK to
this from my POV.

> There are a couple alternatives to Patch 5 that might be worth considering:
> 
> a) Add a DEVICE_FINALIZED event to QEMU and wait for that instead of
>    DEVICE_DELETED. Paired with patches 1-4 this would let us drop patch 5 in
>    favor of minimal changes to libvirt's event handlers.
> 
>    The downsides are:
>     - that we'd incur some latency for all device-detach calls, but it's not
>       apparent to me whether this delay is significant for anything outside
>       of VFIO.
>     - there may be cases where finalization after DEVICE_DELETE/unparent are
>       is not guaranteed, and I'm not sure QOM would encourage such
>       expectations even if that's currently the case.
> 
> b) Add a GROUP_DELETED event to VFIO's finalize callback. This is the most
>    direct solution. With this we could completely separate out the handling
>    of rebinding to host driver based on receival of this event.
> 
>    The downsides are:
>     - this would only work for newer versions of QEMU, though we could use
>       the poll-wait in patch 5 as a fallback.
>     - synchronizing sync/async device-detach threads with sync/async
>       handlers for this would be a bit hairy, but I have a WIP in progress
>       that seems *fairly reasonable*
> 
> c) Take the approach in Patch 5, either as a precursor to implementing b) or
>    something else, or just sticking with that for now.
> 
> d) ???

Fix DEVICE_DELETE so its only emitted when the backend associated with
the device is fully cleaned up.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [Qemu-devel] [RFC PATCH 0/5] hotplug: fix premature rebinding of VFIO devices to host

Posted by Michael Roth 8 years, 7 months ago

Quoting Daniel P. Berrange (2017-06-29 03:33:19)
> On Wed, Jun 28, 2017 at 07:24:55PM -0500, Michael Roth wrote:
> > Hi everyone. Hoping to get some feedback on this approach, or some
> > alternatives proposed below, to the following issue:
> > 
> > Currently libvirt immediately attempts to rebind a managed device back to the
> > host driver when it receives a DEVICE_DELETED event from QEMU. This is
> > problematic for 2 reasons:
> > 
> > 1) If multiple devices from a group are attached to a guest, this can move
> >    the group into a "non-viable" state where some devices are assigned to
> >    the host and some to the guest.
> > 
> > 2) When QEMU emits the DEVICE_DELETED event, there's still a "finalize" phase
> >    where additional cleanup occurs. In most cases libvirt can ignore this
> >    cleanup, but in the case of VFIO devices this is where closing of a VFIO
> >    group FD occurs, and failing to wait before rebinding the device to the
> >    host driver can result in unexpected behavior. In the case of powernv
> >    hosts at least, this can lead to a host driver crashing due to the default
> >    DMA windows not having been fully-restored yet. The window between this is
> >    and the initial DEVICE_DELETED seems to be ~6 seconds in practice. We've
> >    seen host dumps with Mellanox CX4 VFs being rebound to host driver during
> >    this period (on powernv hosts).
> 
> Why on earth does QEMU's device finalization take 6 seconds to complete.
> That feels very broken to me, unless QEMU is not being schedled due to
> host being overcomitted. If that's not the case, then we have a bug to
> investigate in QEMU to find out why cleanup is delayed so long.

In this particular case QEMU starts finalization almost immediately
after the DEVICE_DELETED, but it looks like most of the time between that and
closing of the group FD is spent in the kernel. Here's what it looks
like from QEMU with vfio*/monitor* traces enabled (in this case
unplugging a Mellanox CX3 PF):

# device_del called
...
# vfio device's device_unparent() called:
  # unrealize->exit->vfio_exitfn called:
61948@1498759308.951038:vfio_intx_disable  (0002:01:00.0)
61948@1498759308.953657:vfio_region_exit Device 0002:01:00.0, region 0
61948@1498759308.954532:vfio_region_exit Device 0002:01:00.0, region 2
  # unrealize->exit->vfio_exitfn returns, DEVICE_DELETED sent:
61948@1498759308.954633:monitor_protocol_event_queue event=9 data=0x3fff6c508690 rate=0
61948@1498759308.954669:monitor_protocol_event_emit event=9 data=0x3fff6c508690
  # last unref of vfio device, vfio_instance_finalize() called:
61948@1498759308.955660:vfio_region_finalize Device 0002:01:00.0, region 0
61948@1498759308.955742:vfio_region_finalize Device 0002:01:00.0, region 2
61948@1498759308.955751:vfio_put_base_device close vdev->fd=30

    # close(VFIODevice.fd) <- 5 SECOND DELAY

61948@1498759313.140515:vfio_put_base_device_completed close completed vdev->fd=30
61948@1498759313.181124:vfio_disconnect_container close container->fd=102
61948@1498759313.181152:vfio_put_group close group->fd=101
    # close(VFIOGroup.fd)
61948@1498759313.181157:vfio_put_group_completed close completed group->fd=101
  # vfio_instance_finalize() returns
# vfio device's device_unparent() returns

I poked around in the VFIO group close path and figured restoring ownership
of IOMMU to the host via vfio_iommu_driver_ops.release() (via
close(groupfd) was where all the time was spent, but didn't expect it to
be the close(VFIODevice.fd). Maybe Alexey/Alex have a better idea. I'll
look into it more as well.

But suffice to say there's not much QEMU can do about the delay other
than moving deferring the DEVICE_DELETED event or adding a later-stage
event.

> 
> 
> From libvirt's POV, we consider 'DEVICE_DELETED' as meaning both that the
> frontend has gone *and* the corresponding backend has gone. Aside from
> cleaning the VFIO group, we use this as a trigger for all other device
> related cleanup like SELinux labelling, cgroup device ACLs, etc. If the
> backend is not guaranteed to be closed in QEMU when this emit is emitted
> then either QEMU needs to delay the event until it is really cleaned up,
> or QEMU needs to add a further event to emit when the backend is clean.
> 
> 
> > Patches 1-4 address 1) by deferring rebinding of a hostdev to the host driver
> > until all the devices in the group have been detached, at which point all
> > the hostdevs are rebound as a group. Until that point, the devices are traced
> > by the drvManager's inactiveList in a similar manner to hostdevs that are
> > assigned to VFIO via the nodedev-detach interface.
> > 
> > Patch 5 addresses 2) by adding an additional check that, when the last device
> > from a group is detached, polls /proc for open FDs referencing the VFIO group
> > path in /dev/vfio/<iommu_group> and waiting for the FD to be closed. If we
> > time out, we abandon rebinding the hostdevs back to the host.
> 
> That is just gross - it is tieing libvirt to details of the QEMU internal
> implementation. I really don't think we should be doing that. So NACK to
> this from my POV.
> 
> > There are a couple alternatives to Patch 5 that might be worth considering:
> > 
> > a) Add a DEVICE_FINALIZED event to QEMU and wait for that instead of
> >    DEVICE_DELETED. Paired with patches 1-4 this would let us drop patch 5 in
> >    favor of minimal changes to libvirt's event handlers.
> > 
> >    The downsides are:
> >     - that we'd incur some latency for all device-detach calls, but it's not
> >       apparent to me whether this delay is significant for anything outside
> >       of VFIO.
> >     - there may be cases where finalization after DEVICE_DELETE/unparent are
> >       is not guaranteed, and I'm not sure QOM would encourage such
> >       expectations even if that's currently the case.
> > 
> > b) Add a GROUP_DELETED event to VFIO's finalize callback. This is the most
> >    direct solution. With this we could completely separate out the handling
> >    of rebinding to host driver based on receival of this event.
> > 
> >    The downsides are:
> >     - this would only work for newer versions of QEMU, though we could use
> >       the poll-wait in patch 5 as a fallback.
> >     - synchronizing sync/async device-detach threads with sync/async
> >       handlers for this would be a bit hairy, but I have a WIP in progress
> >       that seems *fairly reasonable*
> > 
> > c) Take the approach in Patch 5, either as a precursor to implementing b) or
> >    something else, or just sticking with that for now.
> > 
> > d) ???
> 
> Fix DEVICE_DELETE so its only emitted when the backend associated with
> the device is fully cleaned up.

Need to explore same concerns I had WRT to DEVICE_FINALIZED above, but
assuming those aren't an issue this would indeed makes things even
simpler. Will look into this more.

Would patch 5 be out of the question even as a fallback for downlevel
QEMUs? Or is that scenario too unlikely to warrant it?

Thanks!

> 
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>