[Qemu-devel] [PATCH v2 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt

Michael Roth posted 3 patches 6 years, 6 months ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20171009170607.4155-1-mdroth@linux.vnet.ibm.com
There is a newer version of this series
hw/core/qdev.c         | 31 ++++++++++++++++++++-----------
include/hw/qdev-core.h |  1 +
2 files changed, 21 insertions(+), 11 deletions(-)
This series was motivated by the discussion in this thread:

  https://www.redhat.com/archives/libvir-list/2017-June/msg01370.html

The issue this series addresses is that when libvirt unplugs a VFIO PCI device,
it may attempt to bind the host device back to the host driver when QEMU emits
the DEVICE_DELETED event for the corresponding vfio-pci device. However, the
VFIO group FD is not actually cleaned up until the vfio-pci device is
*finalized* by QEMU, whereas the event is emitted earlier, during
device_unparent. Depending on the host device and how long certain operations
like resetting the device might take, this can result in libvirt trying to
rebind the device back to the host while it is still in use by VFIO, leading to
host crashes or other unexpected behavior.

In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
quiesced by vfio-pci's finalize() routine until up to 6s after the
DEVICE_DELETED event was emitted, leading to detach-device on the libvirt side
pretty much always crashing the host.

Moving the event to finalize requires 2 prereqs to ensure the same information
is available when the DEVICE_DELETED event is finally emitted:

1) Storing the device's path in the composition tree, since the device is
   already "disconnected" from the tree by the time the event is emitted.
   This is addressed by PATCH 1, which was plucked from another pending
   series from Greg Kurz:

   https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg07922.html

2) Deferring qemu_opts_del of the DeviceState->QemuOpts till finalize, since
   that is where DeviceState->id is stored. This was actually how it was
   done in the past, so PATCH 2 simply reverts the change which moved it to
   device_unparent.

From there it's just a mechanical move of the event from device_unparent to
device_finalize.
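
To make the ordering concrete, here is a toy model of the flow described
above. It is only a sketch and does not use QEMU's actual code: ToyDevice,
the toy_* functions and the globals are all hypothetical stand-ins, and the
"/machine/peripheral/<id>" path is just an illustrative value for the cached
path. The point is that unparent only caches the path and drops the parent's
reference, while the "event" is recorded from finalize, after the backend
resource (standing in for the VFIO group FD) has been released.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for DeviceState; every name in this sketch is hypothetical. */
typedef struct ToyDevice {
    char *id;              /* stand-in for the id kept in QemuOpts */
    char *canonical_path;  /* cached while the device is still in the tree */
    int refcount;
} ToyDevice;

int toy_backend_open;                 /* stand-in for the VFIO group FD */
int toy_event_saw_backend_open = -1;  /* backend state seen at emit time */
char toy_last_event[128];             /* last "event"; empty if none yet */

static char *toy_strdup(const char *s)
{
    char *copy = malloc(strlen(s) + 1);
    assert(copy != NULL);
    strcpy(copy, s);
    return copy;
}

ToyDevice *toy_device_new(const char *id)
{
    ToyDevice *dev = calloc(1, sizeof(*dev));
    assert(dev != NULL);
    dev->id = toy_strdup(id);
    dev->refcount = 1;                /* reference held by the parent */
    toy_backend_open = 1;
    return dev;
}

/* Record the "event" and what the backend state was when it fired. */
static void toy_emit_device_deleted(const char *id, const char *path)
{
    toy_event_saw_backend_open = toy_backend_open;
    snprintf(toy_last_event, sizeof(toy_last_event),
             "DEVICE_DELETED id=%s path=%s", id, path ? path : "");
}

static void toy_device_finalize(ToyDevice *dev)
{
    toy_backend_open = 0;             /* release host resources first */
    /* id and path are still valid here because they are freed last */
    toy_emit_device_deleted(dev->id, dev->canonical_path);
    free(dev->canonical_path);
    free(dev->id);
    free(dev);
}

void toy_device_ref(ToyDevice *dev)
{
    dev->refcount++;
}

void toy_device_unref(ToyDevice *dev)
{
    if (--dev->refcount == 0) {
        toy_device_finalize(dev);
    }
}

void toy_device_unparent(ToyDevice *dev)
{
    size_t len = strlen("/machine/peripheral/") + strlen(dev->id) + 1;

    /* Cache the path now; it cannot be looked up after unparenting. */
    dev->canonical_path = malloc(len);
    assert(dev->canonical_path != NULL);
    snprintf(dev->canonical_path, len, "/machine/peripheral/%s", dev->id);

    /* Drop the parent's reference. No event is emitted at this point. */
    toy_device_unref(dev);
}
```

If something else still holds a reference when toy_device_unparent() is
called, toy_last_event stays empty until the last toy_device_unref(), which
mirrors the window in which libvirt previously saw the event too early.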

Since this was originally posted, a kernel fix was merged to address the race
on the kernel side (6586b561), but it would still be good to fix this on the
QEMU side for older host kernels and for clearer semantics on the
libvirt/management side.

v2:
 - rebased on master
 - fixed up inaccurate comment in PATCH 1 (Eric)
