[Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt

Michael Roth posted 3 patches 6 years, 8 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/1501119055-4060-1-git-send-email-mdroth@linux.vnet.ibm.com
Test FreeBSD passed
Test checkpatch passed
Test s390x passed
There is a newer version of this series
hw/core/qdev.c         | 30 +++++++++++++++++++-----------
include/hw/qdev-core.h |  1 +
2 files changed, 20 insertions(+), 11 deletions(-)
[Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by Michael Roth 6 years, 8 months ago
This series was motivated by the discussion in this thread:

  https://www.redhat.com/archives/libvir-list/2017-June/msg01370.html

The issue this series addresses is that when libvirt unplugs a VFIO PCI device,
it may attempt to bind the host device back to the host driver when QEMU emits
the DEVICE_DELETED event for the corresponding vfio-pci device. However, the
VFIO group FD is not actually cleaned up until vfio-pci device is *finalized*
by QEMU, whereas the event is emitted earlier during device_unparent.
Depending on the host device and how long certain operations like resetting the
device might take, this can in result in libvirt trying to rebind the device
back to the host while it is still in use by VFIO, leading to host crashes or
other unexpected behavior.

In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
quiesced by vfio-pci's finalize() routine until up to 6s after the
DEVICE_DELETED was emitted, leading to detach-device on the libvirt side pretty
much always crashing the host.

Implementing this change requires 2 prereqs to ensure the same information is
available when the DEVICE_DELETED is finally emitted:

1) Storing the path in the composition patch, which is addressed by PATCH 1,
   which was plucked from another pending series from Greg Kurz:

   https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg07922.html

   since we are now "disconnected" at the time the event is emitted, and

2) Deferring qemu_opts_del of the DeviceState->QemuOpts till finalize, since
   that is where DeviceState->id is stored. This was actually how it was
   done in the past, so PATCH 2 simply reverts the change which moved it to
   device_unparent.

From there it's just a mechanical move of the event from device_unparent to
device_finalize.

 hw/core/qdev.c         | 30 +++++++++++++++++++-----------
 include/hw/qdev-core.h |  1 +
 2 files changed, 20 insertions(+), 11 deletions(-)


Re: [Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by Auger Eric 6 years, 8 months ago
Hi Michael,

On 27/07/2017 03:30, Michael Roth wrote:
> This series was motivated by the discussion in this thread:
> 
>   https://www.redhat.com/archives/libvir-list/2017-June/msg01370.html
> 
> The issue this series addresses is that when libvirt unplugs a VFIO PCI device,
> it may attempt to bind the host device back to the host driver when QEMU emits
> the DEVICE_DELETED event for the corresponding vfio-pci device. However, the
> VFIO group FD is not actually cleaned up until vfio-pci device is *finalized*
> by QEMU, whereas the event is emitted earlier during device_unparent.
> Depending on the host device and how long certain operations like resetting the
> device might take, this can in result in libvirt trying to rebind the device
> back to the host while it is still in use by VFIO, leading to host crashes or
> other unexpected behavior.
> 
> In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
> quiesced by vfio-pci's finalize() routine until up to 6s after the
> DEVICE_DELETED was emitted, leading to detach-device on the libvirt side pretty
> much always crashing the host.
> 
> Implementing this change requires 2 prereqs to ensure the same information is
> available when the DEVICE_DELETED is finally emitted:
> 
> 1) Storing the path in the composition patch, which is addressed by PATCH 1,
>    which was plucked from another pending series from Greg Kurz:
> 
>    https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg07922.html
> 
>    since we are now "disconnected" at the time the event is emitted, and
> 
> 2) Deferring qemu_opts_del of the DeviceState->QemuOpts till finalize, since
>    that is where DeviceState->id is stored. This was actually how it was
>    done in the past, so PATCH 2 simply reverts the change which moved it to
>    device_unparent.
> 
> From there it's just a mechanical move of the event from device_unparent to
> device_finalize.
> 
>  hw/core/qdev.c         | 30 +++++++++++++++++++-----------
>  include/hw/qdev-core.h |  1 +
>  2 files changed, 20 insertions(+), 11 deletions(-)
> 
> 

Tested-by: Eric Auger <eric.auger@redhat.com>


Without the series I get the following warning when detaching a pci host
dev from virt-manager:
2017-08-09T14:28:44.825925Z qemu-system-aarch64: Failed to remove group
53 from KVM VFIO device: No such file or directory

as KVM_DEV_VFIO_GROUP_DEL returns -ENOENT due to the fact the fd for the
VFIO group does not exist anymore as libvirt already started to unbind
the device from vfio-pci at vfio_instance_finalize() time.

With the series I don't see the message anymore.

Thanks

Eric




Re: [Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by Michael Roth 6 years, 6 months ago
Quoting Michael Roth (2017-07-26 20:30:52)
> This series was motivated by the discussion in this thread:
> 
>   https://www.redhat.com/archives/libvir-list/2017-June/msg01370.html
> 
> The issue this series addresses is that when libvirt unplugs a VFIO PCI device,
> it may attempt to bind the host device back to the host driver when QEMU emits
> the DEVICE_DELETED event for the corresponding vfio-pci device. However, the
> VFIO group FD is not actually cleaned up until vfio-pci device is *finalized*
> by QEMU, whereas the event is emitted earlier during device_unparent.
> Depending on the host device and how long certain operations like resetting the
> device might take, this can in result in libvirt trying to rebind the device
> back to the host while it is still in use by VFIO, leading to host crashes or
> other unexpected behavior.
> 
> In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
> quiesced by vfio-pci's finalize() routine until up to 6s after the
> DEVICE_DELETED was emitted, leading to detach-device on the libvirt side pretty
> much always crashing the host.
> 
> Implementing this change requires 2 prereqs to ensure the same information is
> available when the DEVICE_DELETED is finally emitted:
> 
> 1) Storing the path in the composition patch, which is addressed by PATCH 1,
>    which was plucked from another pending series from Greg Kurz:
> 
>    https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg07922.html
> 
>    since we are now "disconnected" at the time the event is emitted, and
> 
> 2) Deferring qemu_opts_del of the DeviceState->QemuOpts till finalize, since
>    that is where DeviceState->id is stored. This was actually how it was
>    done in the past, so PATCH 2 simply reverts the change which moved it to
>    device_unparent.
> 
> From there it's just a mechanical move of the event from device_unparent to
> device_finalize.

Ping.

The situation has changed somewhat since original posting as Alex now
has a fix on the kernel side for the VFIO issue noted above:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6586b561a91cd80a91c8f107ed0d144feb3eadc2

However, I think this series would still be useful for addressing the
issue for hosts using older kernels, and there seems to be general
interest from the libvirt side in aligning "DEVICE_DELETED" events
with the notion that QEMU is completely finished with a device.

Patch 1/3 is also needed for another series that Greg is working on
and I don't want to hold that up for this series, so if it's
preferred that we just post that patch separately in the meantime
please let me know.

Thanks!

> 
>  hw/core/qdev.c         | 30 +++++++++++++++++++-----------
>  include/hw/qdev-core.h |  1 +
>  2 files changed, 20 insertions(+), 11 deletions(-)
> 
> 


Re: [Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by David Gibson 6 years, 6 months ago
On Tue, Oct 03, 2017 at 05:21:57PM -0500, Michael Roth wrote:
> Quoting Michael Roth (2017-07-26 20:30:52)
> > This series was motivated by the discussion in this thread:
> > 
> >   https://www.redhat.com/archives/libvir-list/2017-June/msg01370.html
> > 
> > The issue this series addresses is that when libvirt unplugs a VFIO PCI device,
> > it may attempt to bind the host device back to the host driver when QEMU emits
> > the DEVICE_DELETED event for the corresponding vfio-pci device. However, the
> > VFIO group FD is not actually cleaned up until vfio-pci device is *finalized*
> > by QEMU, whereas the event is emitted earlier during device_unparent.
> > Depending on the host device and how long certain operations like resetting the
> > device might take, this can in result in libvirt trying to rebind the device
> > back to the host while it is still in use by VFIO, leading to host crashes or
> > other unexpected behavior.
> > 
> > In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
> > quiesced by vfio-pci's finalize() routine until up to 6s after the
> > DEVICE_DELETED was emitted, leading to detach-device on the libvirt side pretty
> > much always crashing the host.
> > 
> > Implementing this change requires 2 prereqs to ensure the same information is
> > available when the DEVICE_DELETED is finally emitted:
> > 
> > 1) Storing the path in the composition patch, which is addressed by PATCH 1,
> >    which was plucked from another pending series from Greg Kurz:
> > 
> >    https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg07922.html
> > 
> >    since we are now "disconnected" at the time the event is emitted, and
> > 
> > 2) Deferring qemu_opts_del of the DeviceState->QemuOpts till finalize, since
> >    that is where DeviceState->id is stored. This was actually how it was
> >    done in the past, so PATCH 2 simply reverts the change which moved it to
> >    device_unparent.
> > 
> > From there it's just a mechanical move of the event from device_unparent to
> > device_finalize.
> 
> Ping.
> 
> The situation has changed somewhat since original posting as Alex now
> has a fix on the kernel side for the VFIO issue noted above:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6586b561a91cd80a91c8f107ed0d144feb3eadc2
> 
> However, I think this series would still be useful for addressing the
> issue for hosts using older kernels, and there seems to be general
> interest from the libvirt side in aligning "DEVICE_DELETED" events
> with the notion that QEMU is completely finished with a device.
> 
> Patch 1/3 is also needed for another series that Greg is working on
> and I don't want to hold that up for this series, so if it's
> preferred that we just post that patch separately in the meantime
> please let me know.

Yeah, I'd been meaning to ask you about this.  I think it just got
lost in the rush of things for qemu-2.10.  I think it would be worth
rebasing and reposting.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by David Gibson 6 years, 6 months ago
On Tue, Oct 03, 2017 at 05:21:57PM -0500, Michael Roth wrote:
> Quoting Michael Roth (2017-07-26 20:30:52)
> > This series was motivated by the discussion in this thread:
> > 
> >   https://www.redhat.com/archives/libvir-list/2017-June/msg01370.html
> > 
> > The issue this series addresses is that when libvirt unplugs a VFIO PCI device,
> > it may attempt to bind the host device back to the host driver when QEMU emits
> > the DEVICE_DELETED event for the corresponding vfio-pci device. However, the
> > VFIO group FD is not actually cleaned up until vfio-pci device is *finalized*
> > by QEMU, whereas the event is emitted earlier during device_unparent.
> > Depending on the host device and how long certain operations like resetting the
> > device might take, this can in result in libvirt trying to rebind the device
> > back to the host while it is still in use by VFIO, leading to host crashes or
> > other unexpected behavior.
> > 
> > In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
> > quiesced by vfio-pci's finalize() routine until up to 6s after the
> > DEVICE_DELETED was emitted, leading to detach-device on the libvirt side pretty
> > much always crashing the host.
> > 
> > Implementing this change requires 2 prereqs to ensure the same information is
> > available when the DEVICE_DELETED is finally emitted:
> > 
> > 1) Storing the path in the composition patch, which is addressed by PATCH 1,
> >    which was plucked from another pending series from Greg Kurz:
> > 
> >    https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg07922.html
> > 
> >    since we are now "disconnected" at the time the event is emitted, and
> > 
> > 2) Deferring qemu_opts_del of the DeviceState->QemuOpts till finalize, since
> >    that is where DeviceState->id is stored. This was actually how it was
> >    done in the past, so PATCH 2 simply reverts the change which moved it to
> >    device_unparent.
> > 
> > From there it's just a mechanical move of the event from device_unparent to
> > device_finalize.
> 
> Ping.

I think it got forgotten amongst the rush to get v2.10 out.

> The situation has changed somewhat since original posting as Alex now
> has a fix on the kernel side for the VFIO issue noted above:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6586b561a91cd80a91c8f107ed0d144feb3eadc2
> 
> However, I think this series would still be useful for addressing the
> issue for hosts using older kernels, and there seems to be general
> interest from the libvirt side in aligning "DEVICE_DELETED" events
> with the notion that QEMU is completely finished with a device.
> 
> Patch 1/3 is also needed for another series that Greg is working on
> and I don't want to hold that up for this series, so if it's
> preferred that we just post that patch separately in the meantime
> please let me know.

Yeah, now that I've understood the issue properly, I also think this
is a good idea.  I recommend reposting.


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by Paolo Bonzini 6 years, 6 months ago
On 06/10/2017 12:23, David Gibson wrote:
> On Tue, Oct 03, 2017 at 05:21:57PM -0500, Michael Roth wrote:
>> Patch 1/3 is also needed for another series that Greg is working on
>> and I don't want to hold that up for this series, so if it's
>> preferred that we just post that patch separately in the meantime
>> please let me know.
> 
> Yeah, now that I've understood the issue properly, I also think this
> is a good idea.  I recommend reposting.

Yes, please do!

Thanks,

Paolo

Re: [Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by Peter Maydell 6 years, 8 months ago
On 27 July 2017 at 02:30, Michael Roth <mdroth@linux.vnet.ibm.com> wrote:
> In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
> quiesced by vfio-pci's finalize() routine until up to 6s after the
> DEVICE_DELETED was emitted, leading to detach-device on the libvirt side pretty
> much always crashing the host.

My initial naive thought is that if the host kernel can crash then
this is a host kernel bug... shouldn't the host kernel refuse
the subsequent libvirt rebind if it would cause a crash ?

thanks
-- PMM

Re: [Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by David Gibson 6 years, 8 months ago
On Thu, Jul 27, 2017 at 10:11:48AM +0100, Peter Maydell wrote:
> On 27 July 2017 at 02:30, Michael Roth <mdroth@linux.vnet.ibm.com> wrote:
> > In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
> > quiesced by vfio-pci's finalize() routine until up to 6s after the
> > DEVICE_DELETED was emitted, leading to detach-device on the libvirt side pretty
> > much always crashing the host.
> 
> My initial naive thought is that if the host kernel can crash then
> this is a host kernel bug... shouldn't the host kernel refuse
> the subsequent libvirt rebind if it would cause a crash ?

I think so too, but I haven't been able to convince Alex.  Nor
find time to fix it in the kernel myself.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by Daniel P. Berrange 6 years, 8 months ago
On Thu, Jul 27, 2017 at 08:53:48PM +1000, David Gibson wrote:
> On Thu, Jul 27, 2017 at 10:11:48AM +0100, Peter Maydell wrote:
> > On 27 July 2017 at 02:30, Michael Roth <mdroth@linux.vnet.ibm.com> wrote:
> > > In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
> > > quiesced by vfio-pci's finalize() routine until up to 6s after the
> > > DEVICE_DELETED was emitted, leading to detach-device on the libvirt side pretty
> > > much always crashing the host.
> > 
> > My initial naive thought is that if the host kernel can crash then
> > this is a host kernel bug... shouldn't the host kernel refuse
> > the subsequent libvirt rebind if it would cause a crash ?
> 
> I think so too, but I haven't been able to convince Alex.  Nor
> find time to fix it in the kernel myself.

I think we need to fix both the QEMU premature sending of DEVICE_DELETED
and the kernel bug that allowed the crash.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by Alex Williamson 6 years, 8 months ago
On Thu, 27 Jul 2017 12:50:42 +0100
"Daniel P. Berrange" <berrange@redhat.com> wrote:

> On Thu, Jul 27, 2017 at 08:53:48PM +1000, David Gibson wrote:
> > On Thu, Jul 27, 2017 at 10:11:48AM +0100, Peter Maydell wrote:  
> > > On 27 July 2017 at 02:30, Michael Roth <mdroth@linux.vnet.ibm.com> wrote:  
> > > > In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
> > > > quiesced by vfio-pci's finalize() routine until up to 6s after the
> > > > DEVICE_DELETED was emitted, leading to detach-device on the libvirt side pretty
> > > > much always crashing the host.  
> > > 
> > > My initial naive thought is that if the host kernel can crash then
> > > this is a host kernel bug... shouldn't the host kernel refuse
> > > the subsequent libvirt rebind if it would cause a crash ?  
> > 
> > I think so too, but I haven't been able to convince Alex.  Nor
> > find time to fix it in the kernel myself.  
> 
> I think we need to fix both the QEMU premature sending of DEVICE_DELETED
> and the kernel bug that allowed the crash.


Where do we stand on this for v2.10?  I'd like to see it get in.  There
may be things to fix in the kernel, some of them may already be fixed
in the latest development kernel, but ultimately the kernel considers
driver binding to be a trusted operation and if userspace doesn't
understand all the dependencies, they shouldn't be doing it.  In this
case libvirt is using the DEVICE_DELETED signal with the assumption
that the device has been fully released by QEMU, which is of course not
accurate (libvirt could test this, but chooses not to).  libvirt
therefore begins trying to unbind a device that is still in use, we try
to handle it, but see official kernel stance that userspace is
responsible for understanding device dependencies, so we can only do so
much.

IMO, the next step along those lines would be that libvirt needs to
understand that even once a device is fully released from QEMU, it's
not necessarily safe to re-bind the device to a host driver.  If the
device is a member of a group where other devices are still in use by
userspace, this will violate user/host device isolation and the kernel
will crash to protect itself.  At best I may be able to improve this to
killing the userspace process making use of the conflicting device, but
the kernel view is that userspace (libvirt) has mandated to bind the
device to the host driver and we must make it so, the user is
responsible for the consequences.  Thanks,

Alex

Re: [Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by David Gibson 6 years, 8 months ago
On Tue, Aug 08, 2017 at 01:40:08PM -0600, Alex Williamson wrote:
> On Thu, 27 Jul 2017 12:50:42 +0100
> "Daniel P. Berrange" <berrange@redhat.com> wrote:
> 
> > On Thu, Jul 27, 2017 at 08:53:48PM +1000, David Gibson wrote:
> > > On Thu, Jul 27, 2017 at 10:11:48AM +0100, Peter Maydell wrote:  
> > > > On 27 July 2017 at 02:30, Michael Roth <mdroth@linux.vnet.ibm.com> wrote:  
> > > > > In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
> > > > > quiesced by vfio-pci's finalize() routine until up to 6s after the
> > > > > DEVICE_DELETED was emitted, leading to detach-device on the libvirt side pretty
> > > > > much always crashing the host.  
> > > > 
> > > > My initial naive thought is that if the host kernel can crash then
> > > > this is a host kernel bug... shouldn't the host kernel refuse
> > > > the subsequent libvirt rebind if it would cause a crash ?  
> > > 
> > > I think so too, but I haven't been able to convince Alex.  Nor
> > > find time to fix it in the kernel myself.  
> > 
> > I think we need to fix both the QEMU premature sending of DEVICE_DELETED
> > and the kernel bug that allowed the crash.
> 
> 
> Where do we stand on this for v2.10?  I'd like to see it get in.  There
> may be things to fix in the kernel, some of them may already be fixed
> in the latest development kernel, but ultimately the kernel considers
> driver binding to be a trusted operation and if userspace doesn't
> understand all the dependencies, they shouldn't be doing it.  In this
> case libvirt is using the DEVICE_DELETED signal with the assumption
> that the device has been fully released by QEMU, which is of course not
> accurate (libvirt could test this, but chooses not to).  libvirt
> therefore begins trying to unbind a device that is still in use, we try
> to handle it, but see official kernel stance that userspace is
> responsible for understanding device dependencies, so we can only do so
> much.
> 
> IMO, the next step along those lines would be that libvirt needs to
> understand that even once a device is fully released from QEMU, it's
> not necessarily safe to re-bind the device to a host driver.  If the
> device is a member of a group where other devices are still in use by
> userspace, this will violate user/host device isolation and the kernel
> will crash to protect itself.  At best I may be able to improve this to
> killing the userspace process making use of the conflicting device, but
> the kernel view is that userspace (libvirt) has mandated to bind the
> device to the host driver and we must make it so, the user is
> responsible for the consequences.  Thanks,

Merging it for 2.10 seems like a good idea to me to, but it's not
really my area of expertise, and therefore not my call.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Re: [Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by Greg Kurz 6 years, 7 months ago
On Wed, 9 Aug 2017 15:08:42 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Tue, Aug 08, 2017 at 01:40:08PM -0600, Alex Williamson wrote:
> > On Thu, 27 Jul 2017 12:50:42 +0100
> > "Daniel P. Berrange" <berrange@redhat.com> wrote:
> >   
> > > On Thu, Jul 27, 2017 at 08:53:48PM +1000, David Gibson wrote:  
> > > > On Thu, Jul 27, 2017 at 10:11:48AM +0100, Peter Maydell wrote:    
> > > > > On 27 July 2017 at 02:30, Michael Roth <mdroth@linux.vnet.ibm.com> wrote:    
> > > > > > In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
> > > > > > quiesced by vfio-pci's finalize() routine until up to 6s after the
> > > > > > DEVICE_DELETED was emitted, leading to detach-device on the libvirt side pretty
> > > > > > much always crashing the host.    
> > > > > 
> > > > > My initial naive thought is that if the host kernel can crash then
> > > > > this is a host kernel bug... shouldn't the host kernel refuse
> > > > > the subsequent libvirt rebind if it would cause a crash ?    
> > > > 
> > > > I think so too, but I haven't been able to convince Alex.  Nor
> > > > find time to fix it in the kernel myself.    
> > > 
> > > I think we need to fix both the QEMU premature sending of DEVICE_DELETED
> > > and the kernel bug that allowed the crash.  
> > 
> > 
> > Where do we stand on this for v2.10?  I'd like to see it get in.  There
> > may be things to fix in the kernel, some of them may already be fixed
> > in the latest development kernel, but ultimately the kernel considers
> > driver binding to be a trusted operation and if userspace doesn't
> > understand all the dependencies, they shouldn't be doing it.  In this
> > case libvirt is using the DEVICE_DELETED signal with the assumption
> > that the device has been fully released by QEMU, which is of course not
> > accurate (libvirt could test this, but chooses not to).  libvirt
> > therefore begins trying to unbind a device that is still in use, we try
> > to handle it, but see official kernel stance that userspace is
> > responsible for understanding device dependencies, so we can only do so
> > much.
> > 
> > IMO, the next step along those lines would be that libvirt needs to
> > understand that even once a device is fully released from QEMU, it's
> > not necessarily safe to re-bind the device to a host driver.  If the
> > device is a member of a group where other devices are still in use by
> > userspace, this will violate user/host device isolation and the kernel
> > will crash to protect itself.  At best I may be able to improve this to
> > killing the userspace process making use of the conflicting device, but
> > the kernel view is that userspace (libvirt) has mandated to bind the
> > device to the host driver and we must make it so, the user is
> > responsible for the consequences.  Thanks,  
> 
> Merging it for 2.10 seems like a good idea to me to, but it's not
> really my area of expertise, and therefore not my call.
> 

It looks like there was some kind of consensus on this series, but it
didn't make it for 2.10. Is there something more to discuss ?

Cheers,

--
Greg
Re: [Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by Michael Roth 6 years, 8 months ago
Quoting David Gibson (2017-07-27 05:53:48)
> On Thu, Jul 27, 2017 at 10:11:48AM +0100, Peter Maydell wrote:
> > On 27 July 2017 at 02:30, Michael Roth <mdroth@linux.vnet.ibm.com> wrote:
> > > In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
> > > quiesced by vfio-pci's finalize() routine until up to 6s after the
> > > DEVICE_DELETED was emitted, leading to detach-device on the libvirt side pretty
> > > much always crashing the host.
> > 
> > My initial naive thought is that if the host kernel can crash then
> > this is a host kernel bug... shouldn't the host kernel refuse
> > the subsequent libvirt rebind if it would cause a crash ?
> 
> I think so too, but I haven't been able to convince Alex.  Nor
> find time to fix it in the kernel myself.

In the thread I linked to Alex had mentioned he was pursuing something on the
kernel side, but my understanding what that we'd simply have the kernel fail
more gracefully when attempting to rebind in this situation.

But that still leaves the matter of libvirt failing to rebind the device to
the host. This series addresses that aspect of it, so I think the 2 approaches
are complementary.

> 
> -- 
> David Gibson                    | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
>                                 | _way_ _around_!
> http://www.ozlabs.org/~dgibson


Re: [Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by Alex Williamson 6 years, 8 months ago
On Thu, 27 Jul 2017 20:53:48 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Thu, Jul 27, 2017 at 10:11:48AM +0100, Peter Maydell wrote:
> > On 27 July 2017 at 02:30, Michael Roth <mdroth@linux.vnet.ibm.com> wrote:  
> > > In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
> > > quiesced by vfio-pci's finalize() routine until up to 6s after the
> > > DEVICE_DELETED was emitted, leading to detach-device on the libvirt side pretty
> > > much always crashing the host.  
> > 
> > My initial naive thought is that if the host kernel can crash then
> > this is a host kernel bug... shouldn't the host kernel refuse
> > the subsequent libvirt rebind if it would cause a crash ?  
> 
> I think so too, but I haven't been able to convince Alex.  Nor
> find time to fix it in the kernel myself.

It's not me you need to convince, it's GregKH[1].  That interpretation
is that the user bind request is a mandate and we'll fall over
ourselves to try to do as they ask.  I think the best I might be able
to do is to kill the QEMU process to avoid compromising the kernel
rather than killing the kernel after the isolation compromise has
occurred.  Messing with driver binding is a privileged operation, and
the kernel believes you get to keep all the pieces when it fails.
Sorry.  Thanks,

Alex

[1] https://lkml.org/lkml/2017/7/10/728

Re: [Qemu-devel] [PATCH for-2.10 0/3] qdev/vfio: defer DEVICE_DEL to avoid races with libvirt
Posted by David Gibson 6 years, 8 months ago
On Thu, Jul 27, 2017 at 08:47:33AM -0600, Alex Williamson wrote:
> On Thu, 27 Jul 2017 20:53:48 +1000
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Thu, Jul 27, 2017 at 10:11:48AM +0100, Peter Maydell wrote:
> > > On 27 July 2017 at 02:30, Michael Roth <mdroth@linux.vnet.ibm.com> wrote:  
> > > > In particular, Mellanox CX4 adapters on PowerNV hosts might not be fully
> > > > quiesced by vfio-pci's finalize() routine until up to 6s after the
> > > > DEVICE_DELETED was emitted, leading to detach-device on the libvirt side pretty
> > > > much always crashing the host.  
> > > 
> > > My initial naive thought is that if the host kernel can crash then
> > > this is a host kernel bug... shouldn't the host kernel refuse
> > > the subsequent libvirt rebind if it would cause a crash ?  
> > 
> > I think so too, but I haven't been able to convince Alex.  Nor
> > find time to fix it in the kernel myself.
> 
> It's not me you need to convince, it's GregKH[1].  That interpretation
> is that the user bind request is a mandate and we'll fall over
> ourselves to try to do as they ask.  I think the best I might be able
> to do is to kill the QEMU process to avoid compromising the kernel
> rather than killing the kernel after the isolation compromise has
> occurred.  Messing with driver binding is a privileged operation, and
> the kernel believes you get to keep all the pieces when it fails.
> Sorry.  Thanks,

Ah, my mistake.  Well, I guess I'll whinge at GregKH at some point.

Anyway, the basic point remains - I still think we should address this
in the kernel, but that's not going to happen soon.  So we're left
with addressing it in qemu and/or libvirt.  And as others have pointed
out, there are reasons to do that even if the kernel does get changed
to protect itself better here.

> 
> Alex
> 
> [1] https://lkml.org/lkml/2017/7/10/728
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson