[Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices

Jens Freimann posted 4 patches 4 years, 11 months ago
Test asan passed
Test docker-mingw@fedora passed
Test docker-clang@ubuntu passed
Test checkpatch passed
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20190517125820.2885-1-jfreimann@redhat.com
Maintainers: "Michael S. Tsirkin" <mst@redhat.com>, Alex Williamson <alex.williamson@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Jason Wang <jasowang@redhat.com>
There is a newer version of this series
[Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Jens Freimann 4 years, 11 months ago
This is another attempt at implementing the host side of the
net_failover concept
(https://www.kernel.org/doc/html/latest/networking/net_failover.html)

Changes since last RFC:
- work around the circular dependency of command-line options. Just add
  failover=on to the virtio-net standby options and reference it from the
  primary (vfio-pci) device with standby=<id>
- add patch 3/4 to allow migration of a vfio-pci device when it is part of a
  failover pair, while still disallowing it for all other devices
- add patch 4/4 to allow unplug of a device during migration, making an
  exception for failover primary devices. I'd like feedback on how to
  solve this more elegantly. I added a boolean to DeviceState that defaults
  to false for all devices except primary devices (see the sketch below).
- not tested yet with surprise removal
- I don't expect this to go in as it is, it still needs more testing, but
  I'd like to get feedback on the above-mentioned changes.
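
To illustrate the DeviceState change from patch 4/4, it is something along
these lines (field and helper names here are illustrative, not necessarily
what the patch uses):

    #include <stdbool.h>

    typedef struct DeviceState {
        /* ... existing qdev fields elided ... */
        bool allow_unplug_during_migration; /* false by default; set only
                                               for failover primaries */
    } DeviceState;

    /* hypothetical check in the hot-unplug path */
    static bool unplug_allowed(const DeviceState *dev, bool migration_active)
    {
        return !migration_active || dev->allow_unplug_during_migration;
    }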

The general idea is that we have a pair of devices, a vfio-pci and an
emulated device. Before migration the vfio device is unplugged and data
flows through the emulated device; on the target side another vfio-pci
device is plugged in to take over the data path. In the guest the
net_failover module will pair net devices with the same MAC address.

* In the first patch the infrastructure for hiding the device is added
  for the qbus and qdev APIs. 

* In the second patch virtio-net uses the API to defer adding the vfio
  device until the VIRTIO_NET_F_STANDBY feature is acked.
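
A rough sketch of that deferral (helper and field names are invented here;
the actual patch differs in detail):

    #include <stdbool.h>
    #include <stdint.h>

    #define VIRTIO_NET_F_STANDBY 62  /* feature bit from the virtio spec */

    struct VirtIONet {
        bool failover;             /* set via failover=on */
        bool primary_plugged;
        void *saved_primary_opts;  /* stashed -device options */
    };

    /* called when the guest acks its feature bits */
    static void on_features_set(struct VirtIONet *n, uint64_t features)
    {
        if (n->failover && !n->primary_plugged &&
            (features & (1ULL << VIRTIO_NET_F_STANDBY))) {
            /* now really create the primary, e.g. by handing the
             * stashed options to qdev_device_add() */
            n->primary_plugged = true;
        }
    }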

Previous discussion: 
  RFC v1 https://patchwork.ozlabs.org/cover/989098/
  RFC v2 https://www.mail-archive.com/qemu-devel@nongnu.org/msg606906.html

To summarize concerns/feedback from previous discussion:
1. The guest OS can reject or, worse, _delay_ unplug by any amount of time.
  Migration might get stuck for an unpredictable time with no clear reason.
  This approach combines two tricky things, hot(un)plug and migration.
  -> We can surprise-remove the PCI device, and in QEMU we can do all
     necessary rollbacks transparently to management software. Will it be
     easy? Probably not.
2. PCI devices are a precious resource. The primary device should never
  be added to QEMU if it won't be used by the guest, instead of hiding it
  in QEMU.
  -> We only hotplug the device once the standby feature bit has been
     negotiated. We save the device's command-line options until we need
     them for qdev_device_add().
     Hiding a device can be a useful concept to model. For example a
     PCI device in a powered-off slot could be marked as hidden until the
     slot is powered on (mst).
3. Management layer software should handle this. OpenStack already has
  components/code to handle unplugging/replugging VFIO devices, and
  metadata to provide to the guest for detecting which devices should be
  paired.
  -> An approach that includes all software from firmware up to
     higher-level management software hasn't been tried in recent years.
     This is an attempt to keep it simple and contained in QEMU as much
     as possible.
4. Hotplugging a device and then making it part of a failover setup is
   not possible.
  -> Addressed by extending the qdev hotplug functions to check for the
     hidden attribute, so e.g. device_add can be used to plug a device.
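
Conceptually, the qdev side of point 4 looks something like this (names
invented for illustration):

    #include <stdbool.h>
    #include <stddef.h>

    struct QemuOpts;    /* stand-ins for the real qdev types */
    struct DeviceState;

    /* set by virtio-net: "do these -device options belong to a failover
     * primary that must stay hidden for now?" */
    static bool (*device_hider)(struct QemuOpts *opts);

    struct DeviceState *qdev_device_add(struct QemuOpts *opts)
    {
        if (device_hider && device_hider(opts)) {
            /* stash opts for later; not an error, the plug is deferred */
            return NULL;
        }
        /* ... normal create-and-realize path elided ... */
        return NULL;
    }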


I have tested this with an mlx5 NIC and was able to migrate the VM with
the above-mentioned workarounds for the open problems.

Command line example:

qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
        -machine q35,kernel-irqchip=split -cpu host   \
        -k fr   \
        -serial stdio   \
        -net none \
        -qmp unix:/tmp/qmp.socket,server,nowait \
        -monitor telnet:127.0.0.1:5555,server,nowait \
        -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
        -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
        -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
        -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
        -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
        /root/rhel-guest-image-8.0-1781.x86_64.qcow2

Then the primary device can be hotplugged via
 (qemu) device_add vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1


I'm grateful for any remarks or ideas!

Thanks!


Jens Freimann (4):
  migration: allow unplug during migration for failover devices
  qdev/qbus: Add hidden device support
  net/virtio: add failover support
  vfio: unplug failover primary device before migration

 hw/core/qdev.c                 |  20 ++++++
 hw/net/virtio-net.c            | 117 +++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                  |  25 ++++++-
 hw/vfio/pci.h                  |   2 +
 include/hw/qdev-core.h         |  10 +++
 include/hw/virtio/virtio-net.h |  12 ++++
 qdev-monitor.c                 |  43 ++++++++++--
 vl.c                           |   6 +-
 8 files changed, 228 insertions(+), 7 deletions(-)

-- 
2.21.0


Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Daniel P. Berrangé 4 years, 11 months ago
On Fri, May 17, 2019 at 02:58:16PM +0200, Jens Freimann wrote:
> [...]
> 
> Command line example:
> 
> qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
>         -machine q35,kernel-irqchip=split -cpu host   \
>         -k fr   \
>         -serial stdio   \
>         -net none \
>         -qmp unix:/tmp/qmp.socket,server,nowait \
>         -monitor telnet:127.0.0.1:5555,server,nowait \
>         -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
>         -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
>         -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
>         -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
>         -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
>         /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> 
> Then the primary device can be hotplugged via
>  (qemu) device_add vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1

This command line syntax looks much saner now that the circular dep is
gone. I think this approach is now viable to use from libvirt's POV.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Michael S. Tsirkin 4 years, 11 months ago
On Fri, May 17, 2019 at 02:58:16PM +0200, Jens Freimann wrote:
> [...]
> 
> I'm grateful for any remarks or ideas!
> 
> Thanks!

Hi Jens!
Overall I like the patches. Thanks!

Could you please tell us a bit more about other hardware: does this work
more or less universally across vendors? were any other cards tested?

Thanks in advance.


Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Jens Freimann 4 years, 11 months ago
On Tue, May 21, 2019 at 06:10:19AM -0400, Michael S. Tsirkin wrote:
>On Fri, May 17, 2019 at 02:58:16PM +0200, Jens Freimann wrote:
>> I'm grateful for any remarks or ideas!
>>
>> Thanks!
>
>Hi Jens!
>Overall I like the patches. Thanks!
>
>Could you please tell us a bit more about other hardware: does this work
>more or less universally across vendors? were any other cards tested?

Thank you, I have tested only Mellanox and XL710 so far but am working
on testing with more NICs at the moment. I think there are a few more
things to work out in the patches (especially the unplug before
migration) which should give me a bit more time to test other cards.

Also I haven't yet tested unplug with surprise removal. My understanding
is that device_del was switched from using surprise removal to
attention button etc. a while back. So I'll first have to find out how
to try removing a card with surprise removal in QEMU.

regards,
Jens  

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Michael S. Tsirkin 4 years, 11 months ago
On Tue, May 21, 2019 at 09:17:54PM +0200, Jens Freimann wrote:
> On Tue, May 21, 2019 at 06:10:19AM -0400, Michael S. Tsirkin wrote:
> > On Fri, May 17, 2019 at 02:58:16PM +0200, Jens Freimann wrote:
> > > I'm grateful for any remarks or ideas!
> > > 
> > > Thanks!
> > 
> > Hi Jens!
> > Overall I like the patches. Thanks!
> > 
> > Could you please tell us a bit more about other hardware: does this work
> > more or less universally across vendors? were any other cards tested?
> 
> Thank you, I have tested only Mellanox and XL710 so far but am working
> on testing with more nics at the moment. I think there's a few more
> things to work out with the patches (especially the unplug before
> migration) which should give me a bit more time to test other cards.
> 
> Also I haven't yet tested unplug with surprise removal. My understanding
> is that device_del was switched from using surprise removal to
> attention button etc. a while back.

it never used surprise removal

> So I'll have to find out first how
> to try removing a card with surprise removal in qemu.
> 
> regards,
> Jens

I would not do this at this stage yet. Lots of work is needed to
make Linux not crash.

-- 
MST

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Laine Stump 4 years, 10 months ago
On 5/17/19 8:58 AM, Jens Freimann wrote:
> [...]
> 
> Command line example:
> 
> qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
>          -machine q35,kernel-irqchip=split -cpu host   \
>          -k fr   \
>          -serial stdio   \
>          -net none \
>          -qmp unix:/tmp/qmp.socket,server,nowait \
>          -monitor telnet:127.0.0.1:5555,server,nowait \
>          -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
>          -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
>          -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
>          -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
>          -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
>          /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> 
> Then the primary device can be hotplugged via
>   (qemu) device_add vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1


I guess this is the commandline on the migration destination, and as far 
as I understand from this example, on the destination we (meaning 
libvirt or higher level management application) must *not* include the 
assigned device on the qemu commandline, but must instead hotplug the 
device later after the guest CPUs have been restarted on the destination.

So if I'm understanding correctly, the idea is that on the migration 
source, the device may have been hotplugged, or may have been included 
when qemu was originally started. Then qemu automatically handles the 
unplug of the device on the source, but it seems qemu does nothing on 
the destination, leaving that up to libvirt or a higher layer to implement.

Then in order for this to work, libvirt (or OpenStack or oVirt or 
whoever) needs to understand that the device in the libvirt config (it 
will still be in the libvirt config, since from libvirt's POV it hasn't 
been unplugged):

1) shouldn't be included in the qemu commandline on the destination,

2) will almost surely need to be replaced with a different device on the 
destination (since it's almost certain that the destination won't have 
an available device at the same PCI address)

3) will probably need to be unbound from the VF net driver (does this 
need to happen before migration is finished? If we want to lower the 
probability of a failure after we're already committed to the migration, 
then I think we must, but libvirt isn't set up for that in any way).

4) will need to be hotplugged after the migration has finished *and* 
after the guest CPUs have been restarted on the destination.


While it will be possible to assure that there is a destination device, 
and to replace the old device with a new one in the config (and maybe, 
either with some major reworking of device assignment code, or by 
offloading the responsibility to the management application(s), possible 
to re-bind the device to the vfio-pci driver), prior to marking the 
migration as "successful" (thus committing to running it on the 
destination), we can't say as much for actually assigning the device. So 
if the assignment fails, then what happens?


So a few issues I see that will need to be solved by [someone] 
(apparently either libvirt or management):

a) there isn't anything in libvirt's XML grammar that allows us to 
signify a device that is "present in the config but shouldn't be 
included in the commandline"

b) someone will need to replace the device from the source with an 
equivalent device on the destination in the libvirt XML. There are other 
cases of management modifying the XML during migration (I think), but 
this does point out that putting the "auto-unplug" code into qemu isn't 
turning this into a trivial change for the management layers.

c) there is nothing in libvirt's migration logic that can cause a device 
to be rebound to vfio-pci prior to completion of a migration (a typical 
sysfs rebind sequence is sketched after this list). Unless this is added 
to libvirt (or the rebind operation is passed off to the management 
application), we will need to live with the possibility that hotplugging 
the device will fail due to a failed rebind *after* we've committed to 
the migration.

d) once the guest CPUs are restarted on the destination, [someone] 
(libvirt or management) needs to hotplug the new device on the 
destination. (I'm guessing that a hotplug can only be done while the 
guest CPUs are running; correct me if this is wrong!)
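
For reference, the rebind in (c) usually amounts to the sysfs
driver_override dance; a minimal standalone sketch (error handling
elided, using the example VF address from this thread):

    #include <stdio.h>

    static void write_sysfs(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");
        if (f) { fputs(val, f); fclose(f); }
    }

    int main(void)
    {
        const char *bdf = "0000:5e:00.2";
        char path[256];

        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/driver_override", bdf);
        write_sysfs(path, "vfio-pci");        /* prefer vfio-pci next */

        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/driver/unbind", bdf);
        write_sysfs(path, bdf);               /* drop the VF net driver */

        write_sysfs("/sys/bus/pci/drivers_probe", bdf); /* re-probe */
        return 0;
    }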

This sounds like a lot of complexity for something that was supposed to 
be handled completely/transparently by qemu :-P.

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Michael S. Tsirkin 4 years, 10 months ago
On Tue, Jun 11, 2019 at 11:42:54AM -0400, Laine Stump wrote:
> On 5/17/19 8:58 AM, Jens Freimann wrote:
> > [...]
> 
> 
> I guess this is the commandline on the migration destination, and as far as
> I understand from this example, on the destination we (meaning libvirt or
> higher level management application) must *not* include the assigned device
> on the qemu commandline, but must instead hotplug the device later after the
> guest CPUs have been restarted on the destination.
> 
> So if I'm understanding correctly, the idea is that on the migration source,
> the device may have been hotplugged, or may have been included when qemu was
> originally started. Then qemu automatically handles the unplug of the device
> on the source, but it seems qemu does nothing on the destination, leaving
> that up to libvirt or a higher layer to implement.

Good point. I don't see why it would not work just as well
with device present straight away.

Did I miss something?

I think Jens was just testing local machine migration
and of course you can only assign a device to 1 VM at a time.


Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Laine Stump 4 years, 10 months ago
On 6/11/19 11:51 AM, Michael S. Tsirkin wrote:
> On Tue, Jun 11, 2019 at 11:42:54AM -0400, Laine Stump wrote:
>> On 5/17/19 8:58 AM, Jens Freimann wrote:
>>> [...]
>>
>>
>> I guess this is the commandline on the migration destination, and as far as
>> I understand from this example, on the destination we (meaning libvirt or
>> higher level management application) must *not* include the assigned device
>> on the qemu commandline, but must instead hotplug the device later after the
>> guest CPUs have been restarted on the destination.
>>
>> So if I'm understanding correctly, the idea is that on the migration source,
>> the device may have been hotplugged, or may have been included when qemu was
>> originally started. Then qemu automatically handles the unplug of the device
>> on the source, but it seems qemu does nothing on the destination, leaving
>> that up to libvirt or a higher layer to implement.
> 
> Good point. I don't see why it would not work just as well
> with device present straight away.

Will the guest get properly notified about the device if it had been 
unplugged (from the guest POV) prior to the migration (so the last thing 
the guest knows is that there is no device), but is suddenly/magically 
back in place from the instant the CPUs start on the destination? 
Doesn't there need to be some sort of notification sent from qemu to the 
guest OS to let it know that a "new" device has been plugged in? (I've 
always assumed this was the case, but it's really just guessing on my 
part :-)

It will certainly make things simpler if the device can be present in 
the qemu commandline. If that's the case, then I *think* only item (2) 
from my earlier list will need solving.

> 
> Did I miss something?
> 
> I think Jens was just testing local machine migration
> and of course you can only assign a device to 1 VM at a time.


That's useful and convenient for a smoke test, but doesn't account for 
the (almost 100%) possibility of having a different device address 
(maybe even a different model of device) on source and destination, or 
for the need to unbind/rebind vfio-pci and the host net driver.



Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Daniel P. Berrangé 4 years, 10 months ago
On Tue, Jun 11, 2019 at 11:42:54AM -0400, Laine Stump wrote:
> On 5/17/19 8:58 AM, Jens Freimann wrote:
> > [...]
> 
> 
> I guess this is the commandline on the migration destination, and as far as
> I understand from this example, on the destination we (meaning libvirt or
> higher level management application) must *not* include the assigned device
> on the qemu commandline, but must instead hotplug the device later after the
> guest CPUs have been restarted on the destination.
> 
> So if I'm understanding correctly, the idea is that on the migration source,
> the device may have been hotplugged, or may have been included when qemu was
> originally started. Then qemu automatically handles the unplug of the device
> on the source, but it seems qemu does nothing on the destination, leaving
> that up to libvirt or a higher layer to implement.
> 
> Then in order for this to work, libvirt (or OpenStack or oVirt or whoever)
> needs to understand that the device in the libvirt config (it will still be
> in the libvirt config, since from libvirt's POV it hasn't been unplugged):
> 
> 1) shouldn't be included in the qemu commandline on the destination,

I don't believe that's the case.  The CLI args above are just illustrating
that it is now possible to *optionally* not specify the VFIO device on the
CLI. This is because previous versions of the patchset *always* required
the device on the CLI due to a circular dependency in the CLI syntax. This
patch series version fixed that limitation, so now the VFIO device can be
cold plugged or hotplugged as desired.

> 2) will almost surely need to be replaced with a different device on the
> destination (since it's almost certain that the destination won't have an
> available device at the same PCI address)

Yes, the management application that triggers the migration will need to
pass in a new XML document to libvirt when starting the migration so that
we use the suitable new device on the target host.

> 3) will probably need to be unbound from the VF net driver (does this need
> to happen before migration is finished? If we want to lower the probability
> of a failure after we're already committed to the migration, then I think we
> must, but libvirt isn't set up for that in any way).
> 
> 4) will need to be hotplugged after the migration has finished *and* after
> the guest CPUs have been restarted on the destination.

My understanding is that QEMU takes care of this.

> a) there isn't anything in libvirt's XML grammar that allows us to signify a
> device that is "present in the config but shouldn't be included in the
> commandline"

I don't think we need that.

> b) someone will need to replace the device from the source with an
> equivalent device on the destination in the libvirt XML. There are other
> cases of management modifying the XML during migration (I think), but this
> does point out that putting the "auto-unplug" code into qemu isn't turning
> this into a trivial change for the management layers.

The mgmt app should pass the new device details in the XML when starting
migration. Shouldn't be a big deal as OpenStack already does that for 
quite a few other parts of the config.

> c) there is nothing in libvirt's migration logic that can cause a device to
> be rebound to vfio-pci prior to completion of a migration. Unless this is
> added to libvirt (or the rebind operation is passed off to the management
> application), we will need to live with the possibility that hotplugging the
> device will fail due to a failed rebind *after* we've committed to the
> migration.

IIUC, we should be binding to vfio-pci during the prepare phase of the
migration, since that's when QEMU is started by libvirt on the target.

> d) once the guest CPUs are restarted on the destination, [someone] (libvirt
> or management) needs to hotplug the new device on the destination. (I'm
> guessing that a hotplug can only be done while the guest CPUs are running;
> correct me if this is wrong!)

I don't believe so, since we'll be able to cold plug it during prepare
phase.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Jens Freimann 4 years, 10 months ago
On Wed, Jun 12, 2019 at 11:11:23AM +0200, Daniel P. Berrangé wrote:
>On Tue, Jun 11, 2019 at 11:42:54AM -0400, Laine Stump wrote:
>> On 5/17/19 8:58 AM, Jens Freimann wrote:
> >
>> > [...]
>>
>>
>> I guess this is the commandline on the migration destination, and as far as
>> I understand from this example, on the destination we (meaning libvirt or
>> higher level management application) must *not* include the assigned device
>> on the qemu commandline, but must instead hotplug the device later after the
>> guest CPUs have been restarted on the destination.
>>
>> So if I'm understanding correctly, the idea is that on the migration source,
>> the device may have been hotplugged, or may have been included when qemu was
>> originally started. Then qemu automatically handles the unplug of the device
>> on the source, but it seems qemu does nothing on the destination, leaving
>> that up to libvirt or a higher layer to implement.
>>
>> Then in order for this to work, libvirt (or OpenStack or oVirt or whoever)
>> needs to understand that the device in the libvirt config (it will still be
>> in the libvirt config, since from libvirt's POV it hasn't been unplugged):
>>
>> 1) shouldn't be included in the qemu commandline on the destination,
>
>I don't believe that's the case.  The CLI args above are just illustrating
>that it is now possible to *optionally* not specify the VFIO device on the
>CLI. This is because previous versions of the patchset *always* required
>the device on the CLI due to a circular dependency in the CLI syntax. This
>patch series version fixed that limitation, so now the VFIO device can be
>cold plugged or hotplugged as desired.

I've mostly tested hotplugging, but cold plug should work as well.

>> 2) will almost surely need to be replaced with a different device on the
>> destination (since it's almost certain that the destination won't have an
>> available device at the same PCI address)
>
>Yes, the management application that triggers the migration will need to
>pass in a new XML document to libvirt when starting the migration so that
>we use the suitable new device on the target host.

Yes, that's how I expected it to work. In my tests the PCI address was
the same on the destination and source host, but that was more by accident.
I think the libvirt XML on the destination just needs to have the PCI
address of a NIC of the same type for it to work.

>> 3) will probably need to be unbinded from the VF net driver (does this need
>> to happen before migration is finished? If we want to lower the probability
>> of a failure after we're already committed to the migration, then I think we
>> must, but libvirt isn't set up for that in any way).

Yes, I think that's part of the 'partial' unplug I'm trying to
figure out at the moment.

>> 4) will need to be hotplugged after the migration has finished *and* after
>> the guest CPUs have been restarted on the destination.
>
>My understanding is that QEMU takes care of this.

So the re-plugging of the device on the destination is not in v1
of the patches, which I failed to mention, my bad. I will send out a v2
that has this part as well shortly. I added a runstate change handler
that is called on the destination when the run state changes from INMIGRATE
to something else. When the new state is RUNNING, I hotplug the primary device.
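
A rough sketch of that handler - failover_replug_primary() is a made-up
name for the helper that does the actual plug:

  static void virtio_net_failover_vm_state_change(void *opaque, int running,
                                                  RunState state)
  {
      VirtIONet *n = opaque;

      if (running && state == RUN_STATE_RUNNING) {
          /* the destination left INMIGRATE and the guest CPUs run
           * again, so hotplug the primary device now */
          failover_replug_primary(n); /* made-up helper */
      }
  }

  /* registered once, e.g. at realize time:
   *  qemu_add_vm_change_state_handler(virtio_net_failover_vm_state_change, n);
   */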

>> a) there isn't anything in libvirt's XML grammar that allows us to signify a
>> device that is "present in the config but shouldn't be included in the
>> commandline"
>
>I don't think we need that.
>
>> b) someone will need to replace the device from the source with an
>> equivalent device on the destination in the libvirt XML. There are other
>> cases of management modifying the XML during migration (I think), but this
>> does point out that putting the "auto-unplug code into qemu isn't turning
>> this into a trivial
>
>The mgmt app should pass the new device details in the XML when starting
>migration. Shouldn't be a big deal as OpenStack already does that for
>quite a few other parts of the config.
>
>> c) there is nothing in libvirt's migration logic that can cause a device to
>> be re-binded to vfio-pci prior to completion of a migration. Unless this is
>> added to libvirt (or the re-bind operation is passed off to the management
>> application), we will need to live with the possibility that hotplugging the
>> device will fail due to failed re-bind *after* we've committed to the
>> migration.
>
>IIUC, we should be binding to vfio-pci during the prepare phase of the
>migration, since that's when QEMU is started by libvirt on the target.
>
>> d) once the guest CPUs are restarted on the destination, [someone] (libvirt
>> or management) needs to hotplug the new device on the destination. (I'm
>> guessing that a hotplug can only be done while the guest CPUs are running;
>> correct me if this is wrong!)
>
>I don't believe so, since we'll be able to cold plug it during prepare
>phase.

I think I don't yet understand what happens during the prepare phase on
the destination; I need to look into that. But I think my assumption
that QEMU has to plug the device on the destination side was wrong.
You're saying we could just always cold plug it directly when
the VM is started. I think an exception would be when the guest was
migrated before we added the primary device on the source, i.e. before
virtio feature negotiation.

regards,
Jens 

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Laine Stump 4 years, 10 months ago
On 6/12/19 7:59 AM, Jens Freimann wrote:
> On Wed, Jun 12, 2019 at 11:11:23AM +0200, Daniel P. Berrangé wrote:
>> On Tue, Jun 11, 2019 at 11:42:54AM -0400, Laine Stump wrote:
>>> On 5/17/19 8:58 AM, Jens Freimann wrote:
>> >
>>> > Command line example:
>>> >
>>> > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
>>> >          -machine q35,kernel-irqchip=split -cpu host   \
>>> >          -k fr   \
>>> >          -serial stdio   \
>>> >          -net none \
>>> >          -qmp unix:/tmp/qmp.socket,server,nowait \
>>> >          -monitor telnet:127.0.0.1:5555,server,nowait \
>>> >          -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
>>> >          -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
>>> >          -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
>>> >          -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
>>> >          -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
>>> >          /root/rhel-guest-image-8.0-1781.x86_64.qcow2
>>> >
>>> > Then the primary device can be hotplugged via
>>> >   (qemu) device_add vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1
>>>
>>>
>>> I guess this is the commandline on the migration destination, and as 
>>> far as
>>> I understand from this example, on the destination we (meaning 
>>> libvirt or
>>> higher level management application) must *not* include the assigned 
>>> device
>>> on the qemu commandline, but must instead hotplug the device later 
>>> after the
>>> guest CPUs have been restarted on the destination.
>>>
>>> So if I'm understanding correctly, the idea is that on the migration 
>>> source,
>>> the device may have been hotplugged, or may have been included when 
>>> qemu was
>>> originally started. Then qemu automatically handles the unplug of the 
>>> device
>>> on the source, but it seems qemu does nothing on the destination, 
>>> leaving
>>> that up to libvirt or a higher layer to implement.
>>>
>>> Then in order for this to work, libvirt (or OpenStack or oVirt or 
>>> whoever)
>>> needs to understand that the device in the libvirt config (it will 
>>> still be
>>> in the libvirt config, since from libvirt's POV it hasn't been 
>>> unplugged):
>>>
>>> 1) shouldn't be included in the qemu commandline on the destination,
>>
>> I don't believe that's the case.  The CLI args above are just 
>> illustrating
>> that it is now possible to *optionally* not specify the VFIO device on 
>> the
>> CLI. This is because previous versions of the patchset *always* required
>> the device on the CLI due to a circular dependancy in the CLI syntax. 
>> This
>> patch series version fixed that limitation, so now the VFIO device can be
>> cold plugged or hotplugged as desired.
> 
> I've mostly tested hotplugging but cold plugged should work as well.


Okay, in that case my issues 1, 3, and 4 are irrelevant (and (2) is 
handled by management), so the concerns from my previous email are all 
addressed.

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Alex Williamson 4 years, 11 months ago
On Fri, 17 May 2019 14:58:16 +0200
Jens Freimann <jfreimann@redhat.com> wrote:

> This is another attempt at implementing the host side of the
> net_failover concept
> (https://www.kernel.org/doc/html/latest/networking/net_failover.html)
> 
> Changes since last RFC:
> - work around circular dependency of commandline options. Just add
>   failover=on to the virtio-net standby options and reference it from
>   primary (vfio-pci) device with standby=<id>  
> - add patch 3/4 to allow migration of vfio-pci device when it is part of a
>   failover pair, still disallow for all other devices
> - add patch 4/4 to allow unplug of device during migration, make an
>   exception for failover primary devices. I'd like feedback on how to
>   solve this more elegantly. I added a boolean to DeviceState, have it
>   default to false for all devices except for primary devices. 
> - not tested yet with surprise removal
> - I don't expect this to go in as it is, still needs more testing but
>   I'd like to get feedback on above mentioned changes.
> 
> The general idea is that we have a pair of devices, a vfio-pci and an
> emulated device. Before migration the vfio device is unplugged and data
> flows to the emulated device, on the target side another vfio-pci device
> is plugged in to take over the data-path. In the guest the net_failover
> module will pair net devices with the same MAC address.
> 
> * In the first patch the infrastructure for hiding the device is added
>   for the qbus and qdev APIs. 
> 
> * In the second patch the virtio-net uses the API to defer adding the vfio
>   device until the VIRTIO_NET_F_STANDBY feature is acked.
> 
> Previous discussion: 
>   RFC v1 https://patchwork.ozlabs.org/cover/989098/
>   RFC v2 https://www.mail-archive.com/qemu-devel@nongnu.org/msg606906.html
> 
> To summarize concerns/feedback from previous discussion:
> 1.- guest OS can reject or worse _delay_ unplug by any amount of time.
>   Migration might get stuck for unpredictable time with unclear reason.
>   This approach combines two tricky things, hot/unplug and migration. 
>   -> We can surprise-remove the PCI device and in QEMU we can do all  
>      necessary rollbacks transparent to management software. Will it be
>      easy, probably not.
> 2. PCI devices are a precious resource. The primary device should never
>   be added to QEMU if it won't be used by guest instead of hiding it in
>   QEMU. 
>   -> We only hotplug the device when the standby feature bit was  
>      negotiated. We save the device cmdline options until we need it for
>      qdev_device_add()
>      Hiding a device can be a useful concept to model. For example a
>      pci device in a powered-off slot could be marked as hidden until the slot is
>      powered on (mst).
> 3. Management layer software should handle this. Open Stack already has
>   components/code to handle unplug/replug VFIO devices and metadata to
>   provide to the guest for detecting which devices should be paired.
>   -> An approach that includes all software from firmware to  
>      higher-level management software wasn't tried in the last years. This is
>      an attempt to keep it simple and contained in QEMU as much as possible.
> 4. Hotplugging a device and then making it part of a failover setup is
>    not possible
>   -> addressed by extending qdev hotplug functions to check for hidden  
>      attribute, so e.g. device_add can be used to plug a device.
> 
> 
> I have tested this with a mlx5 NIC and was able to migrate the VM with
> above mentioned workarounds for open problems.
> 
> Command line example:
> 
> qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
>         -machine q35,kernel-irqchip=split -cpu host   \
>         -k fr   \
>         -serial stdio   \
>         -net none \
>         -qmp unix:/tmp/qmp.socket,server,nowait \
>         -monitor telnet:127.0.0.1:5555,server,nowait \
>         -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
>         -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
>         -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
>         -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
>         -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
>         /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> 
> Then the primary device can be hotplugged via
>  (qemu) device_add vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1

Is this standby= option only valid for Network/Ethernet class code
devices?  If so, perhaps vfio-pci code should reject the option on any
non-ethernet devices.  The option is also non-intuitive for users; only
through examples like the above can we see that it relates to the id of the
secondary device.  Could we instead name it something like
"standby_net_failover_pair_id="?

Also, this feature requires matching MAC addresses per the description,
where is that done?  Is it the user's responsibility to set the MAC on
the host device prior to the device_add?  If so, is this actually not
only specific to ethernet devices, but ethernet VFs?

Finally, please copy me on code touching vfio.  Thanks,

Alex

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Jens Freimann 4 years, 11 months ago
On Mon, May 20, 2019 at 04:56:57PM -0600, Alex Williamson wrote:
>On Fri, 17 May 2019 14:58:16 +0200
>Jens Freimann <jfreimann@redhat.com> wrote:
>> Command line example:
>>
>> qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
>>         -machine q35,kernel-irqchip=split -cpu host   \
>>         -k fr   \
>>         -serial stdio   \
>>         -net none \
>>         -qmp unix:/tmp/qmp.socket,server,nowait \
>>         -monitor telnet:127.0.0.1:5555,server,nowait \
>>         -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
>>         -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
>>         -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
>>         -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
>>         -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
>>         /root/rhel-guest-image-8.0-1781.x86_64.qcow2
>>
>> Then the primary device can be hotplugged via
>>  (qemu) device_add vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1
>
>Is this standby= option only valid for Network/Ethernet class code
>devices?  If so, perhaps vfio-pci code should reject the option on any
>non-ethernet devices.  The option is also non-intuitive for users, only
>through examples like above can we see it relates to the id of the
>secondary device.  Could we instead name it something like
>"standby_net_failover_pair_id="?

It is only for ethernet (VFs); I will add code to reject non-ethernet VF
devices. I agree the name is not descriptive, and the one you suggest seems
good to me.
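
Probably something like this in vfio_realize() - just a sketch, and the
standby_id field name is made up:

  /* sketch: reject the standby option on non-Ethernet class devices */
  if (vdev->standby_id &&
      pci_get_word(vdev->pdev.config + PCI_CLASS_DEVICE) !=
          PCI_CLASS_NETWORK_ETHERNET) {
      error_setg(errp, "standby is only supported for Ethernet devices");
      goto error;
  }
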
>
>Also, this feature requires matching MAC addresses per the description,
>where is that done?  Is it the user's responsibility to set the MAC on
>the host device prior to the device_add?  If so, is this actually not
>only specific to ethernet devices, but ethernet VFs?

Yes, it's the user's responsibility, and the MACs are then matched by
the net_failover driver in the guest. It makes sense for ethernet VFs only;
I'll add a check for that.
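
E.g. on the host, before the device_add (the PF name is just an example;
the MAC is the one from the virtio-net device above):

  ip link set enp94s0f0 vf 0 mac 52:54:00:6f:55:cc
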
>
>Finally, please copy me on code touching vfio.  Thanks,

I'm sorry about that, will do.

Thanks for the review Alex!

regards,
Jens 

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Michael S. Tsirkin 4 years, 11 months ago
On Tue, May 21, 2019 at 09:21:57AM +0200, Jens Freimann wrote:
> On Mon, May 20, 2019 at 04:56:57PM -0600, Alex Williamson wrote:
> > On Fri, 17 May 2019 14:58:16 +0200
> > Jens Freimann <jfreimann@redhat.com> wrote:
> > > Command line example:
> > > 
> > > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> > >         -machine q35,kernel-irqchip=split -cpu host   \
> > >         -k fr   \
> > >         -serial stdio   \
> > >         -net none \
> > >         -qmp unix:/tmp/qmp.socket,server,nowait \
> > >         -monitor telnet:127.0.0.1:5555,server,nowait \
> > >         -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> > >         -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> > >         -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> > >         -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> > >         -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
> > >         /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> > > 
> > > Then the primary device can be hotplugged via
> > >  (qemu) device_add vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1
> > 
> > Is this standby= option only valid for Network/Ethernet class code
> > devices?  If so, perhaps vfio-pci code should reject the option on any
> > non-ethernet devices.  The option is also non-intuitive for users, only
> > through examples like above can we see it relates to the id of the
> > secondary device.  Could we instead name it something like
> > "standby_net_failover_pair_id="?
> 
> It is only for ethernet (VFs), I will add code to reject non-ethernet VF devices.
> I agree the name is not descriptive and the one you suggest seems good to
> me.
> > 
> > Also, this feature requires matching MAC addresses per the description,
> > where is that done?  Is it the user's responsibility to set the MAC on
> > the host device prior to the device_add?  If so, is this actually not
> > only specific to ethernet devices, but ethernet VFs?
> 
> Yes, it's the users responsibility and the MACs are then matched by
> the net_failover driver in the guest. It makes sense for ethernet VFs only,
> I'll add a check for that.

Actually, is there a list of devices for which this has been tested
besides mlx5? I think someone said some old Intel cards
don't support this well; we might need to blacklist these ...

> > 
> > Finally, please copy me on code touching vfio.  Thanks,
> 
> I'm sorry about that, will do.
> 
> Thanks for the review Alex!
> 
> regards,
> Jens

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Jens Freimann 4 years, 11 months ago
On Tue, May 21, 2019 at 07:37:19AM -0400, Michael S. Tsirkin wrote:
>On Tue, May 21, 2019 at 09:21:57AM +0200, Jens Freimann wrote:
>> On Mon, May 20, 2019 at 04:56:57PM -0600, Alex Williamson wrote:
>> > On Fri, 17 May 2019 14:58:16 +0200
>> > Jens Freimann <jfreimann@redhat.com> wrote:
>> > > Command line example:
>> > >
>> > > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
>> > >         -machine q35,kernel-irqchip=split -cpu host   \
>> > >         -k fr   \
>> > >         -serial stdio   \
>> > >         -net none \
>> > >         -qmp unix:/tmp/qmp.socket,server,nowait \
>> > >         -monitor telnet:127.0.0.1:5555,server,nowait \
>> > >         -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
>> > >         -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
>> > >         -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
>> > >         -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
>> > >         -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
>> > >         /root/rhel-guest-image-8.0-1781.x86_64.qcow2
>> > >
>> > > Then the primary device can be hotplugged via
>> > >  (qemu) device_add vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1
>> >
>> > Is this standby= option only valid for Network/Ethernet class code
>> > devices?  If so, perhaps vfio-pci code should reject the option on any
>> > non-ethernet devices.  The option is also non-intuitive for users, only
>> > through examples like above can we see it relates to the id of the
>> > secondary device.  Could we instead name it something like
>> > "standby_net_failover_pair_id="?
>>
>> It is only for ethernet (VFs), I will add code to reject non-ethernet VF devices.
>> I agree the name is not descriptive and the one you suggest seems good to
>> me.
>> >
>> > Also, this feature requires matching MAC addresses per the description,
>> > where is that done?  Is it the user's responsibility to set the MAC on
>> > the host device prior to the device_add?  If so, is this actually not
>> > only specific to ethernet devices, but ethernet VFs?
>>
>> Yes, it's the users responsibility and the MACs are then matched by
>> the net_failover driver in the guest. It makes sense for ethernet VFs only,
>> I'll add a check for that.
>
>Actually is there a list of devices for which this has been tested
>besides mlx5? I think someone said some old intel cards
>don't support this well, we might need to blacklist these ...

So far I've tested mlx5 and XL710, which both worked, and I'm
working on testing with more devices. Of course, help with testing
is greatly appreciated.

regards,
Jens 

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by si-wei liu 4 years, 11 months ago

On 5/21/2019 11:49 AM, Jens Freimann wrote:
> On Tue, May 21, 2019 at 07:37:19AM -0400, Michael S. Tsirkin wrote:
>> On Tue, May 21, 2019 at 09:21:57AM +0200, Jens Freimann wrote:
>>> On Mon, May 20, 2019 at 04:56:57PM -0600, Alex Williamson wrote:
>>> > On Fri, 17 May 2019 14:58:16 +0200
>>> > Jens Freimann <jfreimann@redhat.com> wrote:
>>> > > Command line example:
>>> > >
>>> > > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
>>> > >         -machine q35,kernel-irqchip=split -cpu host   \
>>> > >         -k fr   \
>>> > >         -serial stdio   \
>>> > >         -net none \
>>> > >         -qmp unix:/tmp/qmp.socket,server,nowait \
>>> > >         -monitor telnet:127.0.0.1:5555,server,nowait \
>>> > >         -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
>>> > >         -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
>>> > >         -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
>>> > >         -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
>>> > >         -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
>>> > >         /root/rhel-guest-image-8.0-1781.x86_64.qcow2
>>> > >
>>> > > Then the primary device can be hotplugged via
>>> > >  (qemu) device_add vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1
>>> >
>>> > Is this standby= option only valid for Network/Ethernet class code
>>> > devices?  If so, perhaps vfio-pci code should reject the option on 
>>> any
>>> > non-ethernet devices.  The option is also non-intuitive for users, 
>>> only
>>> > through examples like above can we see it relates to the id of the
>>> > secondary device.  Could we instead name it something like
>>> > "standby_net_failover_pair_id="?
>>>
>>> It is only for ethernet (VFs), I will add code to reject 
>>> non-ethernet VF devices.
>>> I agree the name is not descriptive and the one you suggest seems 
>>> good to
>>> me.
>>> >
>>> > Also, this feature requires matching MAC addresses per the 
>>> description,
>>> > where is that done?  Is it the user's responsibility to set the 
>>> MAC on
>>> > the host device prior to the device_add?  If so, is this actually not
>>> > only specific to ethernet devices, but ethernet VFs?
>>>
>>> Yes, it's the users responsibility and the MACs are then matched by
>>> the net_failover driver in the guest. It makes sense for ethernet 
>>> VFs only,
>>> I'll add a check for that.
>>
>> Actually is there a list of devices for which this has been tested
>> besides mlx5? I think someone said some old intel cards
>> don't support this well, we might need to blacklist these ...
>
> So far I've tested mlx5 and XL710 which both worked, but I'm
> working on testing with more devices. But of course help with testing
> is greatly appreciated.
It won't work on Intel ixgbe and Broadcom bnxt_en, which require 
toggling the state of the tap backing the virtio-net in order to 
release/reprogram the MAC filter. Actually, very few NICs work with 
this - and even when some do, it is by chance and the behavior is 
undefined. Instead of blacklisting, it makes more sense to whitelist 
the NICs that support it - presumably with some new sysfs attribute 
claiming the support.

-Siwei

>
> regards,
> Jens


Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Michael S. Tsirkin 4 years, 11 months ago
On Tue, May 28, 2019 at 05:14:22PM -0700, si-wei liu wrote:
> 
> 
> On 5/21/2019 11:49 AM, Jens Freimann wrote:
> > On Tue, May 21, 2019 at 07:37:19AM -0400, Michael S. Tsirkin wrote:
> > > On Tue, May 21, 2019 at 09:21:57AM +0200, Jens Freimann wrote:
> > > > On Mon, May 20, 2019 at 04:56:57PM -0600, Alex Williamson wrote:
> > > > > On Fri, 17 May 2019 14:58:16 +0200
> > > > > Jens Freimann <jfreimann@redhat.com> wrote:
> > > > > > Command line example:
> > > > > >
> > > > > > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> > > > > >         -machine q35,kernel-irqchip=split -cpu host   \
> > > > > >         -k fr   \
> > > > > >         -serial stdio   \
> > > > > >         -net none \
> > > > > >         -qmp unix:/tmp/qmp.socket,server,nowait \
> > > > > >         -monitor telnet:127.0.0.1:5555,server,nowait \
> > > > > >         -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> > > > > >         -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> > > > > >         -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> > > > > >         -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> > > > > >         -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
> > > > > >         /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> > > > > >
> > > > > > Then the primary device can be hotplugged via
> > > > > >  (qemu) device_add vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1
> > > > >
> > > > > Is this standby= option only valid for Network/Ethernet class code
> > > > > devices?  If so, perhaps vfio-pci code should reject the
> > > > option on any
> > > > > non-ethernet devices.  The option is also non-intuitive for
> > > > users, only
> > > > > through examples like above can we see it relates to the id of the
> > > > > secondary device.  Could we instead name it something like
> > > > > "standby_net_failover_pair_id="?
> > > > 
> > > > It is only for ethernet (VFs), I will add code to reject
> > > > non-ethernet VF devices.
> > > > I agree the name is not descriptive and the one you suggest
> > > > seems good to
> > > > me.
> > > > >
> > > > > Also, this feature requires matching MAC addresses per the
> > > > description,
> > > > > where is that done?  Is it the user's responsibility to set
> > > > the MAC on
> > > > > the host device prior to the device_add?  If so, is this actually not
> > > > > only specific to ethernet devices, but ethernet VFs?
> > > > 
> > > > Yes, it's the users responsibility and the MACs are then matched by
> > > > the net_failover driver in the guest. It makes sense for
> > > > ethernet VFs only,
> > > > I'll add a check for that.
> > > 
> > > Actually is there a list of devices for which this has been tested
> > > besides mlx5? I think someone said some old intel cards
> > > don't support this well, we might need to blacklist these ...
> > 
> > So far I've tested mlx5 and XL710 which both worked, but I'm
> > working on testing with more devices. But of course help with testing
> > is greatly appreciated.
> It won't work on Intel ixgbe and Broadcom bnxt_en, which requires toggling
> the state of tap backing the virtio-net in order to release/reprogram MAC
> filter. Actually, it's very few NICs that could work with this - even some
> works by chance the behavior is undefined. Instead of blacklisting it makes
> more sense to whitelist the NIC that supports it - with some new sysfs
> attribute claiming the support presumably.
> 
> -Siwei

I agree for many cards we won't know how they behave until we try.  One
can consider this a bug in Linux that cards don't behave in a consistent
way.  The best thing to do IMHO would be to write a tool that people can
run to test the behaviour.


> > 
> > regards,
> > Jens

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Laine Stump 4 years, 11 months ago
On 5/28/19 10:54 PM, Michael S. Tsirkin wrote:
> On Tue, May 28, 2019 at 05:14:22PM -0700, si-wei liu wrote:
>>
>>
>> On 5/21/2019 11:49 AM, Jens Freimann wrote:
>>> On Tue, May 21, 2019 at 07:37:19AM -0400, Michael S. Tsirkin wrote:
>>>> On Tue, May 21, 2019 at 09:21:57AM +0200, Jens Freimann wrote:
>>>>> On Mon, May 20, 2019 at 04:56:57PM -0600, Alex Williamson wrote:

>>>> Actually is there a list of devices for which this has been tested
>>>> besides mlx5? I think someone said some old intel cards
>>>> don't support this well, we might need to blacklist these ...
>>>
>>> So far I've tested mlx5 and XL710 which both worked, but I'm
>>> working on testing with more devices. But of course help with testing
>>> is greatly appreciated.
>>
>> It won't work on Intel ixgbe and Broadcom bnxt_en, which requires toggling
>> the state of tap backing the virtio-net in order to release/reprogram MAC
>> filter. Actually, it's very few NICs that could work with this - even some
>> works by chance the behavior is undefined. Instead of blacklisting it makes
>> more sense to whitelist the NIC that supports it - with some new sysfs
>> attribute claiming the support presumably.
>>
>> -Siwei
> 
> I agree for many cards we won't know how they behave until we try.  One
> can consider this a bug in Linux that cards don't behave in a consistent
> way.  The best thing to do IMHO would be to write a tool that people can
> run to test the behaviour.

Is the "bad behavior" something due to the hardware of the cards, or 
their drivers? If it's the latter, then at least initially having a 
whitelist would be counterproductive, since it would make it difficult 
for relative outsiders to test and report success/failure of various cards.

(It's probably just a pipe dream, but it would be nice if it eventually 
could work with old igb cards - I have several of them that I use for 
SRIOV testing, and would rather avoid having to buy new hardware.)

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Michael S. Tsirkin 4 years, 11 months ago
On Mon, Jun 03, 2019 at 02:06:47PM -0400, Laine Stump wrote:
> On 5/28/19 10:54 PM, Michael S. Tsirkin wrote:
> > On Tue, May 28, 2019 at 05:14:22PM -0700, si-wei liu wrote:
> > > 
> > > 
> > > On 5/21/2019 11:49 AM, Jens Freimann wrote:
> > > > On Tue, May 21, 2019 at 07:37:19AM -0400, Michael S. Tsirkin wrote:
> > > > > On Tue, May 21, 2019 at 09:21:57AM +0200, Jens Freimann wrote:
> > > > > > On Mon, May 20, 2019 at 04:56:57PM -0600, Alex Williamson wrote:
> 
> > > > > Actually is there a list of devices for which this has been tested
> > > > > besides mlx5? I think someone said some old intel cards
> > > > > don't support this well, we might need to blacklist these ...
> > > > 
> > > > So far I've tested mlx5 and XL710 which both worked, but I'm
> > > > working on testing with more devices. But of course help with testing
> > > > is greatly appreciated.
> >>
> > > It won't work on Intel ixgbe and Broadcom bnxt_en, which requires toggling
> > > the state of tap backing the virtio-net in order to release/reprogram MAC
> > > filter. Actually, it's very few NICs that could work with this - even some
> > > works by chance the behavior is undefined. Instead of blacklisting it makes
> > > more sense to whitelist the NIC that supports it - with some new sysfs
> > > attribute claiming the support presumably.
> > > 
> > > -Siwei
> > 
> > I agree for many cards we won't know how they behave until we try.  One
> > can consider this a bug in Linux that cards don't behave in a consistent
> > way.  The best thing to do IMHO would be to write a tool that people can
> > run to test the behaviour.
> 
> Is the "bad behavior" something due to the hardware of the cards, or their
> drivers? If it's the latter, then at least initially having a whitelist
> would be counterproductive, since it would make it difficult for relative
> outsiders to test and report success/failure of various cards.

We can add an "ignore whitelist" flag. Would that address the issue?

> (It's probably just a pipe dream, but it would be nice if it eventually
> could work with old igb cards - I have several of them that I use for SRIOV
> testing, and would rather avoid having to buy new hardware.)

I think it generally can be worked around in the driver.
Most host drivers do get a notification when the guest driver
loads/unloads and can use that to manipulate the on-device
switch.

-- 
MST

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Laine Stump 4 years, 11 months ago
On 6/3/19 2:12 PM, Michael S. Tsirkin wrote:
> On Mon, Jun 03, 2019 at 02:06:47PM -0400, Laine Stump wrote:
>> On 5/28/19 10:54 PM, Michael S. Tsirkin wrote:
>>> On Tue, May 28, 2019 at 05:14:22PM -0700, si-wei liu wrote:
>>>>
>>>>
>>>> On 5/21/2019 11:49 AM, Jens Freimann wrote:
>>>>> On Tue, May 21, 2019 at 07:37:19AM -0400, Michael S. Tsirkin wrote:
>>>>>> On Tue, May 21, 2019 at 09:21:57AM +0200, Jens Freimann wrote:
>>>>>>> On Mon, May 20, 2019 at 04:56:57PM -0600, Alex Williamson wrote:
>>
>>>>>> Actually is there a list of devices for which this has been tested
>>>>>> besides mlx5? I think someone said some old intel cards
>>>>>> don't support this well, we might need to blacklist these ...
>>>>>
>>>>> So far I've tested mlx5 and XL710 which both worked, but I'm
>>>>> working on testing with more devices. But of course help with testing
>>>>> is greatly appreciated.
>>>>
>>>> It won't work on Intel ixgbe and Broadcom bnxt_en, which requires toggling
>>>> the state of tap backing the virtio-net in order to release/reprogram MAC
>>>> filter. Actually, it's very few NICs that could work with this - even some
>>>> works by chance the behavior is undefined. Instead of blacklisting it makes
>>>> more sense to whitelist the NIC that supports it - with some new sysfs
>>>> attribute claiming the support presumably.
>>>>
>>>> -Siwei
>>>
>>> I agree for many cards we won't know how they behave until we try.  One
>>> can consider this a bug in Linux that cards don't behave in a consistent
>>> way.  The best thing to do IMHO would be to write a tool that people can
>>> run to test the behaviour.
>>
>> Is the "bad behavior" something due to the hardware of the cards, or their
>> drivers? If it's the latter, then at least initially having a whitelist
>> would be counterproductive, since it would make it difficult for relative
>> outsiders to test and report success/failure of various cards.
> 
> We can add an "ignore whitelist" flag. Would that address the issue?

It would be better than requiring a kernel/qemu recompile :-)


Where would the whitelist live? In qemu or in the kernel? It would be 
problematic to have the whitelist in qemu if kernel driver changes could 
fix a particular card.

Beyond that, what about *always* just issuing some sort of warning 
rather than completely forbidding a card that wasn't whitelisted? 
(I haven't decided if I like that better or not, and it probably doesn't 
matter since I'm not a "real" user, but I thought I would mention it.)

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Michael S. Tsirkin 4 years, 11 months ago
On Mon, Jun 03, 2019 at 02:18:19PM -0400, Laine Stump wrote:
> On 6/3/19 2:12 PM, Michael S. Tsirkin wrote:
> > On Mon, Jun 03, 2019 at 02:06:47PM -0400, Laine Stump wrote:
> > > On 5/28/19 10:54 PM, Michael S. Tsirkin wrote:
> > > > On Tue, May 28, 2019 at 05:14:22PM -0700, si-wei liu wrote:
> > > > > 
> > > > > 
> > > > > On 5/21/2019 11:49 AM, Jens Freimann wrote:
> > > > > > On Tue, May 21, 2019 at 07:37:19AM -0400, Michael S. Tsirkin wrote:
> > > > > > > On Tue, May 21, 2019 at 09:21:57AM +0200, Jens Freimann wrote:
> > > > > > > > On Mon, May 20, 2019 at 04:56:57PM -0600, Alex Williamson wrote:
> > > 
> > > > > > > Actually is there a list of devices for which this has been tested
> > > > > > > besides mlx5? I think someone said some old intel cards
> > > > > > > don't support this well, we might need to blacklist these ...
> > > > > > 
> > > > > > So far I've tested mlx5 and XL710 which both worked, but I'm
> > > > > > working on testing with more devices. But of course help with testing
> > > > > > is greatly appreciated.
> > > > > 
> > > > > It won't work on Intel ixgbe and Broadcom bnxt_en, which requires toggling
> > > > > the state of tap backing the virtio-net in order to release/reprogram MAC
> > > > > filter. Actually, it's very few NICs that could work with this - even some
> > > > > works by chance the behavior is undefined. Instead of blacklisting it makes
> > > > > more sense to whitelist the NIC that supports it - with some new sysfs
> > > > > attribute claiming the support presumably.
> > > > > 
> > > > > -Siwei
> > > > 
> > > > I agree for many cards we won't know how they behave until we try.  One
> > > > can consider this a bug in Linux that cards don't behave in a consistent
> > > > way.  The best thing to do IMHO would be to write a tool that people can
> > > > run to test the behaviour.
> > > 
> > > Is the "bad behavior" something due to the hardware of the cards, or their
> > > drivers? If it's the latter, then at least initially having a whitelist
> > > would be counterproductive, since it would make it difficult for relative
> > > outsiders to test and report success/failure of various cards.
> > 
> > We can add an "ignore whitelist" flag. Would that address the issue?
> 
> It would be better than requiring a kernel/qemu recompile :-)
> 
> 
> Where would the whilelist live? In qemu or in the kernel? It would be
> problematic to have the whitelist in qemu if kernel driver changes could fix
> a particular card.

So originally I thought:
- add some interface in the kernel to signal new behaviour
- start with a whitelist in qemu
- if not on the whitelist, check the new interface
- if not there, check a "force" flag on the device

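In pseudo-code, that ordering would be roughly (all of these helpers
are made up):

  static bool failover_primary_allowed(PCIDevice *pdev)
  {
      if (on_qemu_whitelist(pdev)) {             /* steps 1-2 */
          return true;
      }
      if (kernel_signals_new_behaviour(pdev)) {  /* step 3 */
          return true;
      }
      return has_force_flag(pdev);               /* step 4 */
  }
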
But one problem with all of the above is that it's actually
too late: with a broken driver, traffic stops being sent to the standby
as soon as management sets the MAC on the to-be-primary VF.

> Beyond that, what about *always* just issuing some sort of warning rather
> than completely forbidding a card that wasn't whitelisted? (Haven't decided
> if I like that better or not (and it probably doesn't matter, since I'm not
> a "real" user, but I thought I would mention it).

People tend to ignore warnings :)

-- 
MST

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Michael S. Tsirkin 4 years, 11 months ago
On Tue, May 21, 2019 at 08:49:18PM +0200, Jens Freimann wrote:
> On Tue, May 21, 2019 at 07:37:19AM -0400, Michael S. Tsirkin wrote:
> > On Tue, May 21, 2019 at 09:21:57AM +0200, Jens Freimann wrote:
> > > On Mon, May 20, 2019 at 04:56:57PM -0600, Alex Williamson wrote:
> > > > On Fri, 17 May 2019 14:58:16 +0200
> > > > Jens Freimann <jfreimann@redhat.com> wrote:
> > > > > Command line example:
> > > > >
> > > > > qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> > > > >         -machine q35,kernel-irqchip=split -cpu host   \
> > > > >         -k fr   \
> > > > >         -serial stdio   \
> > > > >         -net none \
> > > > >         -qmp unix:/tmp/qmp.socket,server,nowait \
> > > > >         -monitor telnet:127.0.0.1:5555,server,nowait \
> > > > >         -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> > > > >         -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> > > > >         -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> > > > >         -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> > > > >         -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
> > > > >         /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> > > > >
> > > > > Then the primary device can be hotplugged via
> > > > >  (qemu) device_add vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1
> > > >
> > > > Is this standby= option only valid for Network/Ethernet class code
> > > > devices?  If so, perhaps vfio-pci code should reject the option on any
> > > > non-ethernet devices.  The option is also non-intuitive for users, only
> > > > through examples like above can we see it relates to the id of the
> > > > secondary device.  Could we instead name it something like
> > > > "standby_net_failover_pair_id="?
> > > 
> > > It is only for ethernet (VFs), I will add code to reject non-ethernet VF devices.
> > > I agree the name is not descriptive and the one you suggest seems good to
> > > me.
> > > >
> > > > Also, this feature requires matching MAC addresses per the description,
> > > > where is that done?  Is it the user's responsibility to set the MAC on
> > > > the host device prior to the device_add?  If so, is this actually not
> > > > only specific to ethernet devices, but ethernet VFs?
> > > 
> > > Yes, it's the users responsibility and the MACs are then matched by
> > > the net_failover driver in the guest. It makes sense for ethernet VFs only,
> > > I'll add a check for that.
> > 
> > Actually is there a list of devices for which this has been tested
> > besides mlx5? I think someone said some old intel cards
> > don't support this well, we might need to blacklist these ...
> 
> So far I've tested mlx5 and XL710 which both worked, but I'm
> working on testing with more devices. But of course help with testing
> is greatly appreciated.
> 
> regards,
> Jens

A testing tool that people can run to get a pass/fail
result would be needed for that.
Do you have something like this?

-- 
MST

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Jens Freimann 4 years, 11 months ago
On Tue, May 28, 2019 at 10:40:42PM -0400, Michael S. Tsirkin wrote:
>On Tue, May 21, 2019 at 08:49:18PM +0200, Jens Freimann wrote:
>> On Tue, May 21, 2019 at 07:37:19AM -0400, Michael S. Tsirkin wrote:
>> > On Tue, May 21, 2019 at 09:21:57AM +0200, Jens Freimann wrote:
>> > Actually is there a list of devices for which this has been tested
>> > besides mlx5? I think someone said some old intel cards
>> > don't support this well, we might need to blacklist these ...
>>
>> So far I've tested mlx5 and XL710 which both worked, but I'm
>> working on testing with more devices. But of course help with testing
>> is greatly appreciated.
>
>A testing tool that people can run to get a pass/fail
>result would be needed for that.
>Do you have something like this?

I have two simple tools. One sends packets and the other sniffs for
packets to see which device they arrive on. Find them at
https://github.com/jensfr/netfailover_driver_detect
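
Usage is roughly: run send_packet on one end of the link, then sniff on
the VF (device name is just an example):

  ./is_legacy -d eth1   # exit 0: received the packets sent by send_packet
                        # exit 1: did not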

Feedback and/or patches welcome.

regards,
Jens 

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Michael S. Tsirkin 4 years, 11 months ago
On Wed, May 29, 2019 at 09:48:02AM +0200, Jens Freimann wrote:
> On Tue, May 28, 2019 at 10:40:42PM -0400, Michael S. Tsirkin wrote:
> > On Tue, May 21, 2019 at 08:49:18PM +0200, Jens Freimann wrote:
> > > On Tue, May 21, 2019 at 07:37:19AM -0400, Michael S. Tsirkin wrote:
> > > > On Tue, May 21, 2019 at 09:21:57AM +0200, Jens Freimann wrote:
> > > > Actually is there a list of devices for which this has been tested
> > > > besides mlx5? I think someone said some old intel cards
> > > > don't support this well, we might need to blacklist these ...
> > > 
> > > So far I've tested mlx5 and XL710 which both worked, but I'm
> > > working on testing with more devices. But of course help with testing
> > > is greatly appreciated.
> > 
> > A testing tool that people can run to get a pass/fail
> > result would be needed for that.
> > Do you have something like this?
> 
> I have two simple tools. One that sends packets and another one that
> sniffs for packets to see which device the packet goes to. Find it at
> https://github.com/jensfr/netfailover_driver_detect
> 
> Feedback and/or patches welcome.
> 
> regards,
> Jens

The docs say:
 ./is_legacy -d . If is_legacy returns 0 it means it has received the packets sent by send_packet. If it returns 1 it didn't receive the packet. Now run ./is_legacy -d 


So -d twice. What is the difference?

-- 
MST

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Jens Freimann 4 years, 11 months ago
On Thu, May 30, 2019 at 02:12:21PM -0400, Michael S. Tsirkin wrote:
>On Wed, May 29, 2019 at 09:48:02AM +0200, Jens Freimann wrote:
>> On Tue, May 28, 2019 at 10:40:42PM -0400, Michael S. Tsirkin wrote:
>> > On Tue, May 21, 2019 at 08:49:18PM +0200, Jens Freimann wrote:
>> > > On Tue, May 21, 2019 at 07:37:19AM -0400, Michael S. Tsirkin wrote:
>> > > > On Tue, May 21, 2019 at 09:21:57AM +0200, Jens Freimann wrote:
>> > > > Actually is there a list of devices for which this has been tested
>> > > > besides mlx5? I think someone said some old intel cards
>> > > > don't support this well, we might need to blacklist these ...
>> > >
>> > > So far I've tested mlx5 and XL710 which both worked, but I'm
>> > > working on testing with more devices. But of course help with testing
>> > > is greatly appreciated.
>> >
>> > A testing tool that people can run to get a pass/fail
>> > result would be needed for that.
>> > Do you have something like this?
>>
>> I have two simple tools. One that sends packets and another one that
>> sniffs for packets to see which device the packet goes to. Find it at
>> https://github.com/jensfr/netfailover_driver_detect
>>
>> Feedback and/or patches welcome.
>>
>> regards,
>> Jens
>
>The docs say:
> ./is_legacy -d . If is_legacy returns 0 it means it has received the packets sent by send_packet. If it returns 1 it didn't receive the packet. Now run ./is_legacy -d
>
>So -d twice. What is the difference?

It should say "Now run ./is_legacy -d <vf device>", i.e. sniff on the VF
device. I'll fix the README, thanks!

regards,
Jens 

Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices
Posted by Alex Williamson 4 years, 11 months ago
On Tue, 21 May 2019 09:21:57 +0200
Jens Freimann <jfreimann@redhat.com> wrote:

> On Mon, May 20, 2019 at 04:56:57PM -0600, Alex Williamson wrote:
> >On Fri, 17 May 2019 14:58:16 +0200
> >Jens Freimann <jfreimann@redhat.com> wrote:  
> >> Command line example:
> >>
> >> qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
> >>         -machine q35,kernel-irqchip=split -cpu host   \
> >>         -k fr   \
> >>         -serial stdio   \
> >>         -net none \
> >>         -qmp unix:/tmp/qmp.socket,server,nowait \
> >>         -monitor telnet:127.0.0.1:5555,server,nowait \
> >>         -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
> >>         -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
> >>         -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
> >>         -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
> >>         -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
> >>         /root/rhel-guest-image-8.0-1781.x86_64.qcow2
> >>
> >> Then the primary device can be hotplugged via
> >>  (qemu) device_add vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1  
> >
> >Is this standby= option only valid for Network/Ethernet class code
> >devices?  If so, perhaps vfio-pci code should reject the option on any
> >non-ethernet devices.  The option is also non-intuitive for users, only
> >through examples like above can we see it relates to the id of the
> >secondary device.  Could we instead name it something like
> >"standby_net_failover_pair_id="?  
> 
> It is only for ethernet (VFs), I will add code to reject non-ethernet VF devices.
> I agree the name is not descriptive and the one you suggest seems good to
> me. 
> >
> >Also, this feature requires matching MAC addresses per the description,
> >where is that done?  Is it the user's responsibility to set the MAC on
> >the host device prior to the device_add?  If so, is this actually not
> >only specific to ethernet devices, but ethernet VFs?  
> 
> Yes, it's the users responsibility and the MACs are then matched by
> the net_failover driver in the guest. It makes sense for ethernet VFs only,
> I'll add a check for that.

FWIW, I'd probably stop at Ethernet class devices; vfio doesn't really
expose whether a device is a VF, so we'd likely need to resort to
getting that info through sysfs.  It also seems like there might be
some limited-use cases of copying the MAC from a PF to the virtio nic,
or of using utilities on the host to modify a PF MAC, perhaps via
eeprom.  So while we expect the typical use case to be a VF, it's
probably ugly and unnecessarily restrictive to enforce it.
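
For reference, a VF can be recognized on the host by its physfn symlink,
e.g. (using the host address from the example above):

  ls -l /sys/bus/pci/devices/0000:5e:00.2/physfn

Thanks,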

Alex