[PATCH 00/14] migration: Improve error reporting

Cédric Le Goater posted 14 patches 9 months, 3 weeks ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20240207133347.1115903-1-clg@redhat.com
Maintainers: Stefano Stabellini <sstabellini@kernel.org>, Anthony Perard <anthony.perard@citrix.com>, Paul Durrant <paul@xen.org>, "Michael S. Tsirkin" <mst@redhat.com>, Marcel Apfelbaum <marcel.apfelbaum@gmail.com>, Paolo Bonzini <pbonzini@redhat.com>, Richard Henderson <richard.henderson@linaro.org>, Eduardo Habkost <eduardo@habkost.net>, Nicholas Piggin <npiggin@gmail.com>, Daniel Henrique Barboza <danielhb413@gmail.com>, "Cédric Le Goater" <clg@kaod.org>, David Gibson <david@gibson.dropbear.id.au>, Harsh Prateek Bora <harshpb@linux.ibm.com>, Halil Pasic <pasic@linux.ibm.com>, Christian Borntraeger <borntraeger@linux.ibm.com>, Eric Farman <farman@linux.ibm.com>, David Hildenbrand <david@redhat.com>, Ilya Leoshkevich <iii@linux.ibm.com>, Thomas Huth <thuth@redhat.com>, Alex Williamson <alex.williamson@redhat.com>, Peter Xu <peterx@redhat.com>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Fabiano Rosas <farosas@suse.de>, Stefan Hajnoczi <stefanha@redhat.com>, Fam Zheng <fam@euphon.net>, Eric Blake <eblake@redhat.com>, Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>, John Snow <jsnow@redhat.com>, Hyman Huang <yong.huang@smartx.com>
There is a newer version of this series
include/exec/memory.h                 | 12 ++--
include/hw/vfio/vfio-common.h         |  2 +-
include/hw/vfio/vfio-container-base.h |  4 +-
include/migration/register.h          |  4 +-
hw/i386/xen/xen-hvm.c                 |  8 +--
hw/ppc/spapr.c                        |  2 +-
hw/s390x/s390-stattrib.c              |  2 +-
hw/vfio/common.c                      | 96 ++++++++++++++++-----------
hw/vfio/container-base.c              |  4 +-
hw/vfio/container.c                   |  6 +-
hw/vfio/migration.c                   | 87 +++++++++++++++---------
hw/vfio/pci.c                         |  5 +-
hw/virtio/vhost.c                     |  4 +-
migration/block-dirty-bitmap.c        |  2 +-
migration/block.c                     |  2 +-
migration/dirtyrate.c                 | 24 +++++--
migration/migration.c                 | 16 ++---
migration/qemu-file.c                 |  5 +-
migration/ram.c                       | 40 ++++++++---
migration/savevm.c                    | 14 ++--
system/memory.c                       | 37 +++++++----
21 files changed, 236 insertions(+), 140 deletions(-)
Hello,

The motivation behind these changes is to report more detailed errors to
the upper management layer (libvirt), so that it can decide, depending
on the reported error, whether to try migration again later. This would
be useful in cases where migration fails due to a lack of HW resources
on the host. For instance, some adapters can only initiate a limited
number of simultaneous dirty tracking requests, and this imposes a limit
on the number of VMs that can be migrated simultaneously.

We are not quite ready for such a mechanism, but what we can do first is
to clean up the error reporting in the early save_setup sequence. This
is what the following changes propose, by adding an Error argument to
various handlers and propagating it to the core migration subsystem.
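
To illustrate the pattern, here is a minimal standalone sketch of an
Error argument being filled in by a low-level handler and propagated up
to a core caller. It does not use QEMU's real Error API: the Error
struct, sketch_error_set() and all sketch_* functions and parameters are
hypothetical stand-ins invented for this example, and the actual
handlers in the series may look different.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for an error object; NOT QEMU's qapi Error type. */
typedef struct Error {
    char msg[128];
} Error;

/* Hypothetical helper: store a formatted error in *errp if the caller asked for one. */
static void sketch_error_set(Error **errp, const char *fmt, ...)
{
    va_list ap;

    if (!errp) {
        return;                     /* caller is not interested in details */
    }
    assert(*errp == NULL);          /* never overwrite an earlier error */
    *errp = calloc(1, sizeof(**errp));
    if (!*errp) {
        abort();
    }
    va_start(ap, fmt);
    vsnprintf((*errp)->msg, sizeof((*errp)->msg), fmt, ap);
    va_end(ap);
}

/* Device-level step: fails when the (assumed) pool of HW trackers is exhausted. */
static int sketch_dirty_tracking_start(int free_trackers, Error **errp)
{
    if (free_trackers <= 0) {
        sketch_error_set(errp, "dirty tracking: no HW tracker available");
        return -1;
    }
    return 0;
}

/* save_setup-style handler: forwards errp instead of printing locally. */
static int sketch_save_setup(int free_trackers, Error **errp)
{
    return sketch_dirty_tracking_start(free_trackers, errp);
}

/* Core-level caller: receives the detailed error and reports it once. */
static int sketch_migration_start(int free_trackers)
{
    Error *local_err = NULL;

    if (sketch_save_setup(free_trackers, &local_err) < 0) {
        fprintf(stderr, "migration setup failed: %s\n", local_err->msg);
        free(local_err);
        return -1;
    }
    return 0;
}
```

With this shape, the lowest layer describes what went wrong and only
the top-level caller decides how to report it, which is what would let
a management layer see the specific cause of the failure.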

The last patches try to address a related issue found on VMs with
assigned MLX5 VF devices. These adapters have the HW limitation
described above. If dirty tracking setup fails and the return path is
in use, the return-path thread does not terminate, leaving the source
and destination VMs waiting for an event to occur.

The last patch is still an RFC because the correct fix is not obvious
and implies reworking the QEMUFile software construct, built on top of
the QEMU I/O channel.
 
Thanks,

C.

[1] https://lore.kernel.org/qemu-devel/20240201184853.890471-1-clg@redhat.com/

Cédric Le Goater (14):
  migration: Add Error** argument to .save_setup() handler
  migration: Add Error** argument to .load_setup() handler
  memory: Add Error** argument to .log_global*() handlers
  migration: Modify ram_init_bitmaps() to report dirty tracking errors
  vfio: Add Error** argument to .set_dirty_page_tracking() handler
  vfio: Add Error** argument to vfio_devices_dma_logging_start()
  vfio: Add Error** argument to vfio_devices_dma_logging_stop()
  vfio: Use new Error** argument in vfio_save_setup()
  vfio: Add Error** argument to .vfio_save_config() handler
  vfio: Also trace event failures in vfio_save_complete_precopy()
  vfio: Extend vfio_set_migration_error() with Error* argument
  migration: Report error when shutdown fails
  migration: Use migrate_has_error() in close_return_path_on_source()
  migration: Fix return-path thread exit


-- 
2.43.0