From nobody Sat May 30 17:44:07 2026
From: "Maciej S. Szmigiero"
To: Alex Williamson, Cédric Le Goater
Cc: Peter Xu, Fabiano Rosas, Paolo Bonzini, Avihai Horon, qemu-devel@nongnu.org
Subject: [PATCH 1/2] system/runstate: Allow adjustment of priority for VM state change handlers
Date: Wed, 20 May 2026 16:41:37 +0200
X-Mailer: git-send-email 2.53.0

From: "Maciej S. Szmigiero"

A future patch will need the ability to register qdev VM state change
handlers below and above the normal priority level for the registering
device's qdev tree depth, but still properly ordered with respect to
handlers registered at other tree depths.

To implement this, split the priority argument passed to
qemu_add_vm_change_state_handler_prio_full() into two parts: its 15 most
significant bits (below the sign bit) will now carry the actual qdev tree
depth while the 16 least significant bits will now carry the caller-provided
priority adjustment value.
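The split described above can be sketched in plain C as follows. This is
only an illustration under the assumption of a 32-bit int: encode_prio(),
prio_depth() and prio_adj() are hypothetical helper names that do not exist
in QEMU, and the arithmetic simply mirrors the expression this patch adds to
qdev_add_vm_change_state_handler_full().

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): pack a qdev tree depth and a
 * signed adjustment into one non-negative int. The depth lands in bits
 * 16..30 and the adjustment, biased to be non-negative, in bits 0..15.
 */
static int encode_prio(unsigned int depth, int adj)
{
    assert(depth <= INT16_MAX);
    assert(adj >= INT16_MIN && adj <= INT16_MAX);
    /* Subtracting INT16_MIN maps adj from [-32768, 32767] to [0, 65535]. */
    return (int)((depth << 16) + (unsigned int)(adj - INT16_MIN));
}

/* Hypothetical inverse helpers, shown only to make the layout explicit. */
static unsigned int prio_depth(int prio)
{
    return (unsigned int)prio >> 16;
}

static int prio_adj(int prio)
{
    return (int)((unsigned int)prio & 0xFFFF) + INT16_MIN;
}
```

With this layout every adjustment value at depth 1 still sorts below every
adjustment value at depth 2, and an adjustment of 0 sits in the middle of
its depth's range.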
Although this will limit the qdev tree to a depth of 32k, such a high limit
shouldn't be a problem in practice.

Signed-off-by: Maciej S. Szmigiero
---
 hw/core/vm-change-state-handler.c | 22 +++++++++++++++-------
 hw/vfio/migration.c               |  2 +-
 include/system/runstate.h         |  2 +-
 3 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/hw/core/vm-change-state-handler.c b/hw/core/vm-change-state-handler.c
index 2c111350298d..3db0819984c6 100644
--- a/hw/core/vm-change-state-handler.c
+++ b/hw/core/vm-change-state-handler.c
@@ -19,9 +19,9 @@
 #include "hw/core/qdev.h"
 #include "system/runstate.h"
 
-static int qdev_get_dev_tree_depth(DeviceState *dev)
+static unsigned int qdev_get_dev_tree_depth(DeviceState *dev)
 {
-    int depth;
+    unsigned int depth;
 
     for (depth = 0; dev; depth++) {
         BusState *bus = dev->parent_bus;
@@ -61,20 +61,28 @@ VMChangeStateEntry *qdev_add_vm_change_state_handler(DeviceState *dev,
                                                      void *opaque)
 {
     assert(!cb || !cb_ret);
-    return qdev_add_vm_change_state_handler_full(dev, cb, NULL, cb_ret, opaque);
+    return qdev_add_vm_change_state_handler_full(dev, cb, NULL, cb_ret, opaque, 0);
 }
 
 /*
  * Exactly like qdev_add_vm_change_state_handler() but passes a prepare_cb
- * and the cb_ret arguments too.
+ * and the cb_ret arguments too and allows for adjustment of priority.
  */
 VMChangeStateEntry *qdev_add_vm_change_state_handler_full(
     DeviceState *dev, VMChangeStateHandler *cb, VMChangeStateHandler *prepare_cb,
-    VMChangeStateHandlerWithRet *cb_ret, void *opaque)
+    VMChangeStateHandlerWithRet *cb_ret, void *opaque, int adj)
 {
-    int depth = qdev_get_dev_tree_depth(dev);
+    unsigned int depth = qdev_get_dev_tree_depth(dev);
+    int prio;
+
+    /* 32k depth should be enough for everyone */
+    assert(depth <= INT16_MAX);
+
+    /* encode depth on 15 MSB and adj on 16 LSB */
+    assert(adj >= INT16_MIN && adj <= INT16_MAX);
+    prio = (depth << 16) + (adj - INT16_MIN);
 
     assert(!cb || !cb_ret);
     return qemu_add_vm_change_state_handler_prio_full(cb, prepare_cb, cb_ret,
-                                                      opaque, depth);
+                                                      opaque, prio);
 }
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index dbfd13b83a15..9889b20ad7dd 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -1227,7 +1227,7 @@ static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
                  vfio_vmstate_change_prepare :
                  NULL;
     migration->vm_state = qdev_add_vm_change_state_handler_full(
-        vbasedev->dev, vfio_vmstate_change, prepare_cb, NULL, vbasedev);
+        vbasedev->dev, vfio_vmstate_change, prepare_cb, NULL, vbasedev, 0);
     migration_add_notifier(&migration->migration_state,
                            vfio_migration_state_notifier);
 
diff --git a/include/system/runstate.h b/include/system/runstate.h
index 929379adae41..306e2684c195 100644
--- a/include/system/runstate.h
+++ b/include/system/runstate.h
@@ -69,7 +69,7 @@ VMChangeStateEntry *qdev_add_vm_change_state_handler(DeviceState *dev,
                                                      void *opaque);
 VMChangeStateEntry *qdev_add_vm_change_state_handler_full(
     DeviceState *dev, VMChangeStateHandler *cb, VMChangeStateHandler *prepare_cb,
-    VMChangeStateHandlerWithRet *cb_ret, void *opaque);
+    VMChangeStateHandlerWithRet *cb_ret, void *opaque, int adj);
 void qemu_del_vm_change_state_handler(VMChangeStateEntry *e);
 /**
  * vm_state_notify: Notify the state of the VM
From nobody Sat May 30 17:44:07 2026
From: "Maciej S. Szmigiero"
To: Alex Williamson, Cédric Le Goater
Cc: Peter Xu, Fabiano Rosas, Paolo Bonzini, Avihai Horon, qemu-devel@nongnu.org
Subject: [PATCH 2/2] vfio/migration: Parallelize device state transitions
Date: Wed, 20 May 2026 16:41:38 +0200
Message-ID: <743f03758d2a30df4093db140ccdd0f91f69bd7e.1779217494.git.maciej.szmigiero@oracle.com>
X-Mailer: git-send-email 2.53.0

From: "Maciej S. Szmigiero"

When multiple VFIO devices are present in a VM, the fact that their state
transitions on migration happen sequentially has a visible impact on
migration downtime.

This is because both the PRE_COPY -> PRE_COPY_P2P -> STOP_COPY transitions
on the source and the RESUMING -> RUNNING_P2P -> RUNNING transitions on the
target happen during the switchover phase.
During this phase the VM is stopped, so the downtime clock is ticking.

These device state transitions are performed by VM state change handlers
registered by the VFIO device migration code.

Instead of performing such state transitions synchronously, use the priority
adjustment mechanism from the previous patch to launch a thread performing
the state change at the priority level *before* all other VM state change
handlers at the particular VFIO device's qdev tree depth.
Only wait for this thread to finish at the priority level ordered *after*
all other handlers at this tree depth.

This way these state transitions can happen in parallel not only with
respect to other VFIO device instances but also with ordinary (serialized)
handlers for other devices at this qdev tree depth, while still being
properly ordered with respect to handlers registered at other tree depths.

Unfortunately, the order in which VM state handlers are called depends on
whether the VM is starting or stopping.
Because of this, one extra layer of indirection is necessary to make the
(first, second) ordering of these handlers constant.

Enable the feature by default since it has no impact on the migration
bit stream protocol - it shouldn't need disabling for anything other than
debugging scenarios.

Signed-off-by: Maciej S. Szmigiero
---
 hw/vfio/migration.c               | 174 ++++++++++++++++++++++++++++--
 hw/vfio/pci.c                     |   2 +
 hw/vfio/vfio-migration-internal.h |   4 +-
 include/hw/vfio/vfio-device.h     |   1 +
 4 files changed, 173 insertions(+), 8 deletions(-)

diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 9889b20ad7dd..fd2b37a85f4b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -1047,6 +1047,135 @@ static void vfio_vmstate_change(void *opaque, bool running, RunState state)
                               mig_state_to_str(new_state));
 }
 
+typedef struct VMStateChangeThreadData {
+    VFIODevice *vbasedev;
+    bool is_prepare;
+    bool running;
+    RunState state;
+} VMStateChangeThreadData;
+
+static void *vfio_vmstate_change_thread(void *opaque)
+{
+    g_autofree VMStateChangeThreadData *data = opaque;
+
+    if (data->is_prepare) {
+        vfio_vmstate_change_prepare(data->vbasedev, data->running, data->state);
+    } else {
+        vfio_vmstate_change(data->vbasedev, data->running, data->state);
+    }
+
+    return NULL;
+}
+
+static void vfio_vmstate_change_thread_launch(VFIODevice *vbasedev,
+                                              bool is_prepare,
+                                              bool running,
+                                              RunState state)
+{
+    VFIOMigration *migration = vbasedev->migration;
+    VMStateChangeThreadData *data = g_new(VMStateChangeThreadData, 1);
+
+    data->vbasedev = vbasedev;
+    data->is_prepare = is_prepare;
+    data->running = running;
+    data->state = state;
+
+    assert(!migration->vm_state_thread_running);
+    migration->vm_state_thread_running = true;
+
+    qemu_thread_create(&migration->vm_state_thread,
+                       is_prepare ? "vfio_vmstate_change_prepare" :
+                       "vfio_vmstate_change",
+                       vfio_vmstate_change_thread, data,
+                       QEMU_THREAD_JOINABLE);
+}
+
+static void vfio_vmstate_change_thread_join(VFIODevice *vbasedev)
+{
+    VFIOMigration *migration = vbasedev->migration;
+
+    assert(migration->vm_state_thread_running);
+
+    qemu_thread_join(&migration->vm_state_thread);
+
+    migration->vm_state_thread_running = false;
+}
+
+/*
+ * The first handler called during a vmstate change at a particular depth -
+ * launch the VFIO device state change thread.
+ */
+static void vfio_vmstate_change_first(VFIODevice *vbasedev,
+                                      bool is_prepare,
+                                      bool running, RunState state)
+{
+    vfio_vmstate_change_thread_launch(vbasedev,
+                                      is_prepare,
+                                      running,
+                                      state);
+}
+
+/*
+ * The last handler called during a vmstate change at a particular depth -
+ * wait for the VFIO device state change thread to finish.
+ */
+static void vfio_vmstate_change_second(VFIODevice *vbasedev)
+{
+    vfio_vmstate_change_thread_join(vbasedev);
+}
+
+/*
+ * Lower priority number handler:
+ * Called before higher number handler when VM is starting
+ * but after higher number handler when VM is stopping.
+ */
+static void vfio_vmstate_change_prepare_lower_prio(void *opaque, bool running,
+                                                   RunState state)
+{
+    if (running) {
+        vfio_vmstate_change_first(opaque, true, running, state);
+    } else {
+        vfio_vmstate_change_second(opaque);
+    }
+}
+
+/*
+ * Higher priority number handler:
+ * Called after lower number handler when VM is starting
+ * but before lower number handler when VM is stopping.
+ */
+static void vfio_vmstate_change_prepare_higher_prio(void *opaque, bool running,
+                                                    RunState state)
+{
+    if (running) {
+        vfio_vmstate_change_second(opaque);
+    } else {
+        vfio_vmstate_change_first(opaque, true, running, state);
+    }
+}
+
+/* Same ordering issues as for vfio_vmstate_change_prepare_lower_prio() */
+static void vfio_vmstate_change_lower_prio(void *opaque, bool running,
+                                           RunState state)
+{
+    if (running) {
+        vfio_vmstate_change_first(opaque, false, running, state);
+    } else {
+        vfio_vmstate_change_second(opaque);
+    }
+}
+
+/* Same ordering issues as for vfio_vmstate_change_prepare_higher_prio() */
+static void vfio_vmstate_change_higher_prio(void *opaque, bool running,
+                                            RunState state)
+{
+    if (running) {
+        vfio_vmstate_change_second(opaque);
+    } else {
+        vfio_vmstate_change_first(opaque, false, running, state);
+    }
+}
+
 static int vfio_migration_state_notifier(NotifierWithReturn *notifier,
                                          MigrationEvent *e, Error **errp)
 {
@@ -1063,6 +1192,8 @@ static int vfio_migration_state_notifier(NotifierWithReturn *notifier,
          * MigrationNotifyFunc may not return an error code and an Error
          * object for MIG_EVENT_FAILED. Hence, report the error
          * locally and ignore the errp argument.
+         * This state change is not parallelized as it is not expected to be
+         * performance critical.
          */
         ret = vfio_migration_set_state_or_reset(vbasedev,
                                                 VFIO_DEVICE_STATE_RUNNING,
@@ -1143,7 +1274,7 @@ static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
     char id[256] = "";
     g_autofree char *path = NULL, *oid = NULL;
     uint64_t mig_flags = 0;
-    VMChangeStateHandler *prepare_cb;
+    VMChangeStateHandler *prepare_cb, *prepare_cb_lower, *prepare_cb_higher;
 
     if (!vbasedev->ops->vfio_get_object) {
         ret = -EINVAL;
@@ -1223,11 +1354,34 @@ static int vfio_migration_init(VFIODevice *vbasedev, Error **errp)
     register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers,
                          vbasedev);
 
-    prepare_cb = migration->mig_flags & VFIO_MIGRATION_P2P ?
-                 vfio_vmstate_change_prepare :
-                 NULL;
-    migration->vm_state = qdev_add_vm_change_state_handler_full(
-        vbasedev->dev, vfio_vmstate_change, prepare_cb, NULL, vbasedev, 0);
+    if (vbasedev->migration_parallel_states) {
+        /*
+         * Unfortunately, the order in which vmstate handlers are called depends
+         * on whether the VM is starting or stopping.
+         * Because of this, one extra layer of indirection is necessary
+         * to make the (first, second) ordering of these handlers constant.
+         */
+        prepare_cb_lower = migration->mig_flags & VFIO_MIGRATION_P2P ?
+            vfio_vmstate_change_prepare_lower_prio : NULL;
+        prepare_cb_higher = migration->mig_flags & VFIO_MIGRATION_P2P ?
+            vfio_vmstate_change_prepare_higher_prio : NULL;
+        migration->vm_state_lower_prio = qdev_add_vm_change_state_handler_full(
+            vbasedev->dev, vfio_vmstate_change_lower_prio, prepare_cb_lower,
+            NULL, vbasedev, -1);
+        migration->vm_state_higher_prio = qdev_add_vm_change_state_handler_full(
+            vbasedev->dev, vfio_vmstate_change_higher_prio, prepare_cb_higher,
+            NULL, vbasedev, 1);
+    } else {
+        prepare_cb = migration->mig_flags & VFIO_MIGRATION_P2P ?
+            vfio_vmstate_change_prepare : NULL;
+        /* Arbitrarily use lower_prio field to store non-parallel handler */
+        migration->vm_state_lower_prio =
+            qdev_add_vm_change_state_handler_full(vbasedev->dev,
+                                                  vfio_vmstate_change,
+                                                  prepare_cb, NULL,
+                                                  vbasedev, 0);
+    }
+
     migration_add_notifier(&migration->migration_state,
                            vfio_migration_state_notifier);
 
@@ -1302,7 +1456,13 @@ static void vfio_migration_deinit(VFIODevice *vbasedev)
     VFIOMigration *migration = vbasedev->migration;
 
     migration_remove_notifier(&migration->migration_state);
-    qemu_del_vm_change_state_handler(migration->vm_state);
+
+    if (vbasedev->migration_parallel_states) {
+        qemu_del_vm_change_state_handler(migration->vm_state_higher_prio);
+    }
+    /* Non-parallel state change uses lower_prio field to store its handler */
+    qemu_del_vm_change_state_handler(migration->vm_state_lower_prio);
+
     unregister_savevm(VMSTATE_IF(vbasedev->dev), "vfio", vbasedev);
     vfio_migration_free(vbasedev);
     vfio_unblock_multiple_devices_migration();
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b2a07f6bb421..fa2411474c9b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3777,6 +3777,8 @@ static const Property vfio_pci_properties[] = {
                             ON_OFF_AUTO_AUTO),
     DEFINE_PROP_SIZE("x-migration-max-queued-buffers-size", VFIOPCIDevice,
                      vbasedev.migration_max_queued_buffers_size, UINT64_MAX),
+    DEFINE_PROP_BOOL("x-migration-parallel-states", VFIOPCIDevice,
+                     vbasedev.migration_parallel_states, true),
     DEFINE_PROP_BOOL("migration-events", VFIOPCIDevice,
                      vbasedev.migration_events, false),
     DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
diff --git a/hw/vfio/vfio-migration-internal.h b/hw/vfio/vfio-migration-internal.h
index a1c58b112685..9fb00f9f4d7d 100644
--- a/hw/vfio/vfio-migration-internal.h
+++ b/hw/vfio/vfio-migration-internal.h
@@ -38,7 +38,9 @@ typedef struct VFIOMultifd VFIOMultifd;
 
 typedef struct VFIOMigration {
     struct VFIODevice *vbasedev;
-    VMChangeStateEntry *vm_state;
+    VMChangeStateEntry *vm_state_lower_prio, *vm_state_higher_prio;
+    QemuThread vm_state_thread;
+    bool vm_state_thread_running;
     NotifierWithReturn migration_state;
     uint32_t device_state;
     int data_fd;
diff --git a/include/hw/vfio/vfio-device.h b/include/hw/vfio/vfio-device.h
index 380a55d6e5ea..28004e1e99f4 100644
--- a/include/hw/vfio/vfio-device.h
+++ b/include/hw/vfio/vfio-device.h
@@ -69,6 +69,7 @@ typedef struct VFIODevice {
     OnOffAuto migration_multifd_transfer;
     OnOffAuto migration_load_config_after_iter;
     uint64_t migration_max_queued_buffers_size;
+    bool migration_parallel_states;
     bool migration_events;
     bool use_region_fds;
     VFIODeviceOps *ops;
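
The ordering indirection used by this patch can be illustrated with a small
single-threaded model. It is a sketch, not QEMU code: notify(), record() and
the handler table are hypothetical, launching and joining the worker thread
are replaced by appending 'L' and 'J' to a trace string, and the only
behaviour taken from the patch is that handlers run in ascending priority
order when the VM starts and in descending order when it stops.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Trace of actions: 'L' = launch, 'o' = some other handler, 'J' = join. */
static char trace[16];
static size_t trace_len;

static void record(char c)
{
    assert(trace_len < sizeof(trace) - 1);
    trace[trace_len++] = c;
    trace[trace_len] = '\0';
}

/* Stand-ins for launching and joining the state change thread. */
static void change_first(void)  { record('L'); }
static void change_second(void) { record('J'); }

/* Registered with adjustment -1: runs first on start, last on stop. */
static void lower_prio(bool running)
{
    if (running) {
        change_first();
    } else {
        change_second();
    }
}

/* Registered with adjustment +1: runs last on start, first on stop. */
static void higher_prio(bool running)
{
    if (running) {
        change_second();
    } else {
        change_first();
    }
}

/* An ordinary handler of another device at the same depth (adjustment 0). */
static void other_handler(bool running)
{
    (void)running;
    record('o');
}

typedef void (*Handler)(bool running);

/* Handlers at one tree depth, sorted by ascending priority number. */
static const Handler handlers[] = { lower_prio, other_handler, higher_prio };

/* Hypothetical dispatcher: ascending order on start, descending on stop. */
static const char *notify(bool running)
{
    size_t n = sizeof(handlers) / sizeof(handlers[0]);

    trace_len = 0;
    trace[0] = '\0';
    if (running) {
        for (size_t i = 0; i < n; i++) {
            handlers[i](running);
        }
    } else {
        for (size_t i = n; i-- > 0; ) {
            handlers[i](running);
        }
    }
    return trace;
}
```

In this model both notify(true) and notify(false) produce the same
launch, other, join sequence, which is the constant (first, second) ordering
the extra layer of indirection is meant to provide.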