Changing the device state from STOP_COPY to STOP can take time as the
device may need to free resources and do other operations as part of the
transition. Currently, this is done in vfio_save_complete_precopy() and
therefore it is counted in the migration downtime.
To avoid this, change the device state from STOP_COPY to STOP in
vfio_save_cleanup(), which is called after migration has completed and
thus is not part of migration downtime.
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
hw/vfio/migration.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 2674f4bc47..8acd182a8b 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -383,6 +383,19 @@ static void vfio_save_cleanup(void *opaque)
VFIODevice *vbasedev = opaque;
VFIOMigration *migration = vbasedev->migration;
+ /*
+ * Changing device state from STOP_COPY to STOP can take time. Do it here,
+ * after migration has completed, so it won't increase downtime.
+ */
+ if (migration->device_state == VFIO_DEVICE_STATE_STOP_COPY) {
+ /*
+ * If setting the device in STOP state fails, the device should be
+ * reset. To do so, use ERROR state as a recover state.
+ */
+ vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
+ VFIO_DEVICE_STATE_ERROR);
+ }
+
g_free(migration->data_buffer);
migration->data_buffer = NULL;
migration->precopy_init_size = 0;
@@ -508,12 +521,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
return ret;
}
- /*
- * If setting the device in STOP state fails, the device should be reset.
- * To do so, use ERROR state as a recover state.
- */
- ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
- VFIO_DEVICE_STATE_ERROR);
trace_vfio_save_complete_precopy(vbasedev->name, ret);
return ret;
--
2.26.3
On 7/16/23 10:15, Avihai Horon wrote:
> Changing the device state from STOP_COPY to STOP can take time as the
> device may need to free resources and do other operations as part of the
> transition. Currently, this is done in vfio_save_complete_precopy() and
> therefore it is counted in the migration downtime.
>
> To avoid this, change the device state from STOP_COPY to STOP in
> vfio_save_cleanup(), which is called after migration has completed and
> thus is not part of migration downtime.
>
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Have you tried this series with vGPUs ? If so, are there any improvement
to report ?
Thanks,
C.
> ---
> hw/vfio/migration.c | 19 +++++++++++++------
> 1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 2674f4bc47..8acd182a8b 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -383,6 +383,19 @@ static void vfio_save_cleanup(void *opaque)
> VFIODevice *vbasedev = opaque;
> VFIOMigration *migration = vbasedev->migration;
>
> + /*
> + * Changing device state from STOP_COPY to STOP can take time. Do it here,
> + * after migration has completed, so it won't increase downtime.
> + */
> + if (migration->device_state == VFIO_DEVICE_STATE_STOP_COPY) {
> + /*
> + * If setting the device in STOP state fails, the device should be
> + * reset. To do so, use ERROR state as a recover state.
> + */
> + vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
> + VFIO_DEVICE_STATE_ERROR);
> + }
> +
> g_free(migration->data_buffer);
> migration->data_buffer = NULL;
> migration->precopy_init_size = 0;
> @@ -508,12 +521,6 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> return ret;
> }
>
> - /*
> - * If setting the device in STOP state fails, the device should be reset.
> - * To do so, use ERROR state as a recover state.
> - */
> - ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
> - VFIO_DEVICE_STATE_ERROR);
> trace_vfio_save_complete_precopy(vbasedev->name, ret);
>
> return ret;
On 30/07/2023 19:25, Cédric Le Goater wrote:
> External email: Use caution opening links or attachments
>
>
> On 7/16/23 10:15, Avihai Horon wrote:
>> Changing the device state from STOP_COPY to STOP can take time as the
>> device may need to free resources and do other operations as part of the
>> transition. Currently, this is done in vfio_save_complete_precopy() and
>> therefore it is counted in the migration downtime.
>>
>> To avoid this, change the device state from STOP_COPY to STOP in
>> vfio_save_cleanup(), which is called after migration has completed and
>> thus is not part of migration downtime.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>
> Have you tried this series with vGPUs ? If so, are there any improvement
> to report ?
>
Not with a vGPU.
But I tried it with a ConnectX-7 VF and I could see up to 6% downtime
improvement.
I mentioned this in the cover letter.
Do you want to mention it also in the commit message?
Thanks.
>
>> ---
>> hw/vfio/migration.c | 19 +++++++++++++------
>> 1 file changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 2674f4bc47..8acd182a8b 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -383,6 +383,19 @@ static void vfio_save_cleanup(void *opaque)
>> VFIODevice *vbasedev = opaque;
>> VFIOMigration *migration = vbasedev->migration;
>>
>> + /*
>> + * Changing device state from STOP_COPY to STOP can take time.
>> Do it here,
>> + * after migration has completed, so it won't increase downtime.
>> + */
>> + if (migration->device_state == VFIO_DEVICE_STATE_STOP_COPY) {
>> + /*
>> + * If setting the device in STOP state fails, the device
>> should be
>> + * reset. To do so, use ERROR state as a recover state.
>> + */
>> + vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
>> + VFIO_DEVICE_STATE_ERROR);
>> + }
>> +
>> g_free(migration->data_buffer);
>> migration->data_buffer = NULL;
>> migration->precopy_init_size = 0;
>> @@ -508,12 +521,6 @@ static int vfio_save_complete_precopy(QEMUFile
>> *f, void *opaque)
>> return ret;
>> }
>>
>> - /*
>> - * If setting the device in STOP state fails, the device should
>> be reset.
>> - * To do so, use ERROR state as a recover state.
>> - */
>> - ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP,
>> - VFIO_DEVICE_STATE_ERROR);
>> trace_vfio_save_complete_precopy(vbasedev->name, ret);
>>
>> return ret;
>
On 7/31/23 08:32, Avihai Horon wrote: > > On 30/07/2023 19:25, Cédric Le Goater wrote: >> External email: Use caution opening links or attachments >> >> >> On 7/16/23 10:15, Avihai Horon wrote: >>> Changing the device state from STOP_COPY to STOP can take time as the >>> device may need to free resources and do other operations as part of the >>> transition. Currently, this is done in vfio_save_complete_precopy() and >>> therefore it is counted in the migration downtime. >>> >>> To avoid this, change the device state from STOP_COPY to STOP in >>> vfio_save_cleanup(), which is called after migration has completed and >>> thus is not part of migration downtime. >>> >>> Signed-off-by: Avihai Horon <avihaih@nvidia.com> >> >> Have you tried this series with vGPUs ? If so, are there any improvement >> to report ? >> > Not with a vGPU. > But I tried it with a ConnectX-7 VF and I could see up to 6% downtime improvement. > I mentioned this in the cover letter. > Do you want to mention it also in the commit message? This is not necessary. Thanks, C.
© 2016 - 2026 Red Hat, Inc.