[PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature

Avihai Horon posted 14 patches 2 weeks, 5 days ago
[PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature
Posted by Avihai Horon 2 weeks, 5 days ago
Hello,

This series adds support for the kernel VFIO_PRECOPY_INFO_REINIT feature
[1], which can reduce migration downtime in some VFIO precopy scenarios
(see the Tests section below). Supporting it requires refactoring
switchover-ack and making it re-usable; that work comes first.

The series is based on Peter's VFIO fixes series [2].
Based-on: 20260421202110.306051-1-peterx@redhat.com

=== Background ===

Switchover-ack is a mechanism to synchronize between source and
destination QEMU during migration to prevent the source from switching
over prematurely.

VFIO uses switchover-ack to ensure that switchover happens only after
the destination side has loaded the precopy initial bytes. This is
important for VFIO, as otherwise downtime could increase.

In its current state, switchover-ack is a one-time mechanism: switchover
is acked only once, and after that another ACK cannot be requested. This
was sufficient until now, as VFIO precopy initial bytes were defined to
be monotonically decreasing. Thus, when precopy initial bytes reached
zero for all VFIO devices, a single ACK would be sent and it would
remain valid.

=== Problem ===

Now the new VFIO_PRECOPY_INFO_REINIT feature allows precopy initial
bytes to be re-initialized during precopy. Specifically, it means that
initial bytes can grow after reaching zero, which would invalidate a
previously sent switchover ACK.

=== Solution ===

To support VFIO_PRECOPY_INFO_REINIT, this series makes switchover-ack
re-usable and allows devices to request another switchover ACK when
needed.

In the new mechanism, a switchover ACK can be requested for a specific
device and at different times, so switchover ACK is changed to be
per-device (instead of a single ACK for all devices), and the source
side does the accounting of pending ACKs (instead of the destination
side).

The old (legacy) switchover-ack mechanism is kept for backward
compatibility and is turned on by a compatibility property for older
machines.

With that infrastructure in place, VFIO uses the new switchover-ack path
and implements VFIO_PRECOPY_INFO_REINIT.

=== Tests ===

Functional and performance tests were run.

Performance tests were done by migrating a single VM with:
* 8 GB RAM
* 4 mlx5 VFIO devices:
  - One device with 1GB of device data (stopcopy data) that runs a
    workload during precopy, so that VFIO_PRECOPY_INFO_REINIT is
    exercised (the workload generates new initial_bytes chunks during
    precopy).
  - The other 3 devices are idle.

In this setup, VFIO_PRECOPY_INFO_REINIT reduced total migration downtime by
about 43%, and downtime attributed to the busy VFIO device by about 67%:

With VFIO_PRECOPY_INFO_REINIT:
  1335ms total (~520ms from the VFIO device running the workload).

Without VFIO_PRECOPY_INFO_REINIT:
  2352ms total (~1600ms from the VFIO device running the workload).
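
For reference, the two percentages follow directly from the four
measurements above. A minimal sketch of the arithmetic (the helper name
reduction_percent() is made up for this example, and the result is
truncated to an integer):

```c
#include <assert.h>

/* Percentage reduction from before_ms to after_ms, truncated to an integer. */
static unsigned reduction_percent(unsigned before_ms, unsigned after_ms)
{
    assert(before_ms > 0 && after_ms <= before_ms);
    return (before_ms - after_ms) * 100 / before_ms;
}
```

reduction_percent(2352, 1335) gives 43 for the total downtime, and
reduction_percent(1600, 520) gives 67 for the busy VFIO device.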

Functional tests covered the main code paths and combinations, including
legacy and new switchover-ack across versions, for example:
* Migration between QEMU 11.0 (old binary) and 11.1 (new binary).
* Migration between two 11.1 binaries with different machine versions.
* Migration when the VFIO device has no VFIO_PRECOPY_INFO_REINIT.
* Migration when the VFIO device has no VFIO precopy support.

=== Patch Breakdown ===

* Patches 1-2: Update Linux headers
* Patches 3-8,14: Migration cleanups and the new switchover-ack mechanism
* Patches 9-13: VFIO cleanups and VFIO_PRECOPY_INFO_REINIT

Thanks.

[1] https://lore.kernel.org/all/20260317161753.18964-1-yishaih@nvidia.com/
[2] https://lore.kernel.org/qemu-devel/20260421202110.306051-1-peterx@redhat.com/

Avihai Horon (14):
  scripts/update-linux-headers: Add typelimits.h
  linux-headers: Update to Linux v7.1-rc1
  migration: Propagate errors in migration_completion_precopy()
  migration: Log the approver in qemu_loadvm_approve_switchover()
  migration: Replace switchover_ack_needed SaveVMHandler
  migration: Rename switchover-ack code to legacy
  migration: Make switchover-ack re-usable
  migration: Check switchover-ack during switchover phase
  vfio/migration: Re-query precopy size before sending
    VFIO_MIG_FLAG_DEV_INIT_DATA_SENT
  vfio/migration: Add Error ** parameter to vfio_migration_init()
  vfio/migration: Add new switchover-ack mechanism
  vfio/migration: Implement VFIO_PRECOPY_INFO_REINIT feature
  vfio/migration: Check VFIO_PRECOPY_INFO_REINIT during switchover
  migration: Enable new switchover-ack

 docs/devel/migration/vfio.rst                 |   4 +-
 hw/vfio/vfio-migration-internal.h             |   2 +
 include/migration/client-options.h            |   1 +
 include/migration/misc.h                      |   2 +
 include/migration/register.h                  |  21 +-
 include/standard-headers/drm/drm_fourcc.h     |  28 +-
 include/standard-headers/linux/const.h        |  18 +
 include/standard-headers/linux/ethtool.h      |  28 +-
 .../linux/input-event-codes.h                 |  13 +
 include/standard-headers/linux/pci_regs.h     |  71 ++-
 include/standard-headers/linux/typelimits.h   |   8 +
 include/standard-headers/linux/virtio_ring.h  |   3 +-
 include/standard-headers/linux/virtio_rtc.h   | 237 ++++++++++
 include/standard-headers/linux/vmclock-abi.h  |  20 +
 linux-headers/asm-arm64/kvm.h                 |   1 +
 linux-headers/asm-arm64/unistd_64.h           |   1 +
 linux-headers/asm-generic/unistd.h            |   5 +-
 linux-headers/asm-loongarch/kvm.h             |   5 +
 linux-headers/asm-loongarch/kvm_para.h        |   1 +
 linux-headers/asm-loongarch/unistd_64.h       |   2 +
 linux-headers/asm-mips/unistd_n32.h           |   1 +
 linux-headers/asm-mips/unistd_n64.h           |   1 +
 linux-headers/asm-mips/unistd_o32.h           |   1 +
 linux-headers/asm-powerpc/unistd_32.h         |   1 +
 linux-headers/asm-powerpc/unistd_64.h         |   1 +
 linux-headers/asm-riscv/kvm.h                 |  11 +-
 linux-headers/asm-riscv/ptrace.h              |  37 ++
 linux-headers/asm-riscv/unistd_32.h           |   1 +
 linux-headers/asm-riscv/unistd_64.h           |   1 +
 linux-headers/asm-s390/unistd_32.h            | 446 ------------------
 linux-headers/asm-s390/unistd_64.h            |   1 +
 linux-headers/asm-x86/kvm.h                   |  21 +-
 linux-headers/asm-x86/unistd_32.h             |   1 +
 linux-headers/asm-x86/unistd_64.h             |   1 +
 linux-headers/asm-x86/unistd_x32.h            |   1 +
 linux-headers/linux/const.h                   |  18 +
 linux-headers/linux/iommufd.h                 |  48 ++
 linux-headers/linux/kvm.h                     |  46 +-
 linux-headers/linux/mshv.h                    |   4 +-
 linux-headers/linux/psp-sev.h                 |   2 +-
 linux-headers/linux/stddef.h                  |   4 +
 linux-headers/linux/vduse.h                   |  85 +++-
 linux-headers/linux/vfio.h                    |  30 +-
 migration/migration.h                         |  17 +-
 migration/savevm.h                            |   4 +-
 hw/core/machine.c                             |   4 +-
 hw/vfio/migration.c                           | 195 +++++++-
 migration/migration.c                         |  65 ++-
 migration/options.c                           |   9 +
 migration/savevm.c                            | 146 ++++--
 hw/vfio/trace-events                          |   5 +-
 migration/trace-events                        |   8 +-
 scripts/update-linux-headers.sh               |   2 +
 53 files changed, 1109 insertions(+), 580 deletions(-)
 create mode 100644 include/standard-headers/linux/typelimits.h
 create mode 100644 include/standard-headers/linux/virtio_rtc.h
 delete mode 100644 linux-headers/asm-s390/unistd_32.h

-- 
2.40.1
Re: [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature
Posted by Peter Xu 5 days, 7 hours ago
On Tue, May 05, 2026 at 11:14:09AM +0300, Avihai Horon wrote:
> Performance tests were done by migrating a single VM with:
> * 8 GB RAM
> * 4 mlx5 VFIO devices:
>   - One device with 1GB of device data (stopcopy data) that runs
>     workload during precopy so VFIO_PRECOPY_INFO_REINIT is exercised
>     (generate new initial_bytes chunks during precopy).

Could you elaborate a bit more on what workload is executed, and how it
affects REINIT reporting (e.g. is only one REINIT generated, or does it
keep being generated)?

Can I understand it this way: without REINIT, the device is forced to put
that data into the stopcopy size; then with REINIT, some of the stopcopy
size is essentially moved back to the precopy phase?

Thanks,

-- 
Peter Xu
Re: [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature
Posted by Avihai Horon 3 days, 13 hours ago
On 5/19/2026 11:09 PM, Peter Xu wrote:
> On Tue, May 05, 2026 at 11:14:09AM +0300, Avihai Horon wrote:
>> Performance tests were done by migrating a single VM with:
>> * 8 GB RAM
>> * 4 mlx5 VFIO devices:
>>    - One device with 1GB of device data (stopcopy data) that runs
>>      workload during precopy so VFIO_PRECOPY_INFO_REINIT is exercised
>>      (generate new initial_bytes chunks during precopy).
> Could you elaborate a bit more on what workload is executed, and how that
> will affect REINIT reportings (e.g. is only one REINIT generated, or it
> keeps generating)?

Basically, I create and destroy RDMA resources (MRs, QPs, CQs, etc.) on 
the VFIO device in a loop for several iterations.
This generates several REINITs.

>
> Can I understand it in this way: without REINIT, device is forced to put
> those data into stopcopy size; then with REINIT, some stopcopy size is
> essentially moved back to precopy phase?

Almost:
Without REINIT, the device is forced to put this data in precopy
dirty_bytes.
With REINIT, this data can be put in precopy init_bytes (followed by
another round of the switchover-ack dance).

Thanks.

Re: [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature
Posted by Peter Xu 3 days, 12 hours ago
On Thu, May 21, 2026 at 04:53:54PM +0300, Avihai Horon wrote:
> 
> On 5/19/2026 11:09 PM, Peter Xu wrote:
> > On Tue, May 05, 2026 at 11:14:09AM +0300, Avihai Horon wrote:
> > > Performance tests were done by migrating a single VM with:
> > > * 8 GB RAM
> > > * 4 mlx5 VFIO devices:
> > >    - One device with 1GB of device data (stopcopy data) that runs
> > >      workload during precopy so VFIO_PRECOPY_INFO_REINIT is exercised
> > >      (generate new initial_bytes chunks during precopy).
> > Could you elaborate a bit more on what workload is executed, and how that
> > will affect REINIT reportings (e.g. is only one REINIT generated, or it
> > keeps generating)?
> 
> Basically, I create and destroy RDMA resources (MRs, QPs, CQs, etc.) on the
> VFIO device in a loop for several iterations.
> This generates several REINITs.
> 
> > 
> > Can I understand it in this way: without REINIT, device is forced to put
> > those data into stopcopy size; then with REINIT, some stopcopy size is
> > essentially moved back to precopy phase?
> 
> Almost:
> Without REINIT, the device is forced to put this data in precopy
> dirty_bytes.
> With REINIT, this data can be put in precopy init_bytes (and do the
> switchover-ack dance again).

Hmm, then I don't understand why moving some chunk of data from
precopy dirty_bytes to init_bytes helps downtime.

Essentially, QEMU makes the switchover decision based on the math of:

   init+dirty+stop
   --------------- <= downtime_limit
         bw

The possible min of above is:

        stop
   ---------------
         bw

Here, whether some data is in the init or the dirty portion shouldn't
matter for the min downtime, since both portions are allowed to be moved
during the precopy phase.

OTOH, if stop_bytes is unchanged, the min downtime is still the same
before / after supporting REINIT, if we try harder.
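
To make the inequality concrete, here is a minimal sketch of it as
written above. It is not QEMU's actual switchover code; the function
names and the units (bytes, and bytes per millisecond for bw) are
assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Estimated downtime in ms if switchover happened now (integer division). */
static uint64_t estimated_downtime_ms(uint64_t init_bytes, uint64_t dirty_bytes,
                                      uint64_t stop_bytes,
                                      uint64_t bw_bytes_per_ms)
{
    assert(bw_bytes_per_ms > 0);
    return (init_bytes + dirty_bytes + stop_bytes) / bw_bytes_per_ms;
}

/* (init + dirty + stop) / bw <= downtime_limit */
static bool can_switchover(uint64_t init_bytes, uint64_t dirty_bytes,
                           uint64_t stop_bytes, uint64_t bw_bytes_per_ms,
                           uint64_t downtime_limit_ms)
{
    return estimated_downtime_ms(init_bytes, dirty_bytes, stop_bytes,
                                 bw_bytes_per_ms) <= downtime_limit_ms;
}
```

With init and dirty both at zero, the estimate reduces to stop / bw,
which is the minimum mentioned above.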

Say, with the test results below:

With VFIO_PRECOPY_INFO_REINIT:
  1335ms total (~520ms from the VFIO device running the workload).

Without VFIO_PRECOPY_INFO_REINIT:
  2352ms total (~1600ms from the VFIO device running the workload).

What is the downtime_limit you specified for both cases?  Have you tried
specifying a lower downtime_limit, so that both results become even
closer (until they become, statistically, identical)?

In general, I can understand that REINIT keeps migration from converging
too early, but IIUC the same effect can be had just by lowering
downtime_limit.  IOW, I may still be missing some important piece of
info on how this REINIT feature helps downtime.

Thanks,

-- 
Peter Xu
Re: [PATCH 00/14] Make switchover-ack re-usable and add VFIO precopy REINIT feature
Posted by Avihai Horon 20 hours ago
On 5/21/2026 6:20 PM, Peter Xu wrote:
> On Thu, May 21, 2026 at 04:53:54PM +0300, Avihai Horon wrote:
>> On 5/19/2026 11:09 PM, Peter Xu wrote:
>>> On Tue, May 05, 2026 at 11:14:09AM +0300, Avihai Horon wrote:
>>>> Performance tests were done by migrating a single VM with:
>>>> * 8 GB RAM
>>>> * 4 mlx5 VFIO devices:
>>>>     - One device with 1GB of device data (stopcopy data) that runs
>>>>       workload during precopy so VFIO_PRECOPY_INFO_REINIT is exercised
>>>>       (generate new initial_bytes chunks during precopy).
>>> Could you elaborate a bit more on what workload is executed, and how that
>>> will affect REINIT reportings (e.g. is only one REINIT generated, or it
>>> keeps generating)?
>> Basically, I create and destroy RDMA resources (MRs, QPs, CQs, etc.) on the
>> VFIO device in a loop for several iterations.
>> This generates several REINITs.
>>
>>> Can I understand it in this way: without REINIT, device is forced to put
>>> those data into stopcopy size; then with REINIT, some stopcopy size is
>>> essentially moved back to precopy phase?
>> Almost:
>> Without REINIT, the device is forced to put this data in precopy
>> dirty_bytes.
>> With REINIT, this data can be put in precopy init_bytes (and do the
>> switchover-ack dance again).
> Hmm, then I don't understand why moving some chunk of data from
> precopy_bytes to init_bytes helps downtime.
>
> Essentially, QEMU makes the switchover decision based on the math of:
>
>     init+dirty+stop
>     --------------- <= downtime_limit
>           bw
>
> The possible min of above is:
>
>          stop
>     ---------------
>           bw
>
> Here whether some data would be in init or precopy portion shouldn't matter
> for a min downtime, since both portions are allowed to be moved during
> precopy phase.
>
> OTOH, if stop_bytes unchanged, min downtime is still the same before /
> after supporting REINIT, if we try harder.
>
> Say, with below testing results:
>
> With VFIO_PRECOPY_INFO_REINIT:
>    1335ms total (~520ms from the VFIO device running the workload).
>
> Without VFIO_PRECOPY_INFO_REINIT:
>    2352ms total (~1600ms from the VFIO device running the workload).
>
> What is the downtime_limit you specified for both cases?  Have you tried to
> specify lower downtime_limit than what you specified, so that both results
> will become even closer (until they become, statistically, identical)?
>
> In general, I can understand the REINIT will stop converging too early, but
> it'll be the same IIUC just to turn the downtime_limit smaller..  IOW, I
> may still miss some important piece of info that how this REINIT feature
> helps downtime..

The init_bytes are special in the sense that it is crucial to transfer
them before switching over. Otherwise, VFIO precopy may not have its
full effect, which could make VFIO migration slower.
Accordingly, their contribution to downtime may be more than just the
time it takes to transfer them.

Specifically for mlx5, init_bytes hold a small portion of metadata used
for time-consuming pre-allocations on the destination side. So, we may
have 10MB of init_bytes which would take a fraction of a second to
transfer, but once they reach the destination, loading them could take
even a few seconds.

When this data is moved from dirty_bytes to init_bytes along with
switchover-ack, we guarantee that this long pre-allocation doesn't
happen during downtime. This is the time difference you see in the test
results.
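
A rough model of this point, as a sketch only: the downtime cost of such
data is its transfer time plus its load time on the destination, unless
it was transferred and loaded (and acked) before switchover. The
function name, the parameters and any numbers plugged into it are
hypothetical, not measurements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Downtime (ms) attributed to a chunk of init data in this simplified model.
 * load_ms is the time the destination needs to load the chunk, which for
 * init data can be far larger than its transfer time.
 */
static uint64_t init_data_downtime_ms(uint64_t init_bytes,
                                      uint64_t bw_bytes_per_ms,
                                      uint64_t load_ms,
                                      bool acked_before_switchover)
{
    assert(bw_bytes_per_ms > 0);
    if (acked_before_switchover) {
        /* Transferred and loaded while the guest was still running. */
        return 0;
    }
    return init_bytes / bw_bytes_per_ms + load_ms;
}
```

For example, with 10MB of data, an assumed bandwidth of 1,000,000 bytes
per ms and an assumed load time of 3000ms, the model gives 3010ms of
downtime without the ACK and 0ms with it, even though the transfer
itself accounts for only 10ms.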

Thanks.