CI: https://gitlab.com/peterx/qemu/-/pipelines/2437886506
rfc: https://lore.kernel.org/r/20260319231302.123135-1-peterx@redhat.com
This is v1 of this series. I dropped the RFC tag because I feel I have
collected enough feedback on the previous version about what was uncertain;
meanwhile I also managed to borrow a system with an NVIDIA RTX6000 2GB vGPU
and tested the series on it.
Too many trivial things have changed since RFC->v1 to list them all, so let
me only mention the major changes:
- This version assumes that both VFIO ioctls (reporting either precopy or
stopcopy size) may report anything (say, garbage), and that it shouldn't
crash QEMU. It will affect what gets reported as downtime or remaining
data, but that's best effort, so it's expected. With that in mind, I
dropped patch 3 as Avihai suggested. IOW, I expect no concern about either
overflow/underflow or atomicity when reading these values from the VFIO
drivers.
- The cached stopcopy_bytes for VFIO now always reflects the total size
(including precopy sizes).
- Introduced a new patch to report "system-wide" remaining data, which
starts to include VFIO remaining device data. We can't squash that
directly into the "ram" section of the query-migrate QMP result, so I
introduced a new "remaining" field in the query-migrate result for it.
- One more patch, "migration: Make qemu_savevm_query_pending() available
anytime", tries to fix a very hard-to-hit race condition I found when
testing against the virtio-net-failover tests. I can only hit it when
running tens of concurrent tests, but it is needed to fix a crash.
Otherwise the major things are kept almost as-is. I should also have
addressed all comments I received on the RFC version. Please shout if I
missed something.
Overview
========
VFIO migration was merged quite a while ago, but we still see things off
here and there. This series tries to address some of them, based only on
my limited understanding.
Two major issues I wanted to resolve:
(1) VFIO reports state_pending_{exact|estimate}() differently
In exact() it reports the stop-copy size (which includes both precopy and
stopcopy data), while in estimate() it reports precopy data only. This
violates the API. It was done like that only to trigger a proper sync on
the VFIO ioctls, but that was just a workaround. This series should fix it
by introducing a stopcopy size reporting facility for vmstate handlers.
(2) expected_downtime / remaining don't take VFIO devices into account
When querying migration, QEMU reports a field called "expected-downtime".
The documentation phrases it almost entirely from the RAM perspective, but
ideally it should be an estimate of the blackout window (in milliseconds)
if we were to switch over at any time, based on known information.
This didn't yet take VFIO into account, especially in the case of VFIO
devices that may carry a large amount of device state (like GPUs).
For problem (2), the use case is that a mgmt app migrating a VFIO GPU
device always needs to adjust the downtime for the migration to converge,
because when such a device is involved a normal downtime like 300ms will
usually not suffice.
The issue is that the mgmt app doesn't have a good way to know exactly how
well precopy is going for the whole system including the GPU device.
The hope is that the fixed expected_downtime will give the mgmt app a
reasonable hint for the downtime value to set so a migration converges.
Meanwhile, with a system-wide "remaining" field introduced, the mgmt app
can query this result at the beginning of each iteration to know whether a
stall is happening, IOW, whether this migration is likely to never
converge. When that is detected, the mgmt app can start to consider the
expected_downtime value reported above for converging the migration. See
more on testing below.
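To illustrate, the mgmt-side polling flow described above could look
roughly like the sketch below. query_migrate() and set_downtime_limit()
are hypothetical wrappers around the corresponding QMP commands, and the
stall heuristic (N consecutive non-shrinking iterations) is my own
illustration, not part of this series:

```python
def tune_downtime(query_migrate, set_downtime_limit, stall_iterations=3):
    """Raise downtime-limit once the system-wide 'remaining' stops shrinking.

    query_migrate() is assumed to return a dict resembling a query-migrate
    result, including the new "remaining" field from this series.
    """
    prev = None
    stalls = 0
    while True:
        info = query_migrate()
        if info["status"] == "completed":
            return
        remaining = info.get("remaining", 0)   # new system-wide field
        stalls = stalls + 1 if prev is not None and remaining >= prev else 0
        prev = remaining
        if stalls >= stall_iterations:
            # Likely stalled: adopt the reported hint, with some margin.
            set_downtime_limit(int(info["expected-downtime"] * 1.1))
            return
```

The margin factor is arbitrary; the point is only that expected-downtime
becomes a usable lower bound once it accounts for VFIO device state.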
Tests
=====
Tested this series with an assigned VFIO device, a GRID RTX6000-2B with
2GB of FB memory.
The test covers both correct reporting of system-wide remaining data
(which used to cover RAM only) and the expected downtime. I verified that,
using the reported expected downtime, I can make a VFIO migration converge
immediately. The test process is as below:
Start the VM and kick off migration until it spins at the end, not
converging with the default 300ms downtime. That's common for a 2GB vGPU
device due to both the huge stop size reported and the dramatically small
mbps reported.
As a start, update avail-switchover-bandwidth (I chose 1GB over a real
10Gbps port): this will stabilize the bandwidth estimate.
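As a quick cross-check (my own arithmetic, not from the series, and
assuming the parameter is interpreted in bytes per second): 1GB/s sits
safely below the ~1.25GB/s line rate of a 10Gbps port, which is what makes
it a conservative, stable estimate:

```python
# Line-rate arithmetic for picking avail-switchover-bandwidth on a
# 10Gbps NIC; the 1GB/s value is the one used in this test.
line_rate_bytes = 10 * 10**9 // 8    # 10 Gbps -> 1.25e9 B/s
chosen_bw = 1 * 10**9                # 1GB/s, as set above
assert chosen_bw < line_rate_bytes   # conservative: below line rate
print(line_rate_bytes)               # 1250000000
```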
Libvirt's domjobinfo won't be able to see the real remaining data because
libvirt doesn't yet support the new "remaining" field; however, we can
still see that expected_downtime is now reported correctly (instead of
reporting zero, as before this series):
Data remaining: 0.000 B
Memory remaining: 0.000 B
Expected downtime: 1910 ms
If we peek through the QEMU monitor, we'll see that with this change the
system-wide remaining data is 1.9GB (even though RAM keeps reporting 0),
and the expected downtime stays the same 1.9 seconds that domjobinfo
reports:
Status: active
Time (ms): total=336919, setup=10, exp_down=1910
Remaining (bytes): 1.91 GiB
RAM info:
Throughput (Mbps): 460.09
Sizes: pagesize=4 KiB, total=32 GiB
Transfers: transferred=12.7 GiB, remain=0 B
Channels: precopy=12.7 GiB, multifd=0 B, postcopy=0 B, vfio=0 B
Page Types: normal=3306906, zero=7745576
Page Rates (pps): transfer=14010, dirty=8039
Others: dirty_syncs=247045
It means at least 1.91 seconds of downtime is required, per the math.
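The arithmetic behind that number, assuming the 1GB switchover bandwidth
set above is treated as 1 GiB/s (the formula is simply remaining data
divided by switchover bandwidth):

```python
GiB = 1024 ** 3
remaining = 1.91 * GiB      # system-wide remaining reported above
switchover_bw = 1 * GiB     # assumed avail-switchover-bandwidth, bytes/s
downtime_ms = remaining / switchover_bw * 1000
print(round(downtime_ms))   # -> 1910, matching exp_down above
```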
We can try to set something lower than that; the migration will not
converge:
...
...
Then if we update downtime_limit to be slightly larger than the expected
downtime:
Migration will complete almost immediately.
Peter Xu (14):
migration: Fix low possibility downtime violation
migration/qapi: Rename MigrationStats to MigrationRAMStats
vfio/migration: Cache stop size in VFIOMigration
migration/treewide: Merge @state_pending_{exact|estimate} APIs
migration: Use the new save_query_pending() API directly
migration: Introduce stopcopy_bytes in save_query_pending()
vfio/migration: Fix incorrect reporting for VFIO pending data
migration: Make qemu_savevm_query_pending() available anytime
migration: Move iteration counter out of RAM
migration: Introduce a helper to return switchover bw estimate
migration: Calculate expected downtime on demand
migration: Fix calculation of expected_downtime to take VFIO info
migration/qapi: Introduce system-wise "remaining" reports
migration/qapi: Update unit for avail-switchover-bandwidth
docs/about/removed-features.rst | 2 +-
docs/devel/migration/main.rst | 9 +-
docs/devel/migration/vfio.rst | 9 +-
qapi/migration.json | 32 +++---
hw/vfio/vfio-migration-internal.h | 8 ++
include/migration/register.h | 59 ++++------
migration/migration-stats.h | 13 ++-
migration/migration.h | 10 +-
migration/savevm.h | 9 +-
hw/s390x/s390-stattrib.c | 9 +-
hw/vfio/migration.c | 92 +++++++++-------
migration/block-dirty-bitmap.c | 10 +-
migration/migration-hmp-cmds.c | 5 +
migration/migration.c | 172 +++++++++++++++++++++---------
migration/ram.c | 40 ++-----
migration/savevm.c | 73 +++++++------
hw/vfio/trace-events | 3 +-
migration/trace-events | 3 +-
18 files changed, 313 insertions(+), 245 deletions(-)
--
2.53.0