CI: https://gitlab.com/peterx/qemu/-/pipelines/2469074018
rfc: https://lore.kernel.org/r/20260319231302.123135-1-peterx@redhat.com
v1: https://lore.kernel.org/r/20260408165559.157108-1-peterx@redhat.com
v2:
- Added tags
- Patch 4
  - Fix and rework doc for @save_query_pending [Juraj]
  - Trace "exact" in trace_vfio_state_pending() [Avihai]
  - Avoid mentioning "pre-copy" in vfio.rst doc for query [Avihai]
- Patch 12
  - English errors [Fabiano]
- Patch 13
  - Remove " (bytes)" in HMP line [Fabiano]
- Added patch "qemu-iotests: Add query-migrate test for dirty-bitmap"
  - This covers a bug that I found when testing v1
- Added patch "vfio/migration: Add tracepoints for precopy/stopcopy query
  ioctls" to be able to dump the raw results from the two VFIO ioctls
- Replaced patch "migration: Make qemu_savevm_query_pending() available
  anytime" with patch "migration: Remember total dirty bytes in mig_stats"
  - I fell back to the "cache the total dirty bytes" idea on this one to
    avoid the complication of save_query_pending() being invoked from
    anywhere.
Overview
========
VFIO migration was merged quite a while ago, but we still see things that
are off here and there. This series tries to address some of them, but
only based on my limited understanding.
Two major issues I wanted to resolve:
(1) VFIO reports state_pending_{exact|estimate}() differently
It reports the stopcopy size in exact() only (a size that includes both
precopy and stopcopy data), while in estimate() it reports only precopy
data. This violates the API. It was done that way only to trigger a proper
sync on the VFIO ioctls, but it was just a workaround. This series should
fix it by introducing a stopcopy size reporting facility for vmstate
handlers.
(2) expected_downtime / remaining doesn't take VFIO devices into account
When querying migration, QEMU reports a field called "expected-downtime".
The documentation phrased this almost entirely from a RAM perspective, but
ideally it should be an estimate of the blackout window (in milliseconds)
if we switch over at any point, based on known information.
This did not yet take VFIO into account, especially in the case of VFIO
devices that may contain a large amount of device state (like GPUs).
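To make the intent of (2) concrete, here is a rough sketch of the kind of
estimate I have in mind. It is not the code in this series: the function
name, the parameter names and the split into "RAM dirty bytes" plus
"per-device stopcopy bytes" are assumptions for illustration only.

```python
def expected_downtime_ms(ram_dirty_bytes, device_stopcopy_bytes,
                         switchover_bw_bytes_per_sec):
    """Estimate the blackout window if we switched over right now.

    All names here are hypothetical. The idea is only that everything
    which must still be sent after the VM stops (dirty RAM plus each
    device's stopcopy data) is divided by the bandwidth expected to be
    available during switchover.
    """
    if switchover_bw_bytes_per_sec <= 0:
        raise ValueError("switchover bandwidth must be positive")
    total_bytes = ram_dirty_bytes + sum(device_stopcopy_bytes)
    # bytes / (bytes per second) gives seconds; scale to milliseconds
    return total_bytes * 1000 / switchover_bw_bytes_per_sec
```

With 100 MB of dirty RAM and 1 GB/s of bandwidth this gives 100ms when only
RAM is counted, but 500ms once a device with 400 MB of stopcopy data is
included, which is the kind of gap that a RAM-only estimate hides.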
For problem (2), the use case is that a mgmt app migrating a VFIO GPU
device always needs to adjust the downtime limit for migration to converge,
because when such a device is involved a normal downtime like 300ms will
usually not suffice.
The issue with that is that mgmt doesn't have a good way to know exactly
how well precopy is going for the whole system and the GPU device.
The hope is that a fixed expected_downtime will give the mgmt app a
reasonable hint for the downtime to set up in order to converge a migration.
Meanwhile, with a system-wide "remaining" field introduced, mgmt can query
this result at the beginning of each iteration to know whether a stall is
happening, IOW, whether it's likely that this migration will not converge
at all. When a stall is detected, mgmt can start to consider the
expected_downtime value reported above for converging this migration. See
more on testing below.
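The mgmt-side flow described above could be sketched roughly as follows.
This is only an illustration under assumptions: the helper names, the
three-iteration window, the 5% progress threshold and the 20% headroom are
all made up here, and how the samples are obtained from QEMU is left out.

```python
def is_stalled(remaining_samples, window=3, min_progress_pct=5):
    """Return True if "remaining" has barely shrunk over `window` iterations.

    remaining_samples: system-wide "remaining" bytes, one sample taken at
    the beginning of each iteration, oldest first (hypothetical input).
    """
    if len(remaining_samples) < window + 1:
        return False  # not enough history to judge yet
    old = remaining_samples[-(window + 1)]
    new = remaining_samples[-1]
    # Stalled if we kept more than (100 - min_progress_pct)% of the old value
    return new * 100 > old * (100 - min_progress_pct)


def next_downtime_limit_ms(current_limit_ms, expected_ms, remaining_samples,
                           headroom_pct=20):
    """Raise the downtime limit only once a stall is detected."""
    if not is_stalled(remaining_samples):
        return current_limit_ms
    # Use the reported expected downtime as a hint, with some headroom
    return max(current_limit_ms, expected_ms * (100 + headroom_pct) // 100)
```

For example, with a 300ms limit, samples of [1000, 990, 985, 980] and a
reported expected downtime of 2000ms, the sketch would raise the limit to
2400ms, while a steadily shrinking series leaves the limit untouched.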
Tests
=====
Thanks to Cédric for help testing v2. One thing to mention is that we did
encounter one case where the reported dirty size overflowed uint64_t (in
both expected_downtime and the system-wide remaining data).
Quoting test results from Cédric, migrating a RHEL9 VM with a vGPU
(NVIDIA L4-2B) and an MLX5 VF, from a RHEL9 host (vGPU mdev) to a RHEL10
host (vGPU VF), with the vGPU under load (glxgears):
(qemu) info migrate
Status: active
Time (ms): total=21140, setup=86, exp_down=152455434886355 <---- !?!
Remaining: 16 EiB <---- !?!
RAM info:
Throughput (Mbps): 967.98
Sizes: pagesize=4 KiB, total=4 GiB
Transfers: transferred=2.29 GiB, remain=4.7 MiB
Channels: precopy=1.91 GiB, multifd=0 B, postcopy=0 B, vfio=387 MiB
Page Types: normal=499427, zero=559708
Page Rates (pps): transfer=0, dirty=1892
Others: dirty_syncs=3
It fixed itself after a few more iterations, so it didn't ultimately
affect migration either. Further attempts didn't reproduce it after I
added the tracepoint patch. It would be good to hear from anyone who knows
whether this was a known driver issue.
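As an illustration only (not a diagnosis of the case above), this is how a
"16 EiB" reading can fall out of unsigned 64-bit arithmetic: a subtraction
whose result would be slightly negative wraps around to a value just below
2^64. The helper names and the sample values below are made up.

```python
U64_MASK = (1 << 64) - 1


def u64_sub(a, b):
    """Subtract the way uint64_t does in C: the result wraps modulo 2^64."""
    return (a - b) & U64_MASK


def to_eib(nbytes):
    """Convert bytes to EiB (2^60 bytes)."""
    return nbytes / (1 << 60)


# Hypothetical: a counter momentarily smaller than what is subtracted from it
wrapped = u64_sub(4096, 8192)
print(wrapped, round(to_eib(wrapped), 3))
```

Any small negative difference ends up displayed as roughly 16 EiB, so the
tracepoints should help tell which raw value went wrong if it reproduces.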
For detailed testing steps, please refer to v1's cover letter.
Peter Xu (16):
qemu-iotests: Add query-migrate test for dirty-bitmap
migration: Fix low possibility downtime violation
migration/qapi: Rename MigrationStats to MigrationRAMStats
vfio/migration: Cache stop size in VFIOMigration
migration/treewide: Merge @state_pending_{exact|estimate} APIs
migration: Use the new save_query_pending() API directly
migration: Introduce stopcopy_bytes in save_query_pending()
vfio/migration: Fix incorrect reporting for VFIO pending data
migration: Move iteration counter out of RAM
migration: Introduce a helper to return switchover bw estimate
migration: Calculate expected downtime on demand
migration: Fix calculation of expected_downtime to take VFIO info
migration: Remember total dirty bytes in mig_stats
migration/qapi: Introduce system-wise "remaining" reports
migration/qapi: Update unit for avail-switchover-bandwidth
vfio/migration: Add tracepoints for precopy/stopcopy query ioctls
docs/about/removed-features.rst | 2 +-
docs/devel/migration/main.rst | 9 +-
docs/devel/migration/vfio.rst | 9 +-
qapi/migration.json | 32 ++--
hw/vfio/vfio-migration-internal.h | 8 +
include/migration/register.h | 59 +++---
migration/migration-stats.h | 20 +-
migration/migration.h | 2 +-
migration/savevm.h | 7 +-
hw/s390x/s390-stattrib.c | 9 +-
hw/vfio/migration.c | 123 +++++++-----
migration/block-dirty-bitmap.c | 10 +-
migration/migration-hmp-cmds.c | 5 +
migration/migration.c | 177 +++++++++++++-----
migration/ram.c | 40 +---
migration/savevm.c | 42 ++---
hw/vfio/trace-events | 5 +-
migration/trace-events | 3 +-
.../tests/migrate-bitmaps-postcopy-test | 6 +
19 files changed, 322 insertions(+), 246 deletions(-)
--
2.53.0