CI: https://gitlab.com/peterx/qemu/-/pipelines/2437886506
rfc: https://lore.kernel.org/r/20260319231302.123135-1-peterx@redhat.com
This is v1 of this series. I dropped the RFC tag because I feel I have
collected enough feedback on the previous version about what was uncertain;
meanwhile I also managed to borrow a system with an NVIDIA RTX6000 2GB vGPU
and tested the series on it.
Too many trivial things have changed since RFC->v1 to list them all, so let
me only mention the major changes:
- This version assumes that both VFIO ioctls (reporting either precopy or
stopcopy size) may report anything (say, garbage), and that it shouldn't
crash QEMU. It will affect what gets reported as downtime or remaining
data, but that's best effort, so it's expected. With that in mind, I
dropped patch 3 as Avihai suggested. IOW, I expect no concern about either
overflow/underflow or atomicity when reading these values from the VFIO
drivers.
- The cached stopcopy_bytes for VFIO now always reflects the total size
(including precopy sizes).
- Introduced a new patch to report "system-wide" remaining data, which
starts to include VFIO remaining device data. We can't squash that
directly into the "ram" section of the query-migrate QMP result, so I
introduced a new "remaining" field in the query-migrate result for it.
- One more patch, "migration: Make qemu_savevm_query_pending() available
anytime", tries to fix a very hard-to-hit race condition I found when
testing against the virtio-net-failover tests. I can only hit it when
running tens of concurrent tests, but it is needed to fix a crash.
Otherwise the major things are kept almost as-is. I should also have
addressed all comments I received on the RFC version. Please shout if I
missed something.
Overview
========
VFIO migration was merged quite a while ago, but we still see things off
here and there. This series tries to address some of them, based only on
my limited understanding.
Two major issues I wanted to resolve:
(1) VFIO reports state_pending_{exact|estimate}() differently
In exact() it reports the stop-copy size (which includes both precopy and
stopcopy data), while in estimate() it reports precopy data only. This
violates the API. It was done like that only to trigger a proper sync on
the VFIO ioctls, but that was just a workaround. This series should fix it
by introducing a stopcopy size reporting facility for vmstate handlers.
(2) expected_downtime / remaining don't take VFIO devices into account
When querying migration, QEMU reports a field called "expected-downtime".
The documentation phrases it almost entirely from the RAM perspective, but
ideally it should be an estimate of the blackout window (in milliseconds)
if we were to switch over at any time, based on known information.
This didn't yet take VFIO into account, especially in the case of VFIO
devices that may carry a large amount of device state (like GPUs).
For problem (2), the use case is that a mgmt app migrating a VFIO GPU
device always needs to adjust the downtime for the migration to converge,
because when such a device is involved a normal downtime like 300ms will
usually not suffice.
The issue is that the mgmt app doesn't have a good way to know exactly how
well precopy is going for the whole system including the GPU device.
The hope is that the fixed expected_downtime will give the mgmt app a
reasonable hint for the downtime value to set so a migration converges.
Meanwhile, with a system-wide "remaining" field introduced, the mgmt app
can query this result at the beginning of each iteration to know whether a
stall is happening, IOW, whether this migration is likely to never
converge. When that is detected, the mgmt app can start to consider the
expected_downtime value reported above for converging the migration. See
more on testing below.
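To illustrate, the mgmt-side polling flow described above could look
roughly like the sketch below. query_migrate() and set_downtime_limit()
are hypothetical wrappers around the corresponding QMP commands, and the
stall heuristic (N consecutive non-shrinking iterations) is my own
illustration, not part of this series:

```python
def tune_downtime(query_migrate, set_downtime_limit, stall_iterations=3):
    """Raise downtime-limit once the system-wide 'remaining' stops shrinking.

    query_migrate() is assumed to return a dict resembling a query-migrate
    result, including the new "remaining" field from this series.
    """
    prev = None
    stalls = 0
    while True:
        info = query_migrate()
        if info["status"] == "completed":
            return
        remaining = info.get("remaining", 0)   # new system-wide field
        stalls = stalls + 1 if prev is not None and remaining >= prev else 0
        prev = remaining
        if stalls >= stall_iterations:
            # Likely stalled: adopt the reported hint, with some margin.
            set_downtime_limit(int(info["expected-downtime"] * 1.1))
            return
```

The margin factor is arbitrary; the point is only that expected-downtime
becomes a usable lower bound once it accounts for VFIO device state.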
Tests
=====
Tested this series with an assigned VFIO device, a GRID RTX6000-2B with
2GB of FB memory.
The test covers both correct reporting of system-wide remaining data
(which used to cover RAM only) and the expected downtime. I verified that,
using the reported expected downtime, I can make a VFIO migration converge
immediately. The test process is as below:
Start the VM and kick off migration until it spins at the end, not
converging with the default 300ms downtime. That's common for a 2GB vGPU
device due to both the huge stop size reported and the dramatically small
mbps reported.
As a start, update avail-switchover-bandwidth (I chose 1GB over a real
10Gbps port): this will stabilize the bandwidth estimate.
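As a quick cross-check (my own arithmetic, not from the series, and
assuming the parameter is interpreted in bytes per second): 1GB/s sits
safely below the ~1.25GB/s line rate of a 10Gbps port, which is what makes
it a conservative, stable estimate:

```python
# Line-rate arithmetic for picking avail-switchover-bandwidth on a
# 10Gbps NIC; the 1GB/s value is the one used in this test.
line_rate_bytes = 10 * 10**9 // 8    # 10 Gbps -> 1.25e9 B/s
chosen_bw = 1 * 10**9                # 1GB/s, as set above
assert chosen_bw < line_rate_bytes   # conservative: below line rate
print(line_rate_bytes)               # 1250000000
```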
Libvirt's domjobinfo won't be able to see the real remaining data because
libvirt doesn't yet support the new "remaining" field; however, we can
still see that expected_downtime is now reported correctly (instead of
reporting zero, as before this series):
Data remaining: 0.000 B
Memory remaining: 0.000 B
Expected downtime: 1910 ms
If we peek through the QEMU monitor, we'll see that with this change the
system-wide remaining data is 1.9GB (even though RAM keeps reporting 0),
and the expected downtime stays the same 1.9 seconds that domjobinfo
reports:
Status: active
Time (ms): total=336919, setup=10, exp_down=1910
Remaining (bytes): 1.91 GiB
RAM info:
Throughput (Mbps): 460.09
Sizes: pagesize=4 KiB, total=32 GiB
Transfers: transferred=12.7 GiB, remain=0 B
Channels: precopy=12.7 GiB, multifd=0 B, postcopy=0 B, vfio=0 B
Page Types: normal=3306906, zero=7745576
Page Rates (pps): transfer=14010, dirty=8039
Others: dirty_syncs=247045
It means at least 1.91 seconds of downtime is required, per the math.
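The arithmetic behind that number, assuming the 1GB switchover bandwidth
set above is treated as 1 GiB/s (the formula is simply remaining data
divided by switchover bandwidth):

```python
GiB = 1024 ** 3
remaining = 1.91 * GiB      # system-wide remaining reported above
switchover_bw = 1 * GiB     # assumed avail-switchover-bandwidth, bytes/s
downtime_ms = remaining / switchover_bw * 1000
print(round(downtime_ms))   # -> 1910, matching exp_down above
```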
We can try to set something lower than that; the migration will not
converge:
...
...
Then if we update downtime_limit to be slightly larger than the expected
downtime:
Migration will complete almost immediately.
Peter Xu (14):
migration: Fix low possibility downtime violation
migration/qapi: Rename MigrationStats to MigrationRAMStats
vfio/migration: Cache stop size in VFIOMigration
migration/treewide: Merge @state_pending_{exact|estimate} APIs
migration: Use the new save_query_pending() API directly
migration: Introduce stopcopy_bytes in save_query_pending()
vfio/migration: Fix incorrect reporting for VFIO pending data
migration: Make qemu_savevm_query_pending() available anytime
migration: Move iteration counter out of RAM
migration: Introduce a helper to return switchover bw estimate
migration: Calculate expected downtime on demand
migration: Fix calculation of expected_downtime to take VFIO info
migration/qapi: Introduce system-wise "remaining" reports
migration/qapi: Update unit for avail-switchover-bandwidth
docs/about/removed-features.rst | 2 +-
docs/devel/migration/main.rst | 9 +-
docs/devel/migration/vfio.rst | 9 +-
qapi/migration.json | 32 +++---
hw/vfio/vfio-migration-internal.h | 8 ++
include/migration/register.h | 59 ++++------
migration/migration-stats.h | 13 ++-
migration/migration.h | 10 +-
migration/savevm.h | 9 +-
hw/s390x/s390-stattrib.c | 9 +-
hw/vfio/migration.c | 92 +++++++++-------
migration/block-dirty-bitmap.c | 10 +-
migration/migration-hmp-cmds.c | 5 +
migration/migration.c | 172 +++++++++++++++++++++---------
migration/ram.c | 40 ++-----
migration/savevm.c | 73 +++++++------
hw/vfio/trace-events | 3 +-
migration/trace-events | 3 +-
18 files changed, 313 insertions(+), 245 deletions(-)
--
2.53.0