[PATCH RFC 01/12] migration: Fix low possibility downtime violation

Peter Xu posted 12 patches 2 days, 15 hours ago
Maintainers: Pierrick Bouvier <pierrick.bouvier@linaro.org>, Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>, Alex Williamson <alex@shazbot.org>, "Cédric Le Goater" <clg@redhat.com>, Halil Pasic <pasic@linux.ibm.com>, Christian Borntraeger <borntraeger@linux.ibm.com>, Jason Herne <jjherne@linux.ibm.com>, Richard Henderson <richard.henderson@linaro.org>, Ilya Leoshkevich <iii@linux.ibm.com>, David Hildenbrand <david@kernel.org>, Eric Farman <farman@linux.ibm.com>, Matthew Rosato <mjrosato@linux.ibm.com>, Cornelia Huck <cohuck@redhat.com>, Eric Blake <eblake@redhat.com>, Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>, John Snow <jsnow@redhat.com>, Markus Armbruster <armbru@redhat.com>
[PATCH RFC 01/12] migration: Fix low possibility downtime violation
Posted by Peter Xu 2 days, 15 hours ago
When QEMU queried the estimated version of pending data and thinks it's
ready to converge, it'll send another accurate query to make sure of it.
It is needed to make sure we collect the latest reports and that equation
still holds true.

However we missed one tiny little difference here on "<" v.s. "<=" when
comparing pending_size (A) to threshold_size (B)..

QEMU src only re-query if A<B, but will kickoff switchover if A<=B.

I think it means it is possible to happen if A (as an estimate only so far)
accidentally equals to B, then re-query won't happen and switchover will
proceed without considering new dirtied data.

It turns out it was an accident in my commit 7aaa1fc072 when refactoring
the code around.  Fix this by using the same equation in both places.

Fixes: 7aaa1fc072 ("migration: Rewrite the migration complete detect logic")
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 migration/migration.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/migration.c b/migration/migration.c
index 5c9aaa6e58..dfc60372cf 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3242,7 +3242,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
          * postcopy started, so ESTIMATE should always match with EXACT
          * during postcopy phase.
          */
-        if (pending_size < s->threshold_size) {
+        if (pending_size <= s->threshold_size) {
             qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
             pending_size = must_precopy + can_postcopy;
             trace_migrate_pending_exact(pending_size, must_precopy,
-- 
2.50.1
Re: [PATCH RFC 01/12] migration: Fix low possibility downtime violation
Posted by Prasad Pandit 2 days, 1 hour ago
On Fri, 20 Mar 2026 at 04:46, Peter Xu <peterx@redhat.com> wrote:
> When QEMU queried the estimated version of pending data and thinks it's
> ready to converge, it'll send another accurate query to make sure of it.
> It is needed to make sure we collect the latest reports and that equation
> still holds true.
>
> However we missed one tiny little difference here on "<" v.s. "<=" when
> comparing pending_size (A) to threshold_size (B)..
>
> QEMU src only re-query if A<B, but will kickoff switchover if A<=B.
>
> I think it means it is possible to happen if A (as an estimate only so far)
> accidentally equals to B, then re-query won't happen and switchover will
> proceed without considering new dirtied data.
>
> It turns out it was an accident in my commit 7aaa1fc072 when refactoring
> the code around.  Fix this by using the same equation in both places.
>
> Fixes: 7aaa1fc072 ("migration: Rewrite the migration complete detect logic")
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 5c9aaa6e58..dfc60372cf 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3242,7 +3242,7 @@ static MigIterateState migration_iteration_run(MigrationState *s)
>           * postcopy started, so ESTIMATE should always match with EXACT
>           * during postcopy phase.
>           */
> -        if (pending_size < s->threshold_size) {
> +        if (pending_size <= s->threshold_size) {
>              qemu_savevm_state_pending_exact(&must_precopy, &can_postcopy);
>              pending_size = must_precopy + can_postcopy;
>              trace_migrate_pending_exact(pending_size, must_precopy,

* What is the 'size' difference between < s->threshold_size  Vs  <=
s->threshold_size?  Going through the source IIUC
1) 'pending_size' is measured in Bytes.
     static void ram_state_pending_exact/_estimate()
         remaining_size = rs->migration_dirty_pages *
TARGET_PAGE_SIZE(=4096 bytes);
         100 dirty pages * 4096bytes  => 409600 dirty bytes => 409600
* 8 => 3,276,800 dirty bits

2) 's->threshold_size' is derived from bandwidth (100M bits/s) and
downtime(=300 ms)
        100,000,000 bits/s => 100,000 bits/ms
        100,000 bits/ms * 300ms => 30,000,000 bits in 300 ms
        30,000,000 bits / 8  =>  3,750,000 Bytes / 300 ms
        s->threshold_size = 30,000,000 bits (= 3.75MBytes) can be
transferred in 300ms downtime.

* Are we comparing pending_size(=409600 bytes)  <=
s->threshold_size(=30,000,000 bits)?

*  static void migration_update_counters()
        transferred = current_bytes - s->iteration_initial_bytes;
        bandwidth = (double)transferred / time_spent
        if (switchover_bw) {
            expected_bw_per_ms = (double)switchover_bw / 1000;
        } else {
            expected_bw_per_ms = bandwidth;
        }
=> ^^^^^^^  Should we divide 'bandwidth' by 1000 here (for bw_per_ms) ?

      s->threshold_size = expected_bw_per_ms * migrate_downtime_limit();

migration_iteration_run():
   /* Should we switch to postcopy now? */
   if (must_precopy <= s->threshold_size &&
      can_switchover && qatomic_read(&s->start_postcopy)) {
      if (postcopy_start(s, &local_err)) {
          migrate_error_propagate(s, error_copy(local_err));
          error_report_err(local_err);
      }
      return MIG_ITERATE_SKIP;
   }
* Here we should check pending_size <= s->threshold_size,  because
must_precopy is zero(0) when postcopy is enabled. And we switch to
postcopy mode even when pending_size > s->threshold_size.
  I wonder if we really need both 'must_precopy' and 'can_postcopy'
variables, they seem to complicate things.
===
# virsh migrate --verbose --live --auto-converge --postcopy
--postcopy-after-precopy f42vm
qemu+ssh://destination-machine.com/system
# less /var/log/libvirt/qemu/f42vm.log
...
migration_iteration_run: estimated pending_size: 50577408 bytes,
s->threshold_size: 36282361
migration_iteration_run: estimated pending_size: 43757568 bytes,
s->threshold_size: 36282361
migration_iteration_run: estimated pending_size: 36413440 bytes,
s->threshold_size: 34334680
migration_iteration_run: estimated pending_size: 29069312 bytes,
s->threshold_size: 34334680

migration_iteration_run: exact pending_size: 4339167232 bytes, 0,
4339167232              <== exact size is calculated once.
migration_iteration_run: estimated pending_size: 4332871680 bytes,
s->threshold_size: 35651363
migration_iteration_run: switching to postcopy: 4332871680, 0,
4332871680                    <== switch to postcopy with
must_precopy(=0) <= s->threshold_size

migration_iteration_run: estimated pending_size: 4332892160 bytes,
s->threshold_size: 35651363
migration_iteration_run: estimated pending_size: 4323188736 bytes,
s->threshold_size: 27243109
migration_iteration_run: estimated pending_size: 4315320320 bytes,
s->threshold_size: 27243109
migration_iteration_run: estimated pending_size: 4308221952 bytes,
s->threshold_size: 37695433
===
* Here, the exact pending_size is calculated only once, because we
switch to Postcopy mode even when pending_size is > s->threshold_size.

Thank you.
---
  - Prasad