migration: Synchronize CPUs sooner in postcopy switchover phase

[RFC PATCH] migration: Synchronize CPUs sooner in postcopy switchover phase

Posted by Juraj Marcin 1 month, 1 week ago

From: Juraj Marcin <jmarcin@redhat.com>

Previously, the post init CPU synchronization with the accelerator, like
KVM, was performed in the bottom half of the POSTCOPY_RUN command
handler. However, this causes several problems.

The first issue is that if CPU synchronization fails, the destination QEMU
crashes. However, it is too late to recover the source side as the
response to the special PING has already been sent, and both sides are
already in the POSTCOPY_ACTIVE state.

By moving synchronization before responding, if the machine crashes, the
response is never sent and the source side can resume from the
POSTCOPY_DEVICE state.

The second issue occurs when migration is paused due to a network
failure or user command right after transitioning to the POSTCOPY_ACTIVE
state and the CPU synchronization causes a page fault. This page fault
blocks the CPU and the main QEMU threads and cannot be resolved until
postcopy migration is recovered. However, as libvirt also tries to
execute 'cont' QMP command at this time (destination side transitions
from POSTCOPY_DEVICE to POSTCOPY_ACTIVE), it will also halt waiting for
the response from the blocked main QEMU thread, unaware of the fact that
the migration is paused and needs to be recovered. Thus, it will wait
indefinitely and never report postcopy migration error.

When the CPU synchronization happens sooner, and the network fails
during it, the source side can transition from POSTCOPY_DEVICE to FAILED
state and resume safely. If migration is paused later, the main thread
won't be blocked by CPU synchronization and can respond to libvirt.

Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
I am posting this as RFC to discuss the point at which the CPU
synchronization should happen.

For the POSTCOPY_DEVICE state to be effective, this synchronization must
happen before the destination machine responds to the special PING
command. This leaves us with 2 options:

1) In the PING command handler before responding to the specific
   request, as proposed in this patch.

2) After loading CPU VMSD, for example in the post_load hook.

The first solution limits the number of places where synchronization
needs to happen, however, having it in the PING command handler feels
somewhat hacky.

I have also tested the second solution, and while it seems natural to
synchronize the CPU after its data is loaded in the post_load, there are
multiple CPU types and VMSDs and each one would need to call
synchronization in its post_load hook. This would basically revert Jan
Kiszka's commit [1] which refactored CPU synchronization and united it
to one place.

A middle-ground idea I have is to add a special flag to VMSDs
that contain CPU data. After loading such VMSD the loadvm core would
then run CPU synchronization for that specific CPU. This solution would
bring the synchronization close to the actual load from the migration
stream but also keep it in one place.
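
A minimal sketch of the flag idea, only to make the control flow concrete.
Everything here is a hypothetical stand-in and not QEMU code: ToyCPU,
ToyVMSD, the has_cpu_state flag, toy_vmstate_load() and
toy_cpu_synchronize_post_init() are invented for illustration. The point
is that the load core stays the single call site for the synchronization,
while the call itself moves next to the load of the flagged section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are real QEMU types. */
typedef struct ToyCPU {
    int index;
    int sync_count;          /* how many times the post-init sync ran */
} ToyCPU;

typedef struct ToyVMSD {
    const char *name;
    bool has_cpu_state;      /* hypothetical flag: section carries CPU data */
    int (*post_load)(void *opaque);
} ToyVMSD;

/* Stand-in for a per-CPU post-init sync; returns 0 on success. */
static int toy_cpu_synchronize_post_init(ToyCPU *cpu)
{
    cpu->sync_count++;
    return 0;
}

/*
 * Stand-in for the load core. Flagged sections trigger the sync right
 * after their own post_load hook, so no device code has to call it.
 */
static int toy_vmstate_load(const ToyVMSD *vmsd, void *opaque)
{
    assert(vmsd != NULL);

    if (vmsd->post_load) {
        int ret = vmsd->post_load(opaque);
        if (ret) {
            return ret;
        }
    }
    if (vmsd->has_cpu_state) {
        /* Assumption: opaque of a flagged section is the CPU itself. */
        return toy_cpu_synchronize_post_init((ToyCPU *)opaque);
    }
    return 0;
}
```

Note that in this sketch a CPU with several flagged sections would be
synchronized once per section, so a real implementation would have to
decide which section is the last one.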

[1]: ea375f9ab8c7 ("KVM: Rework VCPU state writeback API")
---
 migration/migration.c |  1 +
 migration/migration.h |  7 +++++++
 migration/savevm.c    | 16 +++++++++++++---
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 5c9aaa6e58..9753f7dc26 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -320,6 +320,7 @@ void migration_object_init(void)
     current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
 
     current_incoming->exit_on_error = INMIGRATE_DEFAULT_EXIT_ON_ERROR;
+    current_incoming->postcopy_device_cpu_synchronized = false;
 
     migration_object_check(current_migration, &error_fatal);
 
diff --git a/migration/migration.h b/migration/migration.h
index b6888daced..799a686a0b 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -210,6 +210,13 @@ struct MigrationIncomingState {
      */
     QemuSemaphore postcopy_pause_sem_fast_load;
 
+    /*
+     * CPUs have been synchronized during POSTCOPY_DEVICE state before
+     * responding to a special PING to source. This means, synchronization is
+     * not required later during loadvm_postcopy_handle_run_bh().
+     */
+    bool postcopy_device_cpu_synchronized;
+
     /* List of listening socket addresses  */
     SocketAddressList *socket_address_list;
 
diff --git a/migration/savevm.c b/migration/savevm.c
index dd58f2a705..ad00c85887 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2143,9 +2143,10 @@ static void loadvm_postcopy_handle_run_bh(void *opaque)
     /* TODO we should move all of this lot into postcopy_ram.c or a shared code
      * in migration.c
      */
-    cpu_synchronize_all_post_init();
-
-    trace_vmstate_downtime_checkpoint("dst-postcopy-bh-cpu-synced");
+    if (!mis->postcopy_device_cpu_synchronized) {
+        cpu_synchronize_all_post_init();
+        trace_vmstate_downtime_checkpoint("dst-postcopy-bh-cpu-synced");
+    }
 
     qemu_announce_self(&mis->announce_timer, migrate_announce_params());
 
@@ -2510,6 +2511,15 @@ static int loadvm_process_command(QEMUFile *f, Error **errp)
                        tmp32);
             return -1;
         }
+        if (tmp32 == QEMU_VM_PING_PACKAGED_LOADED) {
+            /*
+             * Try synchronizing CPU before responding. If it fails, QEMU exits
+             * and source side can resume.
+             */
+            cpu_synchronize_all_post_init();
+            mis->postcopy_device_cpu_synchronized = true;
+            trace_vmstate_downtime_checkpoint("dst-postcopy-bh-cpu-synced");
+        }
         migrate_send_rp_pong(mis, tmp32);
         return 0;
 
-- 
2.53.0

Re: [RFC PATCH] migration: Synchronize CPUs sooner in postcopy switchover phase

Posted by Peter Xu 1 month ago

On Thu, Apr 23, 2026 at 05:45:20PM +0200, Juraj Marcin wrote:
> From: Juraj Marcin <jmarcin@redhat.com>
> 
> Previously, the post init CPU synchronization with the accelerator, like
> KVM, was performed in the bottom half of the POSTCOPY_RUN command
> handler. However, this causes several problems.
> 
> The first issue is that if CPU synchronization fails, the destination QEMU
> crashes. However, it is too late to recover the source side as the
> response to the special PING has already been sent, and both sides are
> already in the POSTCOPY_ACTIVE state.
> 
> By moving synchronization before responding, if the machine crashes, the
> response is never sent and the source side can resume from the
> POSTCOPY_DEVICE state.
> 
> The second issue occurs when migration is paused due to a network
> failure or user command right after transitioning to the POSTCOPY_ACTIVE
> state and the CPU synchronization causes a page fault. This page fault
> blocks the CPU and the main QEMU threads and cannot be resolved until
> postcopy migration is recovered. However, as libvirt also tries to
> execute 'cont' QMP command at this time (destination side transitions
> from POSTCOPY_DEVICE to POSTCOPY_ACTIVE), it will also halt waiting for
> the response from the blocked main QEMU thread, unaware of the fact that
> the migration is paused and needs to be recovered. Thus, it will wait
> indefinitely and never report postcopy migration error.
> 
> When the CPU synchronization happens sooner, and the network fails
> during it, the source side can transition from POSTCOPY_DEVICE to FAILED
> state and resume safely. If migration is paused later, the main thread
> won't be blocked by CPU synchronization and can respond to libvirt.
> 
> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> ---
> I am posting this as RFC to discuss the point at which the CPU
> synchronization should happen.
> 
> For the POSTCOPY_DEVICE state to be effective, this synchronization must
> happen before the destination machine responds to the special PING
> command. This leaves us with 2 options:
> 
> 1) In the PING command handler before responding to the specific
>    request, as proposed in this patch.
> 
> 2) After loading CPU VMSD, for example in the post_load hook.
> 
> The first solution limits the number of places where synchronization
> needs to happen, however, having it in the PING command handler feels
> somewhat hacky.

Yes.

> 
> I have also tested the second solution, and while it seems natural to
> synchronize the CPU after its data is loaded in the post_load, there are
> multiple CPU types and VMSDs and each one would need to call
> synchronization in its post_load hook. This would basically revert Jan
> Kiszka's commit [1] which refactored CPU synchronization and united it
> to one place.

What I was thinking is not reverting all of it, but only removing the
migration relevant paths for post_init().

For example, we have different reasons to sync CPU, and we have two
directions to do that (from/to kernel, in KVM's context). For migration,
what used to be confusing is why we need to have CPU specific post_init()
hooks when each CPU has VMSD and post_load() on its own.

The other one is on the savevm side (cpu_synchronize_all_states()) and we can
at least keep it as-is for now, to simplify the discussion.

So if we want to move this into any of such post_load(), it means removal
of three call sites of post_init() only in migration/savevm.c, then do
per-CPU's post_init() in its post_load() hook.

I don't know the real answer to this one; I recall you tried it out but hit
some ARM-specific issue.  I just want to make sure we're on the same page
on what you have experimented with.

That said, I do see some complexity in such a change already.  For
example, cpu_vmstate_register() seems to be able to register more than one
VMSD for each vCPU.  I don't know if it means that, at least, we can't simply
put it into post_load() of vmstate_cpu_common, as when reaching there it is
not guaranteed that all vCPUs' registers are up to date.

Besides..

I do have another thought, though, which avoids the hacky part of this patch
and looks pretty safe: the dest QEMU does not reply directly to the
QEMU_VM_PING_PACKAGED_LOADED ping; instead it sets a flag. Then we move the
conditional PONG into loadvm_postcopy_handle_run().

With that, we can move post_init() into loadvm_postcopy_handle_run()
altogether.

  loadvm_postcopy_handle_run():
    ...
    cpu_synchronize_all_post_init();
    if (package_loaded_ping_received) {
        send_pong();
    }
    migrate_set_state(POSTCOPY_DEVICE, POSTCOPY_ACTIVE);
    migration_bh_schedule(mis);

Would this work?
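
To make the ordering above concrete, here is a small self-contained model
of the deferred PONG. It is a sketch under assumptions, not QEMU code:
ToyIncoming, toy_handle_ping(), toy_handle_run() and the numeric value of
TOY_PING_PACKAGED_LOADED are all made up for illustration, and a failed
synchronization is modeled as an error return, whereas the real
cpu_synchronize_all_post_init() failure takes the process down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Arbitrary placeholder, not the real QEMU_VM_PING_PACKAGED_LOADED value. */
#define TOY_PING_PACKAGED_LOADED 0x2u

enum ToyState { TOY_POSTCOPY_DEVICE, TOY_POSTCOPY_ACTIVE, TOY_FAILED };

/* Hypothetical stand-in for the incoming migration state. */
typedef struct ToyIncoming {
    enum ToyState state;
    bool package_loaded_ping_received;  /* flag set instead of replying */
    uint32_t pending_ping_value;
    int pongs_sent;
    uint32_t last_pong_value;
    bool sync_should_fail;              /* knob to simulate a sync failure */
} ToyIncoming;

static void toy_send_pong(ToyIncoming *mis, uint32_t value)
{
    mis->pongs_sent++;
    mis->last_pong_value = value;
}

/* Stand-in for the CPU sync; returns 0 on success, -1 on failure. */
static int toy_cpu_sync(const ToyIncoming *mis)
{
    return mis->sync_should_fail ? -1 : 0;
}

/* PING handler: the special ping is only recorded, others get a PONG. */
static void toy_handle_ping(ToyIncoming *mis, uint32_t value)
{
    if (value == TOY_PING_PACKAGED_LOADED) {
        mis->package_loaded_ping_received = true;
        mis->pending_ping_value = value;
        return;
    }
    toy_send_pong(mis, value);
}

/* RUN handler: sync first, then the deferred PONG, then the transition. */
static int toy_handle_run(ToyIncoming *mis)
{
    assert(mis->state == TOY_POSTCOPY_DEVICE);

    if (toy_cpu_sync(mis) != 0) {
        /* No PONG was sent, so the source never left POSTCOPY_DEVICE. */
        mis->state = TOY_FAILED;
        return -1;
    }
    if (mis->package_loaded_ping_received) {
        toy_send_pong(mis, mis->pending_ping_value);
        mis->package_loaded_ping_received = false;
    }
    mis->state = TOY_POSTCOPY_ACTIVE;
    return 0;
}
```

In this model the special PONG can only ever be observed after a
successful synchronization, which is the property the source side needs
in order to resume safely.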

-- 
Peter Xu

Re: [RFC PATCH] migration: Synchronize CPUs sooner in postcopy switchover phase

Posted by Juraj Marcin 4 weeks, 1 day ago

Hi Peter,

On 2026-04-29 15:50, Peter Xu wrote:
> On Thu, Apr 23, 2026 at 05:45:20PM +0200, Juraj Marcin wrote:
> > From: Juraj Marcin <jmarcin@redhat.com>
> > 
> > Previously, the post init CPU synchronization with the accelerator, like
> > KVM, was performed in the bottom half of the POSTCOPY_RUN command
> > handler. However, this causes several problems.
> > 
> > The first issue is that if CPU synchronization fails, the destination QEMU
> > crashes. However, it is too late to recover the source side as the
> > response to the special PING has already been sent, and both sides are
> > already in the POSTCOPY_ACTIVE state.
> > 
> > By moving synchronization before responding, if the machine crashes, the
> > response is never sent and the source side can resume from the
> > POSTCOPY_DEVICE state.
> > 
> > The second issue occurs when migration is paused due to a network
> > failure or user command right after transitioning to the POSTCOPY_ACTIVE
> > state and the CPU synchronization causes a page fault. This page fault
> > blocks the CPU and the main QEMU threads and cannot be resolved until
> > postcopy migration is recovered. However, as libvirt also tries to
> > execute 'cont' QMP command at this time (destination side transitions
> > from POSTCOPY_DEVICE to POSTCOPY_ACTIVE), it will also halt waiting for
> > the response from the blocked main QEMU thread, unaware of the fact that
> > the migration is paused and needs to be recovered. Thus, it will wait
> > indefinitely and never report postcopy migration error.
> > 
> > When the CPU synchronization happens sooner, and the network fails
> > during it, the source side can transition from POSTCOPY_DEVICE to FAILED
> > state and resume safely. If migration is paused later, the main thread
> > won't be blocked by CPU synchronization and can respond to libvirt.
> > 
> > Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> > ---
> > I am posting this as RFC to discuss the point at which the CPU
> > synchronization should happen.
> > 
> > For the POSTCOPY_DEVICE state to be effective, this synchronization must
> > happen before the destination machine responds to the special PING
> > command. This leaves us with 2 options:
> > 
> > 1) In the PING command handler before responding to the specific
> >    request, as proposed in this patch.
> > 
> > 2) After loading CPU VMSD, for example in the post_load hook.
> > 
> > The first solution limits the number of places where synchronization
> > needs to happen, however, having it in the PING command handler feels
> > somewhat hacky.
> 
> Yes.
> 
> > 
> > I have also tested the second solution, and while it seems natural to
> > synchronize the CPU after its data is loaded in the post_load, there are
> > multiple CPU types and VMSDs and each one would need to call
> > synchronization in its post_load hook. This would basically revert Jan
> > Kiszka's commit [1] which refactored CPU synchronization and united it
> > to one place.
> 
> What I was thinking is not reverting all of it, but only removing the
> migration relevant paths for post_init().
> 
> For example, we have different reasons to sync CPU, and we have two
> directions to do that (from/to kernel, in KVM's context). For migration,
> what used to be confusing is why we need to have CPU specific post_init()
> hooks when each CPU has VMSD and post_load() on its own.
> 
> The other one is on the savevm side (cpu_synchronize_all_states()) and we can
> at least keep it as-is for now, to simplify the discussion.
> 
> So if we want to move this into any of such post_load(), it means removal
> of three call sites of post_init() only in migration/savevm.c, then do
> per-CPU's post_init() in its post_load() hook.
> 
> I don't know the real answer to this one; I recall you tried it out but hit
> some ARM-specific issue.  I just want to make sure we're on the same page
> on what you have experimented with.

I thought I did, however, that ARM issue in the CI turned out to be
unrelated.

> 
> That said, I do see some complexity in such a change already.  For
> example, cpu_vmstate_register() seems to be able to register more than one
> VMSD for each vCPU.  I don't know if it means that, at least, we can't simply
> put it into post_load() of vmstate_cpu_common, as when reaching there it is
> not guaranteed that all vCPUs' registers are up to date.
> 
> Besides..
> 
> I do have another thought, though, which avoids the hacky part of this patch
> and looks pretty safe: the dest QEMU does not reply directly to the
> QEMU_VM_PING_PACKAGED_LOADED ping; instead it sets a flag. Then we move the
> conditional PONG into loadvm_postcopy_handle_run().
> 
> With that, we can move post_init() into loadvm_postcopy_handle_run()
> altogether.
> 
>   loadvm_postcopy_handle_run():
>     ...
>     cpu_synchronize_all_post_init();
>     if (package_loaded_ping_received) {
>         send_pong();
>     }
>     migrate_set_state(POSTCOPY_DEVICE, POSTCOPY_ACTIVE);
>     migration_bh_schedule(mis);
> 
> Would this work?

Interesting idea. I think it should work, but I am a bit unsure whether
delaying the PONG response in such a way could potentially break
anything.

Thank you!

Best regards,

Juraj Marcin
