[PATCH 08/13] migration: Thread-ify precopy vmstate load process

Peter Xu posted 13 patches 3 months, 2 weeks ago
Maintainers: Alex Williamson <alex@shazbot.org>, "Cédric Le Goater" <clg@redhat.com>, "Daniel P. Berrangé" <berrange@redhat.com>, Hailiang Zhang <zhanghailiang@xfusion.com>, Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>, Li Zhijian <lizhijian@fujitsu.com>
Posted by Peter Xu 3 months, 2 weeks ago
The migration module has been around for 10+ years.  Initially, it was in
most cases based on coroutines.  As more features were added into the
framework, like postcopy, multifd, etc., it became a mixture of threads and
coroutines.

I'm guessing coroutines just can't address all the issues that migration
wants to resolve.

After all these years, migration is now heavily based on a threaded model.

Now there's still one major part of the migration framework that is not
thread-based: precopy load.  We have done the load in a separate thread for
postcopy since the first day postcopy was introduced; however, that
requires a separate state transition away from precopy loading all the
devices first, which still happens in a coroutine on the main thread.

This patch moves the migration incoming side to run inside a separate
thread (mig/dst/main), just like the src (mig/src/main).  The entry point
is migration_incoming_thread().

Quite a few things are needed to make it fly..  One note here is that we
need to change all these things in one patch to avoid breaking anything.
The other way to do this would be to add code making all the paths this
patch touches ready for either coroutine or thread; that may cause
confusion in another way.  So reviewers, please accept my sincere apology
for the difficulty of reviewing this patch: it covers a few modules at the
same time, and with some risky changes.

BQL Analysis
============

Firstly, moving the load over to a thread means the thread cannot hold the
BQL during the whole loading process anymore, because otherwise it could
block the main thread from taking the BQL for all kinds of other concurrent
tasks (for example, processing QMP / HMP commands).

Here the first question to ask is: what needs BQL during precopy load, and
what doesn't?

Most of the load process shouldn't need the BQL, especially when it's about
RAM.  After all, RAM is still the major chunk of data to move in a live
migration.  VFIO has started to change that; still, VFIO is per-device, so
it shouldn't need the BQL in most cases either.

Generic device loads will need the BQL, likely not when receiving VMSDs,
but when applying them.  One example is that any post_load() could
potentially inject memory regions, causing memory transactions to happen.
Those need to update the global address spaces, hence require the BQL.  The
other one is CPU sync operations: even if the sync alone may not need the
BQL (which is still to be further justified), run_on_cpu() will need it.

For that, the qemu_loadvm_state() and qemu_loadvm_state_main() functions
now need to take a "bql_held" parameter saying whether the BQL is held.  We
could use things like BQL_LOCK_GUARD(), but this patch goes with explicit
locking rather than relying on the bql_locked TLS variable.  In the case of
migration, we always know whether the BQL is held in each context, as long
as we can keep passing that information downwards.
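
Below is a minimal, self-contained sketch of that "bql_held" plumbing
pattern.  Every name here is an illustrative stand-in rather than a real
QEMU API: a plain pthread mutex models the BQL, and fake_post_load()
models a device hook that genuinely needs the lock.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t fake_bql = PTHREAD_MUTEX_INITIALIZER;

    /* Models a post_load() hook that really needs the lock. */
    static void fake_post_load(void)
    {
        printf("post_load under lock\n");
    }

    /* Apply one device section; take the lock only if not already held. */
    static void load_section_apply(bool bql_held)
    {
        if (!bql_held) {
            pthread_mutex_lock(&fake_bql);
        }
        fake_post_load();
        if (!bql_held) {
            pthread_mutex_unlock(&fake_bql);
        }
    }

    /* Counterpart of qemu_loadvm_state(): the lock state is passed down. */
    static int fake_loadvm_state(bool bql_held)
    {
        /* Receiving VMSD data needs no lock; only applying it does. */
        load_section_apply(bql_held);
        return 0;
    }

    int main(void)
    {
        /* New model: the incoming thread calls with bql_held=false. */
        return fake_loadvm_state(false);
    }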

COLO
====

COLO assumed the dest VM load happens in a coroutine.  After this patch, it
doesn't anymore.  Adapt to that by invoking colo_incoming_co() directly
from migration_incoming_thread().

The name (colo_incoming_co()) isn't appropriate anymore.  Change it to
colo_incoming_wait(), removing the coroutine annotation alongside.

Remove all the bql_lock() implications in COLO; e.g., colo_incoming_co()
used to release the lock for a short period around the join().  Now that's
not needed.  Instead, take the BQL only when needed
(colo_release_ram_cache()).

Meanwhile, there's the colo_incoming_co variable, which used to store the
COLO incoming coroutine, only to be kicked when a secondary failover
happens.

To recap, here is what should happen on such a failover (taking the example
of the QMP command x-colo-lost-heartbeat triggered on the dest QEMU):

  - The QMP command will kick off both the coroutine and the COLO
    thread (colo_process_incoming_thread()), with something like:

    /* Notify COLO incoming thread that failover work is finished */
    qemu_event_set(&mis->colo_incoming_event);

    qemu_coroutine_enter(mis->colo_incoming_co);

  - The coroutine, which yielded itself before, now resumes after enter(),
    then it'll wait for the join():

    mis->colo_incoming_co = qemu_coroutine_self();
    qemu_coroutine_yield();
    mis->colo_incoming_co = NULL;

    /* Wait checkpoint incoming thread exit before free resource */
    qemu_thread_join(&th);

Here, when switching to the thread model, it should be fine to remove the
colo_incoming_co variable completely, because then the incoming thread will
(instead of yielding a coroutine) wait at qemu_thread_join() until the COLO
thread completes execution (after receiving colo_incoming_event).
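
For illustration, here is a self-contained model of that new flow, using
pthreads and a semaphore in place of QemuThread/QemuEvent; the names are
stand-ins, not the real COLO code:

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    static sem_t colo_incoming_event;  /* models mis->colo_incoming_event */

    /* Models colo_process_incoming_thread(): exits once failover is done. */
    static void *colo_thread(void *opaque)
    {
        sem_wait(&colo_incoming_event);
        printf("colo thread: failover finished, exiting\n");
        return NULL;
    }

    /* Models colo_incoming_wait() called from migration_incoming_thread(). */
    static void colo_incoming_wait_model(void)
    {
        pthread_t th;

        pthread_create(&th, NULL, colo_thread, NULL);
        /* No coroutine to yield anymore: just join until COLO exits. */
        pthread_join(th, NULL);
    }

    int main(void)
    {
        sem_init(&colo_incoming_event, 0, 0);
        /*
         * Models the QMP x-colo-lost-heartbeat handler firing the event;
         * in the real flow this happens concurrently from the main thread.
         */
        sem_post(&colo_incoming_event);
        colo_incoming_wait_model();
        return 0;
    }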

RDMA
====

With the prior patch making sure io_watch won't block for RDMA iochannels,
the RDMA thread should only block in its io_readv/io_writev functions.
When a disconnection is detected (as in rdma_cm_poll_handler()), the update
to the "errored" field will be immediately reflected in the migration
incoming thread.  Hence the coroutine kick is not needed anymore for RDMA.

Now that the load runs in a thread, we also can't have
rdma_cm_poll_handler() keep polling the fd and operating on it from the
main thread.  Drop it completely; that should be fine because
qemu_rdma_wait_comp_channel() should also be monitoring the fd.

This almost reverts commit 923709896b1b01fb982c93492ad01b233e6b6023.

Unfortunately, this change needs to be in the same patch that introduces
the thread; otherwise there is a risk of racing.

TODO
====

Currently the BQL is taken while loading a START|FULL section.  When the IO
hangs (e.g. a network issue) during this process, it could potentially
block others, like the monitor servers.  One solution is to break the BQL
into smaller granularity and leave the IOs always BQL-free.  That'll need
more justification.

For example, there are at least four things that need some closer
attention:

  - SaveVMHandlers' load_state(): these likely DO NOT need the BQL, but we
  need to justify all of them (not to mention, some of them look prone to
  being rewritten as VMSDs..)

  - VMSD's pre_load(): in most cases, this DOES NOT really need the BQL,
  but sometimes maybe it will!  Double-checking on this will be needed.

  - VMSD's post_load(): in many cases, this DOES need the BQL, for example
  for address space operations.  Likely we should just take it for any
  post_load().

  - A VMSD field's get(): this is tricky!  It could internally be anything,
  even if it is only one field.  E.g. some users load a whole VMSD through
  a SINGLE field, which can further introduce more possibilities.

In general, QEMUFile IOs should not need the BQL, that is, when receiving
the VMSD data and waiting for e.g. the socket buffer to be refilled.  But
that's the easy part.
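
To make the "IOs always BQL-free" idea concrete, here is a self-contained
toy that assumes nothing about the real QEMUFile internals: a pipe stands
in for the migration channel and a pthread mutex for the BQL.  The blocking
read happens with the lock dropped, so a stalled stream cannot block the
main loop.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t fake_bql = PTHREAD_MUTEX_INITIALIZER;
    static int stream_fd[2];   /* pipe modelling the incoming channel */

    static void *incoming_thread(void *opaque)
    {
        uint8_t byte;

        /* The IO wait happens without the lock... */
        if (read(stream_fd[0], &byte, 1) == 1) {
            /* ...and the lock is only taken to apply the section. */
            pthread_mutex_lock(&fake_bql);
            printf("applied section %u under lock\n", byte);
            pthread_mutex_unlock(&fake_bql);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t th;
        uint8_t section = 42;

        if (pipe(stream_fd) < 0) {
            return 1;
        }
        pthread_create(&th, NULL, incoming_thread, NULL);

        /* The main thread stays responsive during the IO wait. */
        pthread_mutex_lock(&fake_bql);
        pthread_mutex_unlock(&fake_bql);

        if (write(stream_fd[1], &section, 1) != 1) {
            return 1;
        }
        pthread_join(th, NULL);
        return 0;
    }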

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 include/migration/colo.h |  6 ++--
 migration/migration.h    | 14 +++-----
 migration/colo-stubs.c   |  2 +-
 migration/colo.c         | 24 ++++---------
 migration/migration.c    | 77 +++++++++++++++++++++++++---------------
 migration/rdma.c         | 34 +-----------------
 migration/savevm.c       |  8 ++---
 migration/trace-events   |  4 +--
 8 files changed, 69 insertions(+), 100 deletions(-)

diff --git a/include/migration/colo.h b/include/migration/colo.h
index d4fe422e4d..5de7d715a7 100644
--- a/include/migration/colo.h
+++ b/include/migration/colo.h
@@ -44,12 +44,10 @@ void colo_do_failover(void);
 void colo_checkpoint_delay_set(void);
 
 /*
- * Starts COLO incoming process. Called from process_incoming_migration_co()
+ * Starts COLO incoming process. Called from migration_incoming_thread()
  * after loading the state.
- *
- * Called with BQL locked, may temporary release BQL.
  */
-void coroutine_fn colo_incoming_co(void);
+void colo_incoming_wait(void);
 
 void colo_shutdown(void);
 #endif
diff --git a/migration/migration.h b/migration/migration.h
index e1c5029110..0d22dc8cc2 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -214,6 +214,10 @@ struct MigrationIncomingState {
     bool           have_listen_thread;
     QemuThread     listen_thread;
 
+    /* Migration main recv thread */
+    bool           have_recv_thread;
+    QemuThread     recv_thread;
+
     /* For the kernel to send us notifications */
     int       userfault_fd;
     /* To notify the fault_thread to wake, e.g., when need to quit */
@@ -272,15 +276,7 @@ struct MigrationIncomingState {
 
     MigrationStatus state;
 
-    /*
-     * The incoming migration coroutine, non-NULL during qemu_loadvm_state().
-     * Used to wake the migration incoming coroutine from rdma code. How much is
-     * it safe - it's a question.
-     */
-    Coroutine *loadvm_co;
-
-    /* The coroutine we should enter (back) after failover */
-    Coroutine *colo_incoming_co;
+    /* Notify secondary VM to move on */
     QemuEvent colo_incoming_event;
 
     /* Optional load threads pool and its thread exit request flag */
diff --git a/migration/colo-stubs.c b/migration/colo-stubs.c
index e22ce65234..ef77d1ab4b 100644
--- a/migration/colo-stubs.c
+++ b/migration/colo-stubs.c
@@ -9,7 +9,7 @@ void colo_shutdown(void)
 {
 }
 
-void coroutine_fn colo_incoming_co(void)
+void colo_incoming_wait(void)
 {
 }
 
diff --git a/migration/colo.c b/migration/colo.c
index 4fd586951a..81276a3e65 100644
--- a/migration/colo.c
+++ b/migration/colo.c
@@ -147,11 +147,6 @@ static void secondary_vm_do_failover(void)
     }
     /* Notify COLO incoming thread that failover work is finished */
     qemu_event_set(&mis->colo_incoming_event);
-
-    /* For Secondary VM, jump to incoming co */
-    if (mis->colo_incoming_co) {
-        qemu_coroutine_enter(mis->colo_incoming_co);
-    }
 }
 
 static void primary_vm_do_failover(void)
@@ -848,10 +843,8 @@ static void *colo_process_incoming_thread(void *opaque)
 
     mis->to_src_file = qemu_file_get_return_path(mis->from_src_file);
     /*
-     * Note: the communication between Primary side and Secondary side
-     * should be sequential, we set the fd to unblocked in migration incoming
-     * coroutine, and here we are in the COLO incoming thread, so it is ok to
-     * set the fd back to blocked.
+     * Here we are in the COLO incoming thread, so it is ok to set the fd
+     * to blocking.
      */
     if (!qemu_file_set_blocking(mis->from_src_file, true, &local_err)) {
         error_report_err(local_err);
@@ -927,27 +920,22 @@ out:
     return NULL;
 }
 
-void coroutine_fn colo_incoming_co(void)
+/* Wait for failover */
+void colo_incoming_wait(void)
 {
     MigrationIncomingState *mis = migration_incoming_get_current();
     QemuThread th;
 
-    assert(bql_locked());
     assert(migration_incoming_colo_enabled());
 
     qemu_thread_create(&th, MIGRATION_THREAD_DST_COLO,
                        colo_process_incoming_thread,
                        mis, QEMU_THREAD_JOINABLE);
 
-    mis->colo_incoming_co = qemu_coroutine_self();
-    qemu_coroutine_yield();
-    mis->colo_incoming_co = NULL;
-
-    bql_unlock();
     /* Wait checkpoint incoming thread exit before free resource */
     qemu_thread_join(&th);
-    bql_lock();
 
-    /* We hold the global BQL, so it is safe here */
+    bql_lock();
     colo_release_ram_cache();
+    bql_unlock();
 }
diff --git a/migration/migration.c b/migration/migration.c
index 38a584afae..728d02dbee 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -491,6 +491,11 @@ void migration_incoming_state_destroy(void)
         mis->postcopy_qemufile_dst = NULL;
     }
 
+    if (mis->have_recv_thread) {
+        qemu_thread_join(&mis->recv_thread);
+        mis->have_recv_thread = false;
+    }
+
     cpr_set_incoming_mode(MIG_MODE_NONE);
     yank_unregister_instance(MIGRATION_YANK_INSTANCE);
 }
@@ -861,30 +866,46 @@ static void process_incoming_migration_bh(void *opaque)
     migration_incoming_state_destroy();
 }
 
-static void coroutine_fn
-process_incoming_migration_co(void *opaque)
+static void migration_incoming_state_destroy_bh(void *opaque)
+{
+    struct MigrationIncomingState *mis = opaque;
+
+    migration_incoming_state_destroy();
+
+    if (mis->exit_on_error) {
+        /*
+         * NOTE: this exit() should better happen in the main thread, as
+         * the exit notifier may require BQL which can deadlock.  See
+         * commit e7bc0204e57836 for example.
+         */
+        exit(EXIT_FAILURE);
+    }
+}
+
+static void *migration_incoming_thread(void *opaque)
 {
     MigrationState *s = migrate_get_current();
-    MigrationIncomingState *mis = migration_incoming_get_current();
+    MigrationIncomingState *mis = opaque;
     PostcopyState ps;
     int ret;
     Error *local_err = NULL;
 
+    rcu_register_thread();
+
     assert(mis->from_src_file);
+    assert(!bql_locked());
 
     mis->largest_page_size = qemu_ram_pagesize_largest();
     postcopy_state_set(POSTCOPY_INCOMING_NONE);
     migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
                       MIGRATION_STATUS_ACTIVE);
 
-    mis->loadvm_co = qemu_coroutine_self();
-    ret = qemu_loadvm_state(mis->from_src_file, true, &local_err);
-    mis->loadvm_co = NULL;
+    ret = qemu_loadvm_state(mis->from_src_file, false, &local_err);
 
     trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
 
     ps = postcopy_state_get();
-    trace_process_incoming_migration_co_end(ret, ps);
+    trace_process_incoming_migration_end(ret, ps);
     if (ps != POSTCOPY_INCOMING_NONE) {
         if (ps == POSTCOPY_INCOMING_ADVISE) {
             /*
@@ -898,7 +919,7 @@ process_incoming_migration_co(void *opaque)
              * Postcopy was started, cleanup should happen at the end of the
              * postcopy thread.
              */
-            trace_process_incoming_migration_co_postcopy_end_main();
+            trace_process_incoming_migration_postcopy_end_main();
             goto out;
         }
         /* Else if something went wrong then just fall out of the normal exit */
@@ -911,8 +932,8 @@ process_incoming_migration_co(void *opaque)
     }
 
     if (migration_incoming_colo_enabled()) {
-        /* yield until COLO exit */
-        colo_incoming_co();
+        /* wait until COLO exits */
+        colo_incoming_wait();
     }
 
     migration_bh_schedule(process_incoming_migration_bh, mis);
@@ -924,28 +945,22 @@ fail:
     migrate_set_error(s, local_err);
     error_free(local_err);
 
-    migration_incoming_state_destroy();
+    WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
+        error_report_err(s->error);
+        s->error = NULL;
+    }
 
-    if (mis->exit_on_error) {
-        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
-            error_report_err(s->error);
-            s->error = NULL;
-        }
+    /*
+     * There's some step of the destroy process that will need to happen in
+     * the main thread (e.g. joining this thread itself).  Leave to a BH.
+     */
+    migration_bh_schedule(migration_incoming_state_destroy_bh, (void *)mis);
 
-        exit(EXIT_FAILURE);
-    } else {
-        /*
-         * Report the error here in case that QEMU abruptly exits
-         * when postcopy is enabled.
-         */
-        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
-            error_report_err(s->error);
-            s->error = NULL;
-        }
-    }
 out:
     /* Pairs with the refcount taken in qmp_migrate_incoming() */
     migrate_incoming_unref_outgoing_state();
+    rcu_unregister_thread();
+    return NULL;
 }
 
 /**
@@ -963,8 +978,12 @@ static void migration_incoming_setup(QEMUFile *f)
 
 void migration_incoming_process(void)
 {
-    Coroutine *co = qemu_coroutine_create(process_incoming_migration_co, NULL);
-    qemu_coroutine_enter(co);
+    MigrationIncomingState *mis = migration_incoming_get_current();
+
+    mis->have_recv_thread = true;
+    qemu_thread_create(&mis->recv_thread, "mig/dst/main",
+                       migration_incoming_thread, mis,
+                       QEMU_THREAD_JOINABLE);
 }
 
 /* Returns true if recovered from a paused migration, otherwise false */
diff --git a/migration/rdma.c b/migration/rdma.c
index 0e5e02cdca..3389f6448b 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -3051,37 +3051,6 @@ int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
 
 static void rdma_accept_incoming_migration(void *opaque);
 
-static void rdma_cm_poll_handler(void *opaque)
-{
-    RDMAContext *rdma = opaque;
-    struct rdma_cm_event *cm_event;
-    MigrationIncomingState *mis = migration_incoming_get_current();
-
-    if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
-        error_report("get_cm_event failed %d", errno);
-        return;
-    }
-
-    if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
-        cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
-        if (!rdma->errored &&
-            migration_incoming_get_current()->state !=
-              MIGRATION_STATUS_COMPLETED) {
-            error_report("receive cm event, cm event is %d", cm_event->event);
-            rdma->errored = true;
-            if (rdma->return_path) {
-                rdma->return_path->errored = true;
-            }
-        }
-        rdma_ack_cm_event(cm_event);
-        if (mis->loadvm_co) {
-            qemu_coroutine_enter(mis->loadvm_co);
-        }
-        return;
-    }
-    rdma_ack_cm_event(cm_event);
-}
-
 static int qemu_rdma_accept(RDMAContext *rdma)
 {
     Error *err = NULL;
@@ -3199,8 +3168,7 @@ static int qemu_rdma_accept(RDMAContext *rdma)
                             NULL,
                             (void *)(intptr_t)rdma->return_path);
     } else {
-        qemu_set_fd_handler(rdma->channel->fd, rdma_cm_poll_handler,
-                            NULL, rdma);
+        qemu_set_fd_handler(rdma->channel->fd, NULL, NULL, NULL);
     }
 
     ret = rdma_accept(rdma->cm_id, &conn_param);
diff --git a/migration/savevm.c b/migration/savevm.c
index 44aadc2f51..991f46593c 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2118,7 +2118,7 @@ static void *postcopy_ram_listen_thread(void *opaque)
     qemu_file_set_blocking(f, true, &error_fatal);
 
     /* TODO: sanity check that only postcopiable data will be loaded here */
-    load_res = qemu_loadvm_state_main(f, mis, true, &local_err);
+    load_res = qemu_loadvm_state_main(f, mis, false, &local_err);
 
     /*
      * This is tricky, but, mis->from_src_file can change after it
@@ -2415,11 +2415,11 @@ static void loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
  * Immediately following this command is a blob of data containing an embedded
  * chunk of migration stream; read it and load it.
  *
- * @mis: Incoming state
- * @length: Length of packaged data to read
+ * @mis:      Incoming state
+ * @bql_held: Whether BQL is held already
+ * @errp:     The Error** to set when returning failures.
  *
  * Returns: Negative values on error
- *
  */
 static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis,
                                       bool bql_held, Error **errp)
diff --git a/migration/trace-events b/migration/trace-events
index e8edd1fbba..2b7b522e73 100644
--- a/migration/trace-events
+++ b/migration/trace-events
@@ -193,8 +193,8 @@ source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
 source_return_path_thread_switchover_acked(void) ""
 migration_thread_low_pending(uint64_t pending) "%" PRIu64
 migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64
-process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
-process_incoming_migration_co_postcopy_end_main(void) ""
+process_incoming_migration_end(int ret, int ps) "ret=%d postcopy-state=%d"
+process_incoming_migration_postcopy_end_main(void) ""
 postcopy_preempt_enabled(bool value) "%d"
 migration_precopy_complete(void) ""
 
-- 
2.50.1
Re: [PATCH 08/13] migration: Thread-ify precopy vmstate load process
Posted by Lukas Straub 3 weeks, 2 days ago
On Wed, 22 Oct 2025 15:26:07 -0400
Peter Xu <peterx@redhat.com> wrote:

> Migration module was there for 10+ years.  Initially, it was in most cases
> based on coroutines.  As more features were added into the framework, like
> postcopy, multifd, etc.. it became a mixture of threads and coroutines.
> 
> I'm guessing coroutines just can't fix all issues that migration want to
> resolve.
> 
> After all these years, migration is now heavily based on a threaded model.
> 
> Now there's still a major part of migration framework that is still not
> thread-based, which is precopy load.  We do load in a separate thread in
> postcopy since the 1st day postcopy was introduced, however that requires a
> separate state transition from precopy loading all devices first, which
> still happens in the main thread of a coroutine.
> 
> This patch tries to move the migration incoming side to be run inside a
> separate thread (mig/dst/main) just like the src (mig/src/main).  The
> entrance to be migration_incoming_thread().
> 
> Quite a few things are needed to make it fly..  One note here is we need to
> change all these things in one patch to not break anything.  The other way
> to do this is add code to make all paths (that this patch touched) be ready
> for either coroutine or thread.  That may cause confusions in another way.
> So reviewers, please take my sincere apology on the hardness of reviewing
> this patch: it covers a few modules at the same time, and with some risky
> changes.
> 
> BQL Analysis
> ============
> 
> Firstly, when moving it over to the thread, it means the thread cannot take
> BQL during the whole process of loading anymore, because otherwise it can
> block main thread from using the BQL for all kinds of other concurrent
> tasks (for example, processing QMP / HMP commands).
> 
> Here the first question to ask is: what needs BQL during precopy load, and
> what doesn't?
> 
> Most of the load process shouldn't need BQL, especially when it's about
> RAM.  After all, RAM is still the major chunk of data to move for a live
> migration process.  VFIO started to change that, though, but still, VFIO is
> per-device so that shouldn't need BQL either in most cases.
> 
> Generic device loads will need BQL, likely not when receiving VMSDs, but
> when applying them.  One example is any post_load() could potentially
> inject memory regions causing memory transactions to happen.  That'll need
> to update the global address spaces, hence requires BQL.  The other one is
> CPU sync operations, even if the sync alone may not need BQL (which is
> still to be further justified), run_on_cpu() will need it.
> 
> For that, qemu_loadvm_state() and qemu_loadvm_state_main() functions need
> to now take a "bql_held" parameter saying whether bql is held.  We could
> use things like BQL_LOCK_GUARD(), but this patch goes with explicit
> lockings rather than relying on bql_locked TLS variable.  In case of
> migration, we always know whether BQL is held in different context as long
> as we can still pass that information downwards.
> 
> COLO
> ====
> 
> COLO assumed the dest VM load happens in a coroutine.  After this patch,
> it's not anymore.  Change that by invoking colo_incoming_co() directly from
> the migration_incoming_thread().
> 
> The name (colo_incoming_co()) isn't proper anymore.  Change it to
> colo_incoming_wait(), removing the coroutine annotation alongside.
> 
> Remove all the bql_lock() implications in COLO, e.g., colo_incoming_co()
> used to release the lock for a short period while join().  Now it's not
> needed.  Instead, taking BQL but only when needed (colo_release_ram_cache).
> 
> At the meantime, there's colo_incoming_co variable that used to store the
> COLO incoming coroutine, only to be kicked off when a secondary failover
> happens.
> 
> To recap, what should happen for such failover should be (taking example of
> a QMP command x-colo-lost-heartbeat triggering on dest QEMU):
> 
>   - The QMP command will kick off both the coroutine and the COLO
>     thread (colo_process_incoming_thread()), with something like:
> 
>     /* Notify COLO incoming thread that failover work is finished */
>     qemu_event_set(&mis->colo_incoming_event);
> 
>     qemu_coroutine_enter(mis->colo_incoming_co);
> 
>   - The coroutine, which yielded itself before, now resumes after enter(),
>     then it'll wait for the join():
> 
>     mis->colo_incoming_co = qemu_coroutine_self();
>     qemu_coroutine_yield();
>     mis->colo_incoming_co = NULL;
> 
>     /* Wait checkpoint incoming thread exit before free resource */
>     qemu_thread_join(&th);
> 
> Here, when switching to a thread model, it should be fine removing
> colo_incoming_co variable completely, because if so, the incoming thread
> will (instead of yielding the coroutine) wait at qemu_thread_join() until
> the colo thread completes execution (after receiving colo_incoming_event).
> 
> [...]
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  include/migration/colo.h |  6 ++--
>  migration/migration.h    | 14 +++-----
>  migration/colo-stubs.c   |  2 +-
>  migration/colo.c         | 24 ++++---------
>  migration/migration.c    | 77 +++++++++++++++++++++++++---------------
>  migration/rdma.c         | 34 +-----------------
>  migration/savevm.c       |  8 ++---
>  migration/trace-events   |  4 +--
>  8 files changed, 69 insertions(+), 100 deletions(-)
> 

Looks good to me, you got the COLO parts exactly right. Thank you for
putting so much effort into this.

Reviewed-by: Lukas Straub <lukasstraub2@web.de>

Best Regards,
Lukas Straub

Re: [PATCH 08/13] migration: Thread-ify precopy vmstate load process
Posted by Fabiano Rosas 1 month ago
Peter Xu <peterx@redhat.com> writes:

> Migration module was there for 10+ years.  Initially, it was in most cases
> based on coroutines.  As more features were added into the framework, like
> postcopy, multifd, etc.. it became a mixture of threads and coroutines.
>
> I'm guessing coroutines just can't fix all issues that migration want to
> resolve.
>
> After all these years, migration is now heavily based on a threaded model.
>
> Now there's still a major part of migration framework that is still not
> thread-based, which is precopy load.  We do load in a separate thread in
> postcopy since the 1st day postcopy was introduced, however that requires a
> separate state transition from precopy loading all devices first, which
> still happens in the main thread of a coroutine.
>
> This patch tries to move the migration incoming side to be run inside a
> separate thread (mig/dst/main) just like the src (mig/src/main).  The
> entrance to be migration_incoming_thread().
>
> Quite a few things are needed to make it fly..  One note here is we need to
> change all these things in one patch to not break anything.  The other way
> to do this is add code to make all paths (that this patch touched) be ready
> for either coroutine or thread.  That may cause confusions in another way.
> So reviewers, please take my sincere apology on the hardness of reviewing
> this patch: it covers a few modules at the same time, and with some risky
> changes.
>
> BQL Analysis
> ============
>
> Firstly, when moving it over to the thread, it means the thread cannot take
> BQL during the whole process of loading anymore, because otherwise it can
> block main thread from using the BQL for all kinds of other concurrent
> tasks (for example, processing QMP / HMP commands).
>
> Here the first question to ask is: what needs BQL during precopy load, and
> what doesn't?
>

I just noticed that the BQL held at process_incoming_migration_co is
also responsible for stopping qmp_migrate_set_capabilities from being
dispatched.

At any point during incoming migration when the BQL is unlocked, we have a
window where a capability could be changed. Same for parameters, for that
matter.

To make matters worse, the -incoming cmdline will trigger
qmp_migrate_incoming->...->migration_transport_compatible early on, but
until the channels finally connect and process_incoming_migration_co
starts, it's possible to just change a capability in an incompatible way
and the transport will never be validated again.

One example:

-- >8 --
From 99bd88aa0a8b6d4e7c52196f25d344a2800b3d89 Mon Sep 17 00:00:00 2001
From: Fabiano Rosas <farosas@suse.de>
Date: Thu, 8 Jan 2026 17:21:20 -0300
Subject: [PATCH] tmp

---
 tests/qtest/migration/precopy-tests.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/tests/qtest/migration/precopy-tests.c b/tests/qtest/migration/precopy-tests.c
index aca7ed51ef..3f1a2870ee 100644
--- a/tests/qtest/migration/precopy-tests.c
+++ b/tests/qtest/migration/precopy-tests.c
@@ -158,6 +158,13 @@ static int new_rdma_link(char *buffer, bool ipv6)
     return -1;
 }
 
+static void *migrate_rdma_set_caps(QTestState *from, QTestState *to)
+{
+    migrate_set_capability(to, "mapped-ram", true);
+
+    return NULL;
+}
+
 static void __test_precopy_rdma_plain(MigrateCommon *args, bool ipv6)
 {
     char buffer[128] = {};
@@ -185,6 +192,7 @@ static void __test_precopy_rdma_plain(MigrateCommon *args, bool ipv6)
 
     args->listen_uri = uri;
     args->connect_uri = uri;
+    args->start_hook = migrate_rdma_set_caps;
 
     test_precopy_common(args);
 }
-- 
2.51.0
Re: [PATCH 08/13] migration: Thread-ify precopy vmstate load process
Posted by Peter Xu 3 weeks, 6 days ago
On Thu, Jan 08, 2026 at 05:27:37PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > Migration module was there for 10+ years.  Initially, it was in most cases
> > based on coroutines.  As more features were added into the framework, like
> > postcopy, multifd, etc.. it became a mixture of threads and coroutines.
> >
> > I'm guessing coroutines just can't fix all issues that migration want to
> > resolve.
> >
> > After all these years, migration is now heavily based on a threaded model.
> >
> > Now there's still a major part of migration framework that is still not
> > thread-based, which is precopy load.  We do load in a separate thread in
> > postcopy since the 1st day postcopy was introduced, however that requires a
> > separate state transition from precopy loading all devices first, which
> > still happens in the main thread of a coroutine.
> >
> > This patch tries to move the migration incoming side to be run inside a
> > separate thread (mig/dst/main) just like the src (mig/src/main).  The
> > entrance to be migration_incoming_thread().
> >
> > Quite a few things are needed to make it fly..  One note here is we need to
> > change all these things in one patch to not break anything.  The other way
> > to do this is add code to make all paths (that this patch touched) be ready
> > for either coroutine or thread.  That may cause confusions in another way.
> > So reviewers, please take my sincere apology on the hardness of reviewing
> > this patch: it covers a few modules at the same time, and with some risky
> > changes.
> >
> > BQL Analysis
> > ============
> >
> > Firstly, when moving it over to the thread, it means the thread cannot take
> > BQL during the whole process of loading anymore, because otherwise it can
> > block main thread from using the BQL for all kinds of other concurrent
> > tasks (for example, processing QMP / HMP commands).
> >
> > Here the first question to ask is: what needs BQL during precopy load, and
> > what doesn't?
> >
> 
> I just noticed that the BQL held at process_incoming_migration_co is
> also responsible for stopping qmp_migrate_set_capabilities from being
> dispatched.

I don't know if it is by design, or even if it will be guaranteed to work..

Consider that the migration incoming coroutine runs into qemu_get_byte(),
and then proactively yields (qemu_coroutine_yield()) when the incoming port
is blocked on read..

AFAIU, a proper fix for that (note, this will currently break tests) is:

bool migration_is_running(void)
 {
-    MigrationState *s = current_migration;
+    MigrationStatus state;
 
-    if (!s) {
-        return false;
+    if (runstate_check(RUN_STATE_INMIGRATE)) {
+        MigrationIncomingState *mis = migration_incoming_get_current();
+
+        if (!mis) {
+            return false;
+        }
+
+        state = mis->state;
+    } else {
+        MigrationState *s = migrate_get_current();
+
+        if (!s) {
+            return false;
+        }
+
+        state = s->state;
     }
 
-    switch (s->state) {
+    switch (state) {
     case MIGRATION_STATUS_ACTIVE:
     case MIGRATION_STATUS_POSTCOPY_DEVICE:
     case MIGRATION_STATUS_POSTCOPY_ACTIVE:

> 
> Any point during incoming migration when BQL is unlocked we have a
> window where a capability could be changed. Same for parameters, for
> that matter.
> 
> To make matters worse, the -incoming cmdline will trigger
> qmp_migrate_incoming->...->migration_transport_compatible early on, but
> until the channels finally connect and process_incoming_migration_co
> starts it's possible to just change a capability in an incompatible way
> and the transport will never be validated again.

Right.  The above should fix it, but I believe it also means that after
"-incoming tcp:xxx" (or anything not "defer") we should forbid changing
migration caps or params on the destination.

As discussed above, that'll at least break our qtests.  But frankly
speaking I think that's the right thing to do..  I hope libvirt always
works with "defer" and never updates any caps/params after QMP
migrate_incoming.

So I wonder if I should continue with the above patch, and then fix our
qtests.  Your work from the other "merge caps+params" series might also
work here, actually, if we make sure everything will be set along with the
single QMP migrate_incoming command.

Let me know your initial thoughts, then I'll see what I can do..

Thanks,

> 
> One example:
> 
> -- >8 --
> From 99bd88aa0a8b6d4e7c52196f25d344a2800b3d89 Mon Sep 17 00:00:00 2001
> From: Fabiano Rosas <farosas@suse.de>
> Date: Thu, 8 Jan 2026 17:21:20 -0300
> Subject: [PATCH] tmp
> 
> ---
>  tests/qtest/migration/precopy-tests.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/tests/qtest/migration/precopy-tests.c b/tests/qtest/migration/precopy-tests.c
> index aca7ed51ef..3f1a2870ee 100644
> --- a/tests/qtest/migration/precopy-tests.c
> +++ b/tests/qtest/migration/precopy-tests.c
> @@ -158,6 +158,13 @@ static int new_rdma_link(char *buffer, bool ipv6)
>      return -1;
>  }
>  
> +static void *migrate_rdma_set_caps(QTestState *from, QTestState *to)
> +{
> +    migrate_set_capability(to, "mapped-ram", true);
> +
> +    return NULL;
> +}
> +
>  static void __test_precopy_rdma_plain(MigrateCommon *args, bool ipv6)
>  {
>      char buffer[128] = {};
> @@ -185,6 +192,7 @@ static void __test_precopy_rdma_plain(MigrateCommon *args, bool ipv6)
>  
>      args->listen_uri = uri;
>      args->connect_uri = uri;
> +    args->start_hook = migrate_rdma_set_caps;
>  
>      test_precopy_common(args);
>  }
> -- 
> 2.51.0
> 

-- 
Peter Xu
Re: [PATCH 08/13] migration: Thread-ify precopy vmstate load process
Posted by Fabiano Rosas 3 weeks, 6 days ago
Peter Xu <peterx@redhat.com> writes:

> On Thu, Jan 08, 2026 at 05:27:37PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > Migration module was there for 10+ years.  Initially, it was in most cases
>> > based on coroutines.  As more features were added into the framework, like
>> > postcopy, multifd, etc.. it became a mixture of threads and coroutines.
>> >
>> > I'm guessing coroutines just can't fix all issues that migration want to
>> > resolve.
>> >
>> > After all these years, migration is now heavily based on a threaded model.
>> >
>> > Now there's still a major part of migration framework that is still not
>> > thread-based, which is precopy load.  We do load in a separate thread in
>> > postcopy since the 1st day postcopy was introduced, however that requires a
>> > separate state transition from precopy loading all devices first, which
>> > still happens in the main thread of a coroutine.
>> >
>> > This patch tries to move the migration incoming side to be run inside a
>> > separate thread (mig/dst/main) just like the src (mig/src/main).  The
>> > entrance to be migration_incoming_thread().
>> >
>> > Quite a few things are needed to make it fly..  One note here is we need to
>> > change all these things in one patch to not break anything.  The other way
>> > to do this is add code to make all paths (that this patch touched) be ready
>> > for either coroutine or thread.  That may cause confusions in another way.
>> > So reviewers, please take my sincere apology on the hardness of reviewing
>> > this patch: it covers a few modules at the same time, and with some risky
>> > changes.
>> >
>> > BQL Analysis
>> > ============
>> >
>> > Firstly, when moving it over to the thread, it means the thread cannot take
>> > BQL during the whole process of loading anymore, because otherwise it can
>> > block main thread from using the BQL for all kinds of other concurrent
>> > tasks (for example, processing QMP / HMP commands).
>> >
>> > Here the first question to ask is: what needs BQL during precopy load, and
>> > what doesn't?
>> >
>> 
>> I just noticed that the BQL held at process_incoming_migration_co is
>> also responsible for stopping qmp_migrate_set_capabilities from being
>> dispatched.
>
> I don't know if it is by design, or even if it will be guaranteed to work..
>

Regardless, we shouldn't rely on the BQL for this. The BQL should be left
as a last resort for things that interact across subsystems. If someone is
issuing a migration command during a migration, the migration code is
exquisitely positioned to handle that itself.

> Consider the migration incoming rocoutine runs into qemu_get_byte(), and
> then proactively yield the migration coroutine (qemu_coroutine_yield())
> when the incoming port is blocked on read..
>
> AFAIU, a proper fix for that (note, this will currently break tests) is:
>
> bool migration_is_running(void)
>  {
> -    MigrationState *s = current_migration;
> +    MigrationStatus state;
>  
> -    if (!s) {
> -        return false;
> +    if (runstate_check(RUN_STATE_INMIGRATE)) {
> +        MigrationIncomingState *mis = migration_incoming_get_current();
> +
> +        if (!mis) {
> +            return false;
> +        }
> +
> +        state = mis->state;
> +    } else {
> +        MigrationState *s = migrate_get_current();
> +
> +        if (!s) {
> +            return false;
> +        }
> +
> +        state = s->state;
>      }
>  
> -    switch (s->state) {
> +    switch (state) {
>      case MIGRATION_STATUS_ACTIVE:
>      case MIGRATION_STATUS_POSTCOPY_DEVICE:
>      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
>

LGTM

>> 
>> Any point during incoming migration when BQL is unlocked we have a
>> window where a capability could be changed. Same for parameters, for
>> that matter.
>> 
>> To make matters worse, the -incoming cmdline will trigger
>> qmp_migrate_incoming->...->migration_transport_compatible early on, but
>> until the channels finally connect and process_incoming_migration_co
>> starts it's possible to just change a capability in an incompatible way
>> and the transport will never be validated again.
>
> Right.  Above should fix it, but I believe it also means after "-incoming
> tcp:xxx" (or anything not "defer") we should forbid changing migration caps
> or params on destination.
>

Parameters are never forbidden, right? And we cannot forbid them with
is_running because some parameters are allowed to be changed while
running.

I feel we should have a more fine-grained way of saying "this option
cannot be set at this moment", instead of just using the state as a proxy.
States can change, while the fact that from a certain point on certain
options should not be touched anymore doesn't change.

Maybe a little infra like bdrv_op_is_blocked, i.e., a list of blocked
operations. It could be set in qmp_migrate and checked in
qmp_set_parameters/caps.
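
A tiny sketch of what that might look like (every name below is
hypothetical, only loosely modelled on bdrv_op_is_blocked()):

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum {
        MIG_OP_SET_CAPABILITIES,
        MIG_OP_SET_PARAMETERS,
        MIG_OP__MAX,
    } MigOpType;

    /* NULL means unblocked; otherwise the string says why. */
    static const char *mig_op_blockers[MIG_OP__MAX];

    /* Would be called from qmp_migrate()/qmp_migrate_incoming(). */
    static void mig_op_block(MigOpType op, const char *reason)
    {
        mig_op_blockers[op] = reason;
    }

    static void mig_op_unblock(MigOpType op)
    {
        mig_op_blockers[op] = NULL;
    }

    /* Would be checked from qmp_migrate_set_parameters()/caps. */
    static bool mig_op_is_blocked(MigOpType op, const char **reason)
    {
        *reason = mig_op_blockers[op];
        return *reason != NULL;
    }

    int main(void)
    {
        const char *reason;

        mig_op_block(MIG_OP_SET_CAPABILITIES, "migration in progress");
        if (mig_op_is_blocked(MIG_OP_SET_CAPABILITIES, &reason)) {
            printf("blocked: %s\n", reason);
        }
        mig_op_unblock(MIG_OP_SET_CAPABILITIES);
        return 0;
    }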

> As discussed above, that'll at least break our qtests.  But frankly
> speaking I think that's the right thing to do..  I hope libvirt always
> works with "defer" and never update any caps/params after QMP
> migrate_incoming.
>
> So I wonder if I should continue with above patch, and then fix our qtests.
> Your work from the other "merge caps+params" might also work here,
> actually, if we make sure everything will be set alone with the QMP
> migrate_incoming single command.
>

For incoming, yes. And this is maybe a point in favor of adding the
'config'.

For outgoing, there's still the point I mentioned above about how to
restrict _some_ options to be allowed at runtime and others not.

> Let me know your initial thoughts, then I'll see what I can do..
>

We should fix the bug, I think your patch is good for that.

Although this kind of overlaps with some things we've been discussing
with Prasad. I'd be super happy if the code magically stopped using
QAPI's MigrationStatus for internal tracking of migration state and
blocking of commands and so on.

Whatever comes first =)

---
Side note, did we ever discuss something like this?

struct MigrationState {
   <state>
   union {
     <outgoing>
     <incoming>
   }
}

there's so much stuff in these structs...
Re: [PATCH 08/13] migration: Thread-ify precopy vmstate load process
Posted by Peter Xu 3 weeks, 6 days ago
On Mon, Jan 12, 2026 at 04:04:12PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Thu, Jan 08, 2026 at 05:27:37PM -0300, Fabiano Rosas wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >> 
> >> > Migration module was there for 10+ years.  Initially, it was in most cases
> >> > based on coroutines.  As more features were added into the framework, like
> >> > postcopy, multifd, etc.. it became a mixture of threads and coroutines.
> >> >
> >> > I'm guessing coroutines just can't fix all issues that migration want to
> >> > resolve.
> >> >
> >> > After all these years, migration is now heavily based on a threaded model.
> >> >
> >> > Now there's still a major part of migration framework that is still not
> >> > thread-based, which is precopy load.  We do load in a separate thread in
> >> > postcopy since the 1st day postcopy was introduced, however that requires a
> >> > separate state transition from precopy loading all devices first, which
> >> > still happens in the main thread of a coroutine.
> >> >
> >> > This patch tries to move the migration incoming side to be run inside a
> >> > separate thread (mig/dst/main) just like the src (mig/src/main).  The
> >> > entrance to be migration_incoming_thread().
> >> >
> >> > Quite a few things are needed to make it fly..  One note here is we need to
> >> > change all these things in one patch to not break anything.  The other way
> >> > to do this is add code to make all paths (that this patch touched) be ready
> >> > for either coroutine or thread.  That may cause confusions in another way.
> >> > So reviewers, please take my sincere apology on the hardness of reviewing
> >> > this patch: it covers a few modules at the same time, and with some risky
> >> > changes.
> >> >
> >> > BQL Analysis
> >> > ============
> >> >
> >> > Firstly, when moving it over to the thread, it means the thread cannot take
> >> > BQL during the whole process of loading anymore, because otherwise it can
> >> > block main thread from using the BQL for all kinds of other concurrent
> >> > tasks (for example, processing QMP / HMP commands).
> >> >
> >> > Here the first question to ask is: what needs BQL during precopy load, and
> >> > what doesn't?
> >> >
> >> 
> >> I just noticed that the BQL held at process_incoming_migration_co is
> >> also responsible for stopping qmp_migrate_set_capabilities from being
> >> dispatched.
> >
> > I don't know if it is by design, or even if it will be guaranteed to work..
> >
> 
> Regardless, we shouldn't rely on the BQL for this. The BQL should be
> left as last resort for things that interact across subsystems. If
> someone is issuing a migration command during a migration, the migration
> code is exquisitely positioned to handle that itself.

Yes I agree.

> 
> > Consider the migration incoming rocoutine runs into qemu_get_byte(), and
> > then proactively yield the migration coroutine (qemu_coroutine_yield())
> > when the incoming port is blocked on read..
> >
> > AFAIU, a proper fix for that (note, this will currently break tests) is:
> >
> > bool migration_is_running(void)
> >  {
> > -    MigrationState *s = current_migration;
> > +    MigrationStatus state;
> >  
> > -    if (!s) {
> > -        return false;
> > +    if (runstate_check(RUN_STATE_INMIGRATE)) {
> > +        MigrationIncomingState *mis = migration_incoming_get_current();
> > +
> > +        if (!mis) {
> > +            return false;
> > +        }
> > +
> > +        state = mis->state;
> > +    } else {
> > +        MigrationState *s = migrate_get_current();
> > +
> > +        if (!s) {
> > +            return false;
> > +        }
> > +
> > +        state = s->state;
> >      }
> >  
> > -    switch (s->state) {
> > +    switch (state) {
> >      case MIGRATION_STATUS_ACTIVE:
> >      case MIGRATION_STATUS_POSTCOPY_DEVICE:
> >      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
> >
> 
> LGTM
> 
> >> 
> >> Any point during incoming migration when BQL is unlocked we have a
> >> window where a capability could be changed. Same for parameters, for
> >> that matter.
> >> 
> >> To make matters worse, the -incoming cmdline will trigger
> >> qmp_migrate_incoming->...->migration_transport_compatible early on, but
> >> until the channels finally connect and process_incoming_migration_co
> >> starts it's possible to just change a capability in an incompatible way
> >> and the transport will never be validated again.
> >
> > Right.  Above should fix it, but I believe it also means after "-incoming
> > tcp:xxx" (or anything not "defer") we should forbid changing migration caps
> > or params on destination.
> >
> 
> Parameters are never forbidden, right? And we cannot forbid them with
> is_running because some parameters are allowed to be changed while
> running.

Right, my above statement was definitely inaccurate.

After merging caps and params we only have params.  We should only allow
some params to be changed anytime.  Most of the params (including caps)
should not allow changes while migration is live, on either src or dst.

> 
> I feel we should have a more fine grained way of saying "this option
> cannot be set at this moment", instead of just using the state as a
> proxy. States can change, while the fact that from a certain point on,
> certain options should not be touched anymore doesn't change.

IIUC that's what migration_is_running() is about?

It's a matter of whether such finer granularity is necessary for us. Let's
assume the simple scheme is:

  (a) Some params are allowed to be modified anytime, for example,
      downtime-limit, max-bandwidth, max-postcopy-bandwidth, etc.

  (b) All the rest of the params and all capabilities are not allowed to be
      modified when migration_is_running() returns true (my fixed version
      above)

The whitelisted (a) should really be the smallest set of params, each
justified one by one, or else it should fall into (b).

My hope is that the simple scheme already works for us.  After merging
caps, it's only about some params that can be set anytime; the rest can
only be set when migration is not running.
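
Something as simple as the following could implement that scheme (the
names are made up for the sketch; migration_is_running() stands for the
fixed version above):

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* Stub standing in for the fixed migration_is_running() above. */
    static bool migration_is_running(void)
    {
        return true;
    }

    /* (a) the few params tunable anytime */
    static const char * const runtime_ok[] = {
        "downtime-limit", "max-bandwidth", "max-postcopy-bandwidth",
    };

    /* (b) everything else is refused while migration is running */
    static bool param_change_allowed(const char *name)
    {
        size_t i;

        if (!migration_is_running()) {
            return true;
        }
        for (i = 0; i < sizeof(runtime_ok) / sizeof(runtime_ok[0]); i++) {
            if (!strcmp(name, runtime_ok[i])) {
                return true;
            }
        }
        return false;
    }

    int main(void)
    {
        /* Mid-migration: a tunable param is allowed, others refused. */
        return param_change_allowed("downtime-limit") &&
               !param_change_allowed("multifd-channels") ? 0 : 1;
    }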

> 
> Maybe a little infra like bdrv_op_is_blocked, i.e, a list of blocked
> operations. It could be set in qmp_migrate and checked in
> qmp_set_parameters/caps.

Any example you can think of where the above simple scheme won't work for us?

> 
> > As discussed above, that'll at least break our qtests.  But frankly
> > speaking I think that's the right thing to do..  I hope libvirt always
> > works with "defer" and never update any caps/params after QMP
> > migrate_incoming.
> >
> > So I wonder if I should continue with above patch, and then fix our qtests.
> > Your work from the other "merge caps+params" might also work here,
> > actually, if we make sure everything will be set alone with the QMP
> > migrate_incoming single command.
> >
> 
> For incoming, yes. And this is maybe a point in favor of adding the
> 'config'.
> 
> For outgoing, there's still the point I mentioned above about how to
> restrict _some_ options to be allowed at runtime and others not.
> 
> > Let me know your initial thoughts, then I'll see what I can do..
> >
> 
> We should fix the bug, I think your patch is good for that.
> 
> Although this kind of overlaps with some things we've been discussing
> with Prasad. I'd be super happy if the code magically stopped using
> QAPI's MigrationStatus for internal tracking of migration state and
> blocking of commands and so on.
> 
> Whatever comes first =)

I didn't follow as closely on the discussion there.  I don't know if
changing MigrationStatus is a good idea..  we should have some really good
reason to make libvirt and all mgmt need a change.. but maybe I misread the
discussion.  I can wait until I read something solid if Prasad is going to
propose something.

> 
> ---
> Side note, did we ever discuss something like this?
> 
> struct MigrationState {
>    <state>
>    union {
>      <outgoing>
>      <incoming>
>    }
> }
> 
> there's so much stuff in these structs...

Yeah..  When merging the states, we'll also need to be careful with
overlapping fields, e.g. expecting a COMPLETED state (from a completed
incoming migration) when starting a new outgoing migration.

We can likely leave this for later.

-- 
Peter Xu
Re: [PATCH 08/13] migration: Thread-ify precopy vmstate load process
Posted by Fabiano Rosas 3 weeks, 6 days ago
Peter Xu <peterx@redhat.com> writes:

> On Mon, Jan 12, 2026 at 04:04:12PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Thu, Jan 08, 2026 at 05:27:37PM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <peterx@redhat.com> writes:
>> >> 
>> >> > Migration module was there for 10+ years.  Initially, it was in most cases
>> >> > based on coroutines.  As more features were added into the framework, like
>> >> > postcopy, multifd, etc.. it became a mixture of threads and coroutines.
>> >> >
>> >> > I'm guessing coroutines just can't fix all issues that migration want to
>> >> > resolve.
>> >> >
>> >> > After all these years, migration is now heavily based on a threaded model.
>> >> >
>> >> > Now there's still a major part of migration framework that is still not
>> >> > thread-based, which is precopy load.  We do load in a separate thread in
>> >> > postcopy since the 1st day postcopy was introduced, however that requires a
>> >> > separate state transition from precopy loading all devices first, which
>> >> > still happens in the main thread of a coroutine.
>> >> >
>> >> > This patch tries to move the migration incoming side to be run inside a
>> >> > separate thread (mig/dst/main) just like the src (mig/src/main).  The
>> >> > entrance to be migration_incoming_thread().
>> >> >
>> >> > Quite a few things are needed to make it fly..  One note here is we need to
>> >> > change all these things in one patch to not break anything.  The other way
>> >> > to do this is add code to make all paths (that this patch touched) be ready
>> >> > for either coroutine or thread.  That may cause confusions in another way.
>> >> > So reviewers, please take my sincere apology on the hardness of reviewing
>> >> > this patch: it covers a few modules at the same time, and with some risky
>> >> > changes.
>> >> >
>> >> > BQL Analysis
>> >> > ============
>> >> >
>> >> > Firstly, when moving it over to the thread, it means the thread cannot take
>> >> > BQL during the whole process of loading anymore, because otherwise it can
>> >> > block main thread from using the BQL for all kinds of other concurrent
>> >> > tasks (for example, processing QMP / HMP commands).
>> >> >
>> >> > Here the first question to ask is: what needs BQL during precopy load, and
>> >> > what doesn't?
>> >> >
>> >> 
>> >> I just noticed that the BQL held at process_incoming_migration_co is
>> >> also responsible for stopping qmp_migrate_set_capabilities from being
>> >> dispatched.
>> >
>> > I don't know if it is by design, or even if it will be guaranteed to work..
>> >
>> 
>> Regardless, we shouldn't rely on the BQL for this. The BQL should be
>> left as last resort for things that interact across subsystems. If
>> someone is issuing a migration command during a migration, the migration
>> code is exquisitely positioned to handle that itself.
>
> Yes I agree.
>
>> 
>> > Consider the migration incoming rocoutine runs into qemu_get_byte(), and
>> > then proactively yield the migration coroutine (qemu_coroutine_yield())
>> > when the incoming port is blocked on read..
>> >
>> > AFAIU, a proper fix for that (note, this will currently break tests) is:
>> >
>> > bool migration_is_running(void)
>> >  {
>> > -    MigrationState *s = current_migration;
>> > +    MigrationStatus state;
>> >  
>> > -    if (!s) {
>> > -        return false;
>> > +    if (runstate_check(RUN_STATE_INMIGRATE)) {
>> > +        MigrationIncomingState *mis = migration_incoming_get_current();
>> > +
>> > +        if (!mis) {
>> > +            return false;
>> > +        }
>> > +
>> > +        state = mis->state;
>> > +    } else {
>> > +        MigrationState *s = migrate_get_current();
>> > +
>> > +        if (!s) {
>> > +            return false;
>> > +        }
>> > +
>> > +        state = s->state;
>> >      }
>> >  
>> > -    switch (s->state) {
>> > +    switch (state) {
>> >      case MIGRATION_STATUS_ACTIVE:
>> >      case MIGRATION_STATUS_POSTCOPY_DEVICE:
>> >      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
>> >
>> 
>> LGTM
>> 
>> >> 
>> >> Any point during incoming migration when BQL is unlocked we have a
>> >> window where a capability could be changed. Same for parameters, for
>> >> that matter.
>> >> 
>> >> To make matters worse, the -incoming cmdline will trigger
>> >> qmp_migrate_incoming->...->migration_transport_compatible early on, but
>> >> until the channels finally connect and process_incoming_migration_co
>> >> starts it's possible to just change a capability in an incompatible way
>> >> and the transport will never be validated again.
>> >
>> > Right.  Above should fix it, but I believe it also means after "-incoming
>> > tcp:xxx" (or anything not "defer") we should forbid changing migration caps
>> > or params on destination.
>> >
>> 
>> Parameters are never forbidden, right? And we cannot forbid them with
>> is_running because some parameters are allowed to be changed while
>> running.
>
> Right, my above statement was definitely inaccurate.
>
> After merging caps and params we only have params.  We should only allow
> some params to be changed anytime.  Most of the params (including caps)
> should not allow changing during migration live on either src/dst.
>
>> 
>> I feel we should have a more fine grained way of saying "this option
>> cannot be set at this moment", instead of just using the state as a
>> proxy. States can change, while the fact that from a certain point on,
>> certain options should not be touched anymore doesn't change.
>
> IIUC that's what migration_is_running() about?
>

At a high level, yes, but I think that's actually a downside of
migration_is_running. It's not explicit.

E.g.: qmp_migrate -> migration_transport_compatible is the first time
capabilities are checked in the code. Guess what migration_is_running
returns at that point? (it's false)

If there's ever a change in BQL behavior, there'll be a bug and we'll
see another patch extending the scope of migration_is_running, either
adding existing states to it, or as Prasad proposed, adding a new state
that is set a little later/earlier.

What we technically want is to stop accepting new capabilities as soon
as we're about to use them for the first time. Not as soon as (state ==
ACTIVE || SETUP || ...) which is a more high level definition.

> It's a matter of if such finer granularity is necessary for us. Let's
> assume the simple scheme is:
>
>   (a) Some params are allowed to be modified anytime, examples,
>       downtime-limit, max-bandwidth, max-postcopy-bandwidth, etc.
>

What about during migration_cleanup() or migrate-cancel? Won't there
ever be something in the path of setting these params that would require
a "healthy" migration?

>   (b) All the rest params and all capabilities are not allowed to be modified
>       when migration_is_running() returns true (my fixed version above)
>
> The whitelisted (a) should really be the smallest set of params, and
> justified one by one or it should fall into (b).
>
> My hope is the simple scheme should work for us already.  After merging
> caps, it's only about some params that can be set anytime, rest params can
> only be set when migration is not running.
>

Sure, I'd just like to avoid adding another set of "if
(s->parameters->has_foobar)" if possible.

>> 
>> Maybe a little infra like bdrv_op_is_blocked, i.e., a list of blocked
>> operations. It could be set in qmp_migrate and checked in
>> qmp_set_parameters/caps.
>
> Any example you can think of where the above simple scheme won't work for us?
>

It works. I think it's fragile due to:

- the reliance on BQL to avoid concurrent invocations;

- the possible addition of new states for unrelated reasons
  risks regressions and it's just extra work to make sure everything is
  still blocked at the time it needs to be blocked.

- any bugs need to be fixed by moving states around, which is dubious
  since the migration flow itself (which the states should represent)
  has not changed in a while.

  We're still seeing patches such as dc487044d5 ("migration: Make
  migration_has_failed() work even for CANCELLING") and the recent patch
  from Prasad. So there is something changing elsewhere and exposing
  these bugs and it's not immediately clear what that something is (or
  else we would have avoided the bug).

To be clear, I'm not asking we solve these issues at this moment, it's
fine if we go forward with the simple scheme. I'm just trying to clarify
what I think the problems are.
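
For reference, the bdrv_op_is_blocked-like infra mentioned above could look
roughly like this (hypothetical names, just to show the shape of it):

typedef enum {
    MIG_OP_SET_CAPABILITY,
    MIG_OP_SET_PARAMETER,
    MIG_OP__MAX,
} MigrationOp;

static Error *mig_op_blockers[MIG_OP__MAX];

/* e.g. set from qmp_migrate() / qmp_migrate_incoming() */
static void migration_op_block(MigrationOp op, const char *reason)
{
    error_setg(&mig_op_blockers[op], "%s", reason);
}

/* e.g. checked from the capability/parameter setters */
static bool migration_op_is_blocked(MigrationOp op, Error **errp)
{
    if (mig_op_blockers[op]) {
        error_propagate(errp, error_copy(mig_op_blockers[op]));
        return true;
    }
    return false;
}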

>> 
>> > As discussed above, that'll at least break our qtests.  But frankly
>> > speaking I think that's the right thing to do..  I hope libvirt always
>> > works with "defer" and never updates any caps/params after QMP
>> > migrate_incoming.
>> >
>> > So I wonder if I should continue with above patch, and then fix our qtests.
>> > Your work from the other "merge caps+params" might also work here,
>> > actually, if we make sure everything will be set alone with the QMP
>> > migrate_incoming single command.
>> >
>> 
>> For incoming, yes. And this is maybe a point in favor of adding the
>> 'config'.
>> 
>> For outgoing, there's still the point I mentioned above about how to
>> restrict _some_ options to be allowed at runtime and others not.
>> 
>> > Let me know your initial thoughts, then I'll see what I can do..
>> >
>> 
>> We should fix the bug, I think your patch is good for that.
>> 
>> Although this kind of overlaps with some things we've been discussing
>> with Prasad. I'd be super happy if the code magically stopped using
>> QAPI's MigrationStatus for internal tracking of migration state and
>> blocking of commands and so on.
>> 
>> Whatever comes first =)
>
> I didn't follow as closely on the discussion there.  I don't know if
> changing MigrationStatus is a good idea..  we should have some really good
> reason to make libvirt and all mgmt need a change.. but maybe I misread the
> discussion.  I can wait until I read something solid if Prasad is going to
> propose something.
>

We're moving towards decoupling MigrationStatus from these ordinary code
checks. MigrationStatus should be used for informing management about a
relevant change in the migration's stage of progress.

As it stands, we're either "exposing implementation details" or
"inadvertently changing the API". Both are undesirable.

Of course, we might introduce the "internal state is out of sync with
externally visible states" issue, but hopefully we'll catch that during
review. =)

>> 
>> ---
>> Side note, did we ever discuss something like this?
>> 
>> struct MigrationState {
>>    <state>
>>    union {
>>      <outgoing>
>>      <incoming>
>>    }
>> }
>> 
>> there's so much stuff in these structs...
>
> Yeah..  When merging state, we'll also need to be careful about overlapping
> fields, e.g. expecting a COMPLETED state (from a completed incoming
> migration) when starting a new outgoing migration.
>
> We can likely leave this for later.

No worries, I'm just collecting your opinions on it.
Re: [PATCH 08/13] migration: Thread-ify precopy vmstate load process
Posted by Peter Xu 3 weeks, 5 days ago
On Tue, Jan 13, 2026 at 10:04:02AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Mon, Jan 12, 2026 at 04:04:12PM -0300, Fabiano Rosas wrote:
> >> Peter Xu <peterx@redhat.com> writes:
> >> 
> >> > On Thu, Jan 08, 2026 at 05:27:37PM -0300, Fabiano Rosas wrote:
> >> >> Peter Xu <peterx@redhat.com> writes:
> >> >> 
> >> >> > Migration module was there for 10+ years.  Initially, it was in most cases
> >> >> > based on coroutines.  As more features were added into the framework, like
> >> >> > postcopy, multifd, etc.. it became a mixture of threads and coroutines.
> >> >> >
> >> >> > I'm guessing coroutines just can't fix all issues that migration want to
> >> >> > resolve.
> >> >> >
> >> >> > After all these years, migration is now heavily based on a threaded model.
> >> >> >
> >> >> > Now there's still a major part of migration framework that is still not
> >> >> > thread-based, which is precopy load.  We do load in a separate thread in
> >> >> > postcopy since the 1st day postcopy was introduced, however that requires a
> >> >> > separate state transition from precopy loading all devices first, which
> >> >> > still happens in the main thread of a coroutine.
> >> >> >
> >> >> > This patch tries to move the migration incoming side to be run inside a
> >> >> > separate thread (mig/dst/main) just like the src (mig/src/main).  The
> >> >> > entrance to be migration_incoming_thread().
> >> >> >
> >> >> > Quite a few things are needed to make it fly..  One note here is we need to
> >> >> > change all these things in one patch to not break anything.  The other way
> >> >> > to do this is add code to make all paths (that this patch touched) be ready
> >> >> > for either coroutine or thread.  That may cause confusions in another way.
> >> >> > So reviewers, please take my sincere apology on the hardness of reviewing
> >> >> > this patch: it covers a few modules at the same time, and with some risky
> >> >> > changes.
> >> >> >
> >> >> > BQL Analysis
> >> >> > ============
> >> >> >
> >> >> > Firstly, when moving it over to the thread, it means the thread cannot take
> >> >> > BQL during the whole process of loading anymore, because otherwise it can
> >> >> > block main thread from using the BQL for all kinds of other concurrent
> >> >> > tasks (for example, processing QMP / HMP commands).
> >> >> >
> >> >> > Here the first question to ask is: what needs BQL during precopy load, and
> >> >> > what doesn't?
> >> >> >
> >> >> 
> >> >> I just noticed that the BQL held at process_incoming_migration_co is
> >> >> also responsible for stopping qmp_migrate_set_capabilities from being
> >> >> dispatched.
> >> >
> >> > I don't know if it is by design, or even if it will be guaranteed to work..
> >> >
> >> 
> >> Regardless, we shouldn't rely on the BQL for this. The BQL should be
> >> left as last resort for things that interact across subsystems. If
> >> someone is issuing a migration command during a migration, the migration
> >> code is exquisitely positioned to handle that itself.
> >
> > Yes I agree.
> >
> >> 
>> > Consider the migration incoming coroutine running into qemu_get_byte(), and
>> > then proactively yielding (qemu_coroutine_yield())
> >> > when the incoming port is blocked on read..
> >> >
> >> > AFAIU, a proper fix for that (note, this will currently break tests) is:
> >> >
> >> > bool migration_is_running(void)
> >> >  {
> >> > -    MigrationState *s = current_migration;
> >> > +    MigrationStatus state;
> >> >  
> >> > -    if (!s) {
> >> > -        return false;
> >> > +    if (runstate_check(RUN_STATE_INMIGRATE)) {
> >> > +        MigrationIncomingState *mis = migration_incoming_get_current();
> >> > +
> >> > +        if (!mis) {
> >> > +            return false;
> >> > +        }
> >> > +
> >> > +        state = mis->state;
> >> > +    } else {
> >> > +        MigrationState *s = migrate_get_current();
> >> > +
> >> > +        if (!s) {
> >> > +            return false;
> >> > +        }
> >> > +
> >> > +        state = s->state;
> >> >      }
> >> >  
> >> > -    switch (s->state) {
> >> > +    switch (state) {
> >> >      case MIGRATION_STATUS_ACTIVE:
> >> >      case MIGRATION_STATUS_POSTCOPY_DEVICE:
> >> >      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
> >> >
> >> 
> >> LGTM
> >> 
> >> >> 
> >> >> At any point during incoming migration when the BQL is unlocked we have
> >> >> a window where a capability could be changed. Same for parameters, for
> >> >> that matter.
> >> >> 
> >> >> To make matters worse, the -incoming cmdline will trigger
> >> >> qmp_migrate_incoming->...->migration_transport_compatible early on, but
> >> >> until the channels finally connect and process_incoming_migration_co
> >> >> starts it's possible to just change a capability in an incompatible way
> >> >> and the transport will never be validated again.
> >> >
> >> > Right.  Above should fix it, but I believe it also means after "-incoming
> >> > tcp:xxx" (or anything not "defer") we should forbid changing migration caps
> >> > or params on destination.
> >> >
> >> 
> >> Parameters are never forbidden, right? And we cannot forbid them with
> >> is_running because some parameters are allowed to be changed while
> >> running.
> >
> > Right, my above statement was definitely inaccurate.
> >
> > After merging caps and params we only have params.  We should only allow
> > some params to be changed anytime.  Most of the params (including caps)
> > should not be allowed to change while migration is live on either src/dst.
> >
> >> 
> >> I feel we should have a more fine grained way of saying "this option
> >> cannot be set at this moment", instead of just using the state as a
> >> proxy. States can change, while the fact that from a certain point on,
> >> certain options should not be touched anymore doesn't change.
> >
> > IIUC that's what migration_is_running() is about?
> >
> 
> At a high level, yes, but I think that's actually a downside of
> migration_is_running. It's not explicit.
> 
> E.g.: qmp_migrate -> migration_transport_compatible is the first time
> capabilities are checked in the code. Guess what migration_is_running
> returns at that point? (it's false)
> 
> If there's ever a change in BQL behavior, there'll be a bug and we'll
> see another patch extending the scope of migration_is_running, either
> adding existing states to it, or as Prasad proposed, adding a new state
> that is set a little later/earlier.
> 
> What we technically want is to stop accepting new capabilities as soon
> as we're about to use them for the first time. Not as soon as (state ==
> ACTIVE || SETUP || ...), which is a more high-level definition.

Isn't migration_incoming_state_setup() invoked almost immediately (after the
migration_transport_compatible() check passes), which will start to make
the retval of migration_is_running() reflect the fact?

IMHO as long as QMP handlers are run with some lock held (currently, the BQL),
such a layout should keep working?  That will make sure QMP "migrate" and
"migrate_set_parameters" are serialized, which is, AFAICT, all we need.

> 
> > It's a matter of whether such finer granularity is necessary for us. Let's
> > assume the simple scheme is:
> >
> >   (a) Some params are allowed to be modified anytime, examples,
> >       downtime-limit, max-bandwidth, max-postcopy-bandwidth, etc.
> >
> 
> What about during migration_cleanup() or migrate-cancel? Won't there
> ever be something in the path of setting these params that would require
> a "healthy" migration?

If we introduce FAILING, then together with CANCELLING I think we should cover
all races against migration_cleanup()?  As long as we have both FAILING
and CANCELLING return true from migration_is_running().
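
i.e. a sketch of how the switch could look, assuming the proposed FAILING
state exists (it is hypothetical, not merged at this point):

    switch (state) {
    case MIGRATION_STATUS_ACTIVE:
    case MIGRATION_STATUS_POSTCOPY_DEVICE:
    case MIGRATION_STATUS_POSTCOPY_ACTIVE:
    /* ... the other "live" states ... */
    case MIGRATION_STATUS_CANCELLING:
    case MIGRATION_STATUS_FAILING:  /* hypothetical */
        return true;
    default:
        return false;
    }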

> 
> >   (b) All the rest params and all capabilities are not allowed to be modified
> >       when migration_is_running() returns true (my fixed version above)
> >
> > The whitelisted (a) should really be the smallest set of params, and
> > justified one by one or it should fall into (b).
> >
> > My hope is the simple scheme should work for us already.  After merging
> > caps, it's only about some params that can be set anytime, rest params can
> > only be set when migration is not running.
> >
> 
> Sure, I'd just like to avoid adding another set of "if
> (s->parameters->has_foobar)" if possible.
> 
> >> 
> >> Maybe a little infra like bdrv_op_is_blocked, i.e., a list of blocked
> >> operations. It could be set in qmp_migrate and checked in
> >> qmp_set_parameters/caps.
> >
> > Any example you can think of where the above simple scheme won't work for us?
> >
> 
> It works. I think it's fragile due to:
> 
> - the reliance on BQL to avoid concurrent invocations;
> 
> - the possible addition of new states for unrelated reasons
>   risks regressions and it's just extra work to make sure everything is
>   still blocked at the time it needs to be blocked.

Yes, we'll need to make sure that when new states are introduced they are
reported correctly in migration_is_running().  I think it's not an issue
because we need to do it right whether or not we rely on the function to ban
updates on caps/params.. It must always be done right, as
migration_is_running() is also used in other cases.

> 
> - any bugs need to be fixed by moving states around, which is dubious
>   since the migration flow itself (which the states should represent)
>   has not changed in a while.
> 
>   We're still seeing patches such as dc487044d5 ("migration: Make
>   migration_has_failed() work even for CANCELLING") and the recent patch
>   from Prasad. So there is something changing elsewhere and exposing
>   these bugs and it's not immediately clear what that something is (or
>   else we would have avoided the bug).
> 
> To be clear, I'm not asking we solve these issues at this moment, it's
> fine if we go forward with the simple scheme. I'm just trying to clarify
> what I think the problems are.
> 
> >> 
> >> > As discussed above, that'll at least break our qtests.  But frankly
> >> > speaking I think that's the right thing to do..  I hope libvirt always
> >> > works with "defer" and never updates any caps/params after QMP
> >> > migrate_incoming.
> >> >
> >> > So I wonder if I should continue with above patch, and then fix our qtests.
> >> > Your work from the other "merge caps+params" might also work here,
> >> > actually, if we make sure everything will be set together with the single
> >> > QMP migrate_incoming command.
> >> >
> >> 
> >> For incoming, yes. And this is maybe a point in favor of adding the
> >> 'config'.
> >> 
> >> For outgoing, there's still the point I mentioned above about how to
> >> restrict _some_ options to be allowed at runtime and others not.
> >> 
> >> > Let me know your initial thoughts, then I'll see what I can do..
> >> >
> >> 
> >> We should fix the bug, I think your patch is good for that.
> >> 
> >> Although this kind of overlaps with some things we've been discussing
> >> with Prasad. I'd be super happy if the code magically stopped using
> >> QAPI's MigrationStatus for internal tracking of migration state and
> >> blocking of commands and so on.
> >> 
> >> Whatever comes first =)
> >
> > I didn't follow as closely on the discussion there.  I don't know if
> > changing MigrationStatus is a good idea..  we should have some really good
> > reason to make libvirt and all mgmt need a change.. but maybe I misread the
> > discussion.  I can wait until I read something solid if Prasad is going to
> > propose something.
> >
> 
> We're moving towards decoupling MigrationStatus from these ordinary code
> checks. MigrationStatus should be used for informing management about a
> relevant change in the migration's stage of progress.
> 
> As it stands, we're either "exposing implementation details" or
> "inadvertently changing the API". Both are undesirable.
> 
> Of course, we might introduce the "internal state is out of sync with
> externally visible states" issue, but hopefully we'll catch that during
> review. =)

I need to confess I'm still a bit lost on what this is about, and what
problem we're solving here.. But I can always wait and read the patches
when they come up.. :)

> 
> >> 
> >> ---
> >> Side note, did we ever discuss something like this?
> >> 
> >> struct MigrationState {
> >>    <state>
> >>    union {
> >>      <outgoing>
> >>      <incoming>
> >>    }
> >> }
> >> 
> >> there's so much stuff in these structs...
> >
> > Yeah..  When merging state, we'll also need to be careful about overlapping
> > fields, e.g. expecting a COMPLETED state (from a completed incoming
> > migration) when starting a new outgoing migration.
> >
> > We can likely leave this for later.
> 
> No worries, I'm just collecting your opinions on it.
> 

-- 
Peter Xu
Re: [PATCH 08/13] migration: Thread-ify precopy vmstate load process
Posted by Fabiano Rosas 3 weeks, 2 days ago
Peter Xu <peterx@redhat.com> writes:

> On Tue, Jan 13, 2026 at 10:04:02AM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Mon, Jan 12, 2026 at 04:04:12PM -0300, Fabiano Rosas wrote:
>> >> Peter Xu <peterx@redhat.com> writes:
>> >> 
>> >> > On Thu, Jan 08, 2026 at 05:27:37PM -0300, Fabiano Rosas wrote:
>> >> >> Peter Xu <peterx@redhat.com> writes:
>> >> >> 
>> >> >> > Migration module was there for 10+ years.  Initially, it was in most cases
>> >> >> > based on coroutines.  As more features were added into the framework, like
>> >> >> > postcopy, multifd, etc.. it became a mixture of threads and coroutines.
>> >> >> >
>> >> >> > I'm guessing coroutines just can't fix all issues that migration want to
>> >> >> > resolve.
>> >> >> >
>> >> >> > After all these years, migration is now heavily based on a threaded model.
>> >> >> >
>> >> >> > Now there's still a major part of migration framework that is still not
>> >> >> > thread-based, which is precopy load.  We do load in a separate thread in
>> >> >> > postcopy since the 1st day postcopy was introduced, however that requires a
>> >> >> > separate state transition from precopy loading all devices first, which
>> >> >> > still happens in the main thread of a coroutine.
>> >> >> >
>> >> >> > This patch tries to move the migration incoming side to be run inside a
>> >> >> > separate thread (mig/dst/main) just like the src (mig/src/main).  The
>> >> >> > entrance to be migration_incoming_thread().
>> >> >> >
>> >> >> > Quite a few things are needed to make it fly..  One note here is we need to
>> >> >> > change all these things in one patch to not break anything.  The other way
>> >> >> > to do this is add code to make all paths (that this patch touched) be ready
>> >> >> > for either coroutine or thread.  That may cause confusions in another way.
>> >> >> > So reviewers, please take my sincere apology on the hardness of reviewing
>> >> >> > this patch: it covers a few modules at the same time, and with some risky
>> >> >> > changes.
>> >> >> >
>> >> >> > BQL Analysis
>> >> >> > ============
>> >> >> >
>> >> >> > Firstly, when moving it over to the thread, it means the thread cannot take
>> >> >> > BQL during the whole process of loading anymore, because otherwise it can
>> >> >> > block main thread from using the BQL for all kinds of other concurrent
>> >> >> > tasks (for example, processing QMP / HMP commands).
>> >> >> >
>> >> >> > Here the first question to ask is: what needs BQL during precopy load, and
>> >> >> > what doesn't?
>> >> >> >
>> >> >> 
>> >> >> I just noticed that the BQL held at process_incoming_migration_co is
>> >> >> also responsible for stopping qmp_migrate_set_capabilities from being
>> >> >> dispatched.
>> >> >
>> >> > I don't know if it is by design, or even if it will be guaranteed to work..
>> >> >
>> >> 
>> >> Regardless, we shouldn't rely on the BQL for this. The BQL should be
>> >> left as last resort for things that interact across subsystems. If
>> >> someone is issuing a migration command during a migration, the migration
>> >> code is exquisitely positioned to handle that itself.
>> >
>> > Yes I agree.
>> >
>> >> 
> >> >> > Consider the migration incoming coroutine running into qemu_get_byte(), and
> >> >> > then proactively yielding (qemu_coroutine_yield())
>> >> > when the incoming port is blocked on read..
>> >> >
>> >> > AFAIU, a proper fix for that (note, this will currently break tests) is:
>> >> >
>> >> > bool migration_is_running(void)
>> >> >  {
>> >> > -    MigrationState *s = current_migration;
>> >> > +    MigrationStatus state;
>> >> >  
>> >> > -    if (!s) {
>> >> > -        return false;
>> >> > +    if (runstate_check(RUN_STATE_INMIGRATE)) {
>> >> > +        MigrationIncomingState *mis = migration_incoming_get_current();
>> >> > +
>> >> > +        if (!mis) {
>> >> > +            return false;
>> >> > +        }
>> >> > +
>> >> > +        state = mis->state;
>> >> > +    } else {
>> >> > +        MigrationState *s = migrate_get_current();
>> >> > +
>> >> > +        if (!s) {
>> >> > +            return false;
>> >> > +        }
>> >> > +
>> >> > +        state = s->state;
>> >> >      }
>> >> >  
>> >> > -    switch (s->state) {
>> >> > +    switch (state) {
>> >> >      case MIGRATION_STATUS_ACTIVE:
>> >> >      case MIGRATION_STATUS_POSTCOPY_DEVICE:
>> >> >      case MIGRATION_STATUS_POSTCOPY_ACTIVE:
>> >> >
>> >> 
>> >> LGTM
>> >> 
>> >> >> 
>> >> >> At any point during incoming migration when the BQL is unlocked we have
>> >> >> a window where a capability could be changed. Same for parameters, for
>> >> >> that matter.
>> >> >> 
>> >> >> To make matters worse, the -incoming cmdline will trigger
>> >> >> qmp_migrate_incoming->...->migration_transport_compatible early on, but
>> >> >> until the channels finally connect and process_incoming_migration_co
>> >> >> starts it's possible to just change a capability in an incompatible way
>> >> >> and the transport will never be validated again.
>> >> >
>> >> > Right.  Above should fix it, but I believe it also means after "-incoming
>> >> > tcp:xxx" (or anything not "defer") we should forbid changing migration caps
>> >> > or params on destination.
>> >> >
>> >> 
>> >> Parameters are never forbidden, right? And we cannot forbid them with
>> >> is_running because some parameters are allowed to be changed while
>> >> running.
>> >
>> > Right, my above statement was definitely inaccurate.
>> >
>> > After merging caps and params we only have params.  We should only allow
>> > some params to be changed anytime.  Most of the params (including caps)
>> > should not be allowed to change while migration is live on either src/dst.
>> >
>> >> 
>> >> I feel we should have a more fine grained way of saying "this option
>> >> cannot be set at this moment", instead of just using the state as a
>> >> proxy. States can change, while the fact that from a certain point on,
>> >> certain options should not be touched anymore doesn't change.
>> >
>> > IIUC that's what migration_is_running() is about?
>> >
>> 
>> At a high level, yes, but I think that's actually a downside of
>> migration_is_running. It's not explicit.
>> 
>> E.g.: qmp_migrate -> migration_transport_compatible is the first time
>> capabilities are checked in the code. Guess what migration_is_running
>> returns at that point? (it's false)
>> 
>> If there's ever a change in BQL behavior, there'll be a bug and we'll
>> see another patch extending the scope of migration_is_running, either
>> adding existing states to it, or as Prasad proposed, adding a new state
>> that is set a little later/earlier.
>> 
>> What we technically want is to stop accepting new capabilities as soon
>> as we're about to use them for the first time. Not as soon as (state ==
>> ACTIVE || SETUP || ...), which is a more high-level definition.
>
> Isn't migration_incoming_state_setup() invoked almost immediately (after the
> migration_transport_compatible() check passes), which will start to make
> the retval of migration_is_running() reflect the fact?
>
> IMHO as long as QMP handlers are run with some lock held (currently, the BQL),
> such a layout should keep working?  That will make sure QMP "migrate" and
> "migrate_set_parameters" are serialized, which is, AFAICT, all we need.
>
>> 
>> > It's a matter of whether such finer granularity is necessary for us. Let's
>> > assume the simple scheme is:
>> >
>> >   (a) Some params are allowed to be modified anytime, examples,
>> >       downtime-limit, max-bandwidth, max-postcopy-bandwidth, etc.
>> >
>> 
>> What about during migration_cleanup() or migrate-cancel? Won't there
>> ever be something in the path of setting these params that would require
>> a "healthy" migration?
>
> If we introduce FAILING, then together with CANCELLING I think we should cover
> all races against migration_cleanup()?  As long as we have both FAILING
> and CANCELLING return true from migration_is_running().
>
>> 
>> >   (b) All the rest params and all capabilities are not allowed to be modified
>> >       when migration_is_running() returns true (my fixed version above)
>> >
>> > The whitelisted (a) should really be the smallest set of params, and
>> > justified one by one or it should fall into (b).
>> >
>> > My hope is the simple scheme should work for us already.  After merging
>> > caps, it's only about some params that can be set anytime, rest params can
>> > only be set when migration is not running.
>> >
>> 
>> Sure, I'd just like to avoid adding another set of "if
>> (s->parameters->has_foobar)" if possible.
>> 
>> >> 
>> >> Maybe a little infra like bdrv_op_is_blocked, i.e., a list of blocked
>> >> operations. It could be set in qmp_migrate and checked in
>> >> qmp_set_parameters/caps.
>> >
>> > Any example you can think of where the above simple scheme won't work for us?
>> >
>> 
>> It works. I think it's fragile due to:
>> 
>> - the reliance on BQL to avoid concurrent invocations;
>> 
>> - the possible addition of new states for unrelated reasons
>>   risks regressions and it's just extra work to make sure everything is
>>   still blocked at the time it needs to be blocked.
>
> Yes, we'll need to make sure that when new states are introduced they are
> reported correctly in migration_is_running().  I think it's not an issue
> because we need to do it right whether or not we rely on the function to ban
> updates on caps/params.. It must always be done right, as
> migration_is_running() is also used in other cases.
>
>> 
>> - any bugs need to be fixed by moving states around, which is dubious
>>   since the migration flow itself (which the states should represent)
>>   has not changed in a while.
>> 
>>   We're still seeing patches such as dc487044d5 ("migration: Make
>>   migration_has_failed() work even for CANCELLING") and the recent patch
>>   from Prasad. So there is something changing elsewhere and exposing
>>   these bugs and it's not immediately clear what that something is (or
>>   else we would have avoided the bug).
>> 
>> To be clear, I'm not asking we solve these issues at this moment, it's
>> fine if we go forward with the simple scheme. I'm just trying to clarify
>> what I think the problems are.
>> 
>> >> 
>> >> > As discussed above, that'll at least break our qtests.  But frankly
>> >> > speaking I think that's the right thing to do..  I hope libvirt always
> >> >> > works with "defer" and never updates any caps/params after QMP
>> >> > migrate_incoming.
>> >> >
>> >> > So I wonder if I should continue with above patch, and then fix our qtests.
>> >> > Your work from the other "merge caps+params" might also work here,
> >> >> > actually, if we make sure everything will be set together with the single
> >> >> > QMP migrate_incoming command.
>> >> >
>> >> 
>> >> For incoming, yes. And this is maybe a point in favor of adding the
>> >> 'config'.
>> >> 
>> >> For outgoing, there's still the point I mentioned above about how to
>> >> restrict _some_ options to be allowed at runtime and others not.
>> >> 
>> >> > Let me know your initial thoughts, then I'll see what I can do..
>> >> >
>> >> 
>> >> We should fix the bug, I think your patch is good for that.
>> >> 
>> >> Although this kind of overlaps with some things we've been discussing
>> >> with Prasad. I'd be super happy if the code magically stopped using
>> >> QAPI's MigrationStatus for internal tracking of migration state and
>> >> blocking of commands and so on.
>> >> 
>> >> Whatever comes first =)
>> >
>> > I didn't follow as closely on the discussion there.  I don't know if
>> > changing MigrationStatus is a good idea..  we should have some really good
>> > reason to make libvirt and all mgmt need a change.. but maybe I misread the
>> > discussion.  I can wait until I read something solid if Prasad is going to
>> > propose something.
>> >
>> 
>> We're moving towards decoupling MigrationStatus from these ordinary code
>> checks. MigrationStatus should be used for informing management about a
>> relevant change in the migration's stage of progress.
>> 
>> As it stands, we're either "exposing implementation details" or
>> "inadvertently changing the API". Both are undesirable.
>> 
>> Of course, we might introduce the "internal state is out of sync with
>> externally visible states" issue, but hopefully we'll catch that during
>> review. =)
>
> I need to confess I'm still a bit lost on what this is about, and what
> problem we're solving here.. But I can always wait and read the patches
> when they come up.. :)
>
>> 
>> >> 
>> >> ---
>> >> Side note, did we ever discuss something like this?
>> >> 
>> >> struct MigrationState {
>> >>    <state>
>> >>    union {
>> >>      <outgoing>
>> >>      <incoming>
>> >>    }
>> >> }
>> >> 
>> >> there's so much stuff in these structs...
>> >
>> > Yeah..  When merging state, we'll also need to be careful about overlapping
>> > fields, e.g. expecting a COMPLETED state (from a completed incoming
>> > migration) when starting a new outgoing migration.
>> >
>> > We can likely leave this for later.
>> 
>> No worries, I'm just collecting your opinions on it.
>> 

Another one for the pile:

#5  0x00007f0beda4fb32 in __assert_fail (assertion=0x55b8c3ed8cb8 "mode >= 0 ...) at assert.c:103
#6  0x000055b8c3a1d7a9 in migrate_mode () at ../migration/options.c:882
#7  0x000055b8c3a1084e in fill_source_migration_info (info=0x55b8f1291650) at ../migration/migration.c:1322
#8  0x000055b8c3a10cae in qmp_query_migrate (errp=0x7fff5742ae80) at ../migration/migration.c:1438
#9  0x000055b8c3d4bc2a in qmp_marshal_query_migrate (args=0x55b8f11f2280, ret=0x7f0becd25da8, errp=0x7f0becd25da0) at qapi/qapi-commands-migration.c:48
#10 0x000055b8c3d9a13b in do_qmp_dispatch_bh (opaque=0x7f0becd25e40) at ../qapi/qmp-dispatch.c:128
#11 0x000055b8c3dc6366 in aio_bh_call (bh=0x55b8f12922d0) at ../util/async.c:173
#12 0x000055b8c3dc6482 in aio_bh_poll (ctx=0x55b8f10741e0) at ../util/async.c:220
#13 0x000055b8c3da9832 in aio_poll (ctx=0x55b8f10741e0, blocking=false) at ../util/aio-posix.c:719
#14 0x000055b8c3cea017 in monitor_cleanup () at ../monitor/monitor.c:676
#15 0x000055b8c39ef646 in qemu_cleanup (status=0) at ../system/runstate.c:999
#16 0x000055b8c3cec38e in qemu_default_main (opaque=0x0) at ../system/main.c:51
#17 0x000055b8c3cec430 in main (argc=33, argv=0x7fff5742b208) at ../system/main.c:93

(gdb) p/x mode
$8 = 0xcccccccc
Re: [PATCH 08/13] migration: Thread-ify precopy vmstate load process
Posted by Peter Xu 2 weeks, 5 days ago
On Fri, Jan 16, 2026 at 06:48:21PM -0300, Fabiano Rosas wrote:
> Another one for the pile:
> 
> #5  0x00007f0beda4fb32 in __assert_fail (assertion=0x55b8c3ed8cb8 "mode >= 0 ...) at assert.c:103
> #6  0x000055b8c3a1d7a9 in migrate_mode () at ../migration/options.c:882
> #7  0x000055b8c3a1084e in fill_source_migration_info (info=0x55b8f1291650) at ../migration/migration.c:1322
> #8  0x000055b8c3a10cae in qmp_query_migrate (errp=0x7fff5742ae80) at ../migration/migration.c:1438
> #9  0x000055b8c3d4bc2a in qmp_marshal_query_migrate (args=0x55b8f11f2280, ret=0x7f0becd25da8, errp=0x7f0becd25da0) at qapi/qapi-commands-migration.c:48
> #10 0x000055b8c3d9a13b in do_qmp_dispatch_bh (opaque=0x7f0becd25e40) at ../qapi/qmp-dispatch.c:128
> #11 0x000055b8c3dc6366 in aio_bh_call (bh=0x55b8f12922d0) at ../util/async.c:173
> #12 0x000055b8c3dc6482 in aio_bh_poll (ctx=0x55b8f10741e0) at ../util/async.c:220
> #13 0x000055b8c3da9832 in aio_poll (ctx=0x55b8f10741e0, blocking=false) at ../util/aio-posix.c:719
> #14 0x000055b8c3cea017 in monitor_cleanup () at ../monitor/monitor.c:676
> #15 0x000055b8c39ef646 in qemu_cleanup (status=0) at ../system/runstate.c:999
> #16 0x000055b8c3cec38e in qemu_default_main (opaque=0x0) at ../system/main.c:51
> #17 0x000055b8c3cec430 in main (argc=33, argv=0x7fff5742b208) at ../system/main.c:93
> 
> (gdb) p/x mode
> $8 = 0xcccccccc

What's the reproducer?  Is it easy to reproduce?

I wonder if current_migration was released already, or whether the monitor
should still process any QMP handler while the VM is shutting down..

Is this only happening after this series is applied?  I can't yet see how
the thread-ify change affected it..

-- 
Peter Xu
Re: [PATCH 08/13] migration: Thread-ify precopy vmstate load process
Posted by Fabiano Rosas 2 weeks, 5 days ago
Peter Xu <peterx@redhat.com> writes:

> On Fri, Jan 16, 2026 at 06:48:21PM -0300, Fabiano Rosas wrote:
>> Another one for the pile:
>> 
>> #5  0x00007f0beda4fb32 in __assert_fail (assertion=0x55b8c3ed8cb8 "mode >= 0 ...) at assert.c:103
>> #6  0x000055b8c3a1d7a9 in migrate_mode () at ../migration/options.c:882
>> #7  0x000055b8c3a1084e in fill_source_migration_info (info=0x55b8f1291650) at ../migration/migration.c:1322
>> #8  0x000055b8c3a10cae in qmp_query_migrate (errp=0x7fff5742ae80) at ../migration/migration.c:1438
>> #9  0x000055b8c3d4bc2a in qmp_marshal_query_migrate (args=0x55b8f11f2280, ret=0x7f0becd25da8, errp=0x7f0becd25da0) at qapi/qapi-commands-migration.c:48
>> #10 0x000055b8c3d9a13b in do_qmp_dispatch_bh (opaque=0x7f0becd25e40) at ../qapi/qmp-dispatch.c:128
>> #11 0x000055b8c3dc6366 in aio_bh_call (bh=0x55b8f12922d0) at ../util/async.c:173
>> #12 0x000055b8c3dc6482 in aio_bh_poll (ctx=0x55b8f10741e0) at ../util/async.c:220
>> #13 0x000055b8c3da9832 in aio_poll (ctx=0x55b8f10741e0, blocking=false) at ../util/aio-posix.c:719
>> #14 0x000055b8c3cea017 in monitor_cleanup () at ../monitor/monitor.c:676
>> #15 0x000055b8c39ef646 in qemu_cleanup (status=0) at ../system/runstate.c:999
>> #16 0x000055b8c3cec38e in qemu_default_main (opaque=0x0) at ../system/main.c:51
>> #17 0x000055b8c3cec430 in main (argc=33, argv=0x7fff5742b208) at ../system/main.c:93
>> 
>> (gdb) p/x mode
>> $8 = 0xcccccccc
>
> What's the reproducer?  Is it easy to reproduce?
>

Just hammering some make check instances in parallel, as usual.

> I wonder if current_migration was released already, or whether the monitor
> should still process any QMP handler while the VM is shutting down..
>

monitor_cleanup says it will continue dispatching and just never send a
response to the command. It seems migration will need its own way of
tracking this.

Maybe as a first step we could clear current_migration at
migration_instance_finalize().
(I see device-introspect-test doesn't like it. I'll look into it)
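
Something like this, perhaps (only a sketch, assuming the usual instance
checker cast):

static void migration_instance_finalize(Object *obj)
{
    MigrationState *s = MIGRATION_OBJ(obj);

    /* ... existing cleanup ... */

    if (current_migration == s) {
        current_migration = NULL;
    }
}

with migrate_get_current() callers then having to cope with a NULL return
during shutdown.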

> Is this only happening after this series is applied?  I can't yet see how
> the thread-ify change affected it..

No, sorry, I posted here because I think this affects this series'
assumptions about the BQL.
Re: [PATCH 08/13] migration: Thread-ify precopy vmstate load process
Posted by Peter Xu 2 weeks, 5 days ago
On Tue, Jan 20, 2026 at 03:54:19PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Fri, Jan 16, 2026 at 06:48:21PM -0300, Fabiano Rosas wrote:
> >> Another one for the pile:
> >> 
> >> #5  0x00007f0beda4fb32 in __assert_fail (assertion=0x55b8c3ed8cb8 "mode >= 0 ...) at assert.c:103
> >> #6  0x000055b8c3a1d7a9 in migrate_mode () at ../migration/options.c:882
> >> #7  0x000055b8c3a1084e in fill_source_migration_info (info=0x55b8f1291650) at ../migration/migration.c:1322
> >> #8  0x000055b8c3a10cae in qmp_query_migrate (errp=0x7fff5742ae80) at ../migration/migration.c:1438
> >> #9  0x000055b8c3d4bc2a in qmp_marshal_query_migrate (args=0x55b8f11f2280, ret=0x7f0becd25da8, errp=0x7f0becd25da0) at qapi/qapi-commands-migration.c:48
> >> #10 0x000055b8c3d9a13b in do_qmp_dispatch_bh (opaque=0x7f0becd25e40) at ../qapi/qmp-dispatch.c:128
> >> #11 0x000055b8c3dc6366 in aio_bh_call (bh=0x55b8f12922d0) at ../util/async.c:173
> >> #12 0x000055b8c3dc6482 in aio_bh_poll (ctx=0x55b8f10741e0) at ../util/async.c:220
> >> #13 0x000055b8c3da9832 in aio_poll (ctx=0x55b8f10741e0, blocking=false) at ../util/aio-posix.c:719
> >> #14 0x000055b8c3cea017 in monitor_cleanup () at ../monitor/monitor.c:676
> >> #15 0x000055b8c39ef646 in qemu_cleanup (status=0) at ../system/runstate.c:999
> >> #16 0x000055b8c3cec38e in qemu_default_main (opaque=0x0) at ../system/main.c:51
> >> #17 0x000055b8c3cec430 in main (argc=33, argv=0x7fff5742b208) at ../system/main.c:93
> >> 
> >> (gdb) p/x mode
> >> $8 = 0xcccccccc
> >
> > What's the reproducer?  Is it easy to reproduce?
> >
> 
> Just hammering some make check instances in parallel, as usual.
> 
> > I wonder if current_migration was released already, or whether the monitor
> > should still process any QMP handler while the VM is shutting down..
> >
> 
> monitor_cleanup says it will continue dispatching and just never send a
> response to the command. It seems migration will need its own way of
> tracking this.
> 
> Maybe as a first step we could clear current_migration at
> migration_instance_finalize().
> (I see device-introspect-test doesn't like it. I'll look into it)
> 
> > Is this only happening after this series is applied?  I can't yet see how
> > the thread-ify change affected it..
> 
> No, sorry, I posted here because I think this affects this series'
> assumptions about the BQL.

Aha, you reminded me of this work, I'll need to check it out again (I forgot
most of it.. as usual..):

https://lore.kernel.org/qemu-devel/20241029211607.2114845-8-peterx@redhat.com

I also recall that, besides the singleton idea, we can take a mutex and
release current_migration at finalize(); fetching current_migration would
then need to take a refcount, with the mutex held.
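
Roughly, such an accessor could look like this (a sketch; the helper names
are made up):

static QemuMutex current_migration_lock;

MigrationState *migrate_get_current_ref(void)
{
    MigrationState *s;

    QEMU_LOCK_GUARD(&current_migration_lock);
    s = current_migration;
    if (s) {
        /* Keep the instance alive until the caller drops the reference */
        object_ref(OBJECT(s));
    }
    return s;
}

/* callers pair it with object_unref(OBJECT(s)) when done */

with migration_instance_finalize() taking the same mutex while clearing
current_migration.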

Anyway, thanks for looking into it.  Now, it's all yours. :)

-- 
Peter Xu
Re: [PATCH 08/13] migration: Thread-ify precopy vmstate load process
Posted by Zhijian Li (Fujitsu) 3 months, 1 week ago

On 23/10/2025 03:26, Peter Xu wrote:
> Migration module was there for 10+ years.  Initially, it was in most cases
> based on coroutines.  As more features were added into the framework, like
> postcopy, multifd, etc.. it became a mixture of threads and coroutines.
> 
> I'm guessing coroutines just can't fix all issues that migration want to
> resolve.
> 
> After all these years, migration is now heavily based on a threaded model.
> 
> Now there's still a major part of migration framework that is still not
> thread-based, which is precopy load.  We do load in a separate thread in
> postcopy since the 1st day postcopy was introduced, however that requires a
> separate state transition from precopy loading all devices first, which
> still happens in the main thread of a coroutine.
> 
> This patch tries to move the migration incoming side to be run inside a
> separate thread (mig/dst/main) just like the src (mig/src/main).  The
> entrance to be migration_incoming_thread().
> 
> Quite a few things are needed to make it fly..  One note here is we need to
> change all these things in one patch to not break anything.  The other way
> to do this is add code to make all paths (that this patch touched) be ready
> for either coroutine or thread.  That may cause confusions in another way.
> So reviewers, please take my sincere apology on the hardness of reviewing
> this patch: it covers a few modules at the same time, and with some risky
> changes.
> 
> BQL Analysis
> ============
> 
> Firstly, when moving it over to the thread, it means the thread cannot take
> BQL during the whole process of loading anymore, because otherwise it can
> block main thread from using the BQL for all kinds of other concurrent
> tasks (for example, processing QMP / HMP commands).
> 
> Here the first question to ask is: what needs BQL during precopy load, and
> what doesn't?
> 
> Most of the load process shouldn't need BQL, especially when it's about
> RAM.  After all, RAM is still the major chunk of data to move for a live
> migration process.  VFIO started to change that, though, but still, VFIO is
> per-device so that shouldn't need BQL either in most cases.
> 
> Generic device loads will need BQL, likely not when receiving VMSDs, but
> when applying them.  One example is any post_load() could potentially
> inject memory regions causing memory transactions to happen.  That'll need
> to update the global address spaces, hence requires BQL.  The other one is
> CPU sync operations, even if the sync alone may not need BQL (which is
> still to be further justified), run_on_cpu() will need it.
> 
> For that, qemu_loadvm_state() and qemu_loadvm_state_main() functions need
> to now take a "bql_held" parameter saying whether bql is held.  We could
> use things like BQL_LOCK_GUARD(), but this patch goes with explicit
> lockings rather than relying on bql_locked TLS variable.  In case of
> migration, we always know whether BQL is held in different context as long
> as we can still pass that information downwards.
> 
> COLO
> ====
> 
> COLO assumed the dest VM load happens in a coroutine.  After this patch,
> it's not anymore.  Change that by invoking colo_incoming_co() directly from
> the migration_incoming_thread().
> 
> The name (colo_incoming_co()) isn't proper anymore.  Change it to
> colo_incoming_wait(), removing the coroutine annotation alongside.
> 
> Remove all the bql_lock() implications in COLO, e.g., colo_incoming_co()
> used to release the lock for a short period while join().  Now it's not
> needed.  Instead, taking BQL but only when needed (colo_release_ram_cache).
> 
> At the meantime, there's colo_incoming_co variable that used to store the
> COLO incoming coroutine, only to be kicked off when a secondary failover
> happens.
> 
> To recap, what should happen for such failover should be (taking example of
> a QMP command x-colo-lost-heartbeat triggering on dest QEMU):
> 
>    - The QMP command will kick off both the coroutine and the COLO
>      thread (colo_process_incoming_thread()), with something like:
> 
>      /* Notify COLO incoming thread that failover work is finished */
>      qemu_event_set(&mis->colo_incoming_event);
> 
>      qemu_coroutine_enter(mis->colo_incoming_co);
> 
>    - The coroutine, which yielded itself before, now resumes after enter(),
>      then it'll wait for the join():
> 
>      mis->colo_incoming_co = qemu_coroutine_self();
>      qemu_coroutine_yield();
>      mis->colo_incoming_co = NULL;
> 
>      /* Wait checkpoint incoming thread exit before free resource */
>      qemu_thread_join(&th);
> 
> Here, when switching to a thread model, it should be fine removing
> colo_incoming_co variable completely, because if so, the incoming thread
> will (instead of yielding the coroutine) wait at qemu_thread_join() until
> the colo thread completes execution (after receiving colo_incoming_event).
> 
> RDMA
> ====
> 
> With the prior patch making sure io_watch won't block for RDMA iochannels,
> RDMA threads should only block at its io_readv/io_writev functions.  When a
> disconnection is detected (as in rdma_cm_poll_handler()), the update to
> "errored" field will be immediately reflected in the migration incoming
> thread.  Hence the coroutine for RDMA is not needed anymore to kick the
> thread out.
> 
> When the thread is available, we also can't have rdma_cm_poll_handler()
> keep polling the fd and operate on it in the main thread.  Drop it
> completely, and it should be fine because qemu_rdma_wait_comp_channel()
> should also be monitoring it.
> 
> This almost reverts commit 923709896b1b01fb982c93492ad01b233e6b6023.
> 
> We need to do this change in this same patch that we introduce the thread,
> unfortunately, otherwise we can have a risk of racing.
> 
> TODO
> ====
> 
> Currently the BQL is taken during loading of a START|FULL section.  When
> the IO hangs (e.g. network issue) during this process, it could potentially
> block others like the monitor servers.  One solution is breaking BQL to
> smaller granule and leave IOs to be always BQL-free.  That'll need more
> justifications.
> 
> For example, there are at least four things that need some closer
> attention:
> 
>    - SaveVMHandlers's load_state(): this likely DO NOT need BQL, but we need
>    to justify all of them (not to mention, some of them look like prone to
>    be rewritten as VMSDs..)
> 
>    - VMSD's pre_load(): in most cases, this DO NOT really need BQL, but
>    sometimes maybe it will!  Double checking on this will be needed.
> 
>    - VMSD's post_load(): in many cases, this DO need BQL, for example on
>    address space operations.  Likely we should just take it for any
>    post_load().
> 
>    - VMSD field's get(): this is tricky!  It could internally be anything
>    even if it was only a field.  E.g. there can be users to use a SINGLE
>    field to load a whole VMSD, which can further introduce more
>    possibilities.
> 
> In general, QEMUFile IOs should not need BQL, that is when receiving the
> VMSD data and waiting for e.g. the socket buffer to get refilled.  But
> that's the easy part.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>


Both COLO and RDMA changes look good to me

Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> # COLO and RDMA

And with an additional fix [1] for COLO, for the whole patchset:

Tested-by: Li Zhijian <lizhijian@fujitsu.com> # COLO and RDMA

[1] https://lore.kernel.org/qemu-devel/20251104013606.1937764-1-lizhijian@fujitsu.com/


> ---
>   include/migration/colo.h |  6 ++--
>   migration/migration.h    | 14 +++-----
>   migration/colo-stubs.c   |  2 +-
>   migration/colo.c         | 24 ++++---------
>   migration/migration.c    | 77 +++++++++++++++++++++++++---------------
>   migration/rdma.c         | 34 +-----------------
>   migration/savevm.c       |  8 ++---
>   migration/trace-events   |  4 +--
>   8 files changed, 69 insertions(+), 100 deletions(-)
> 
> diff --git a/include/migration/colo.h b/include/migration/colo.h
> index d4fe422e4d..5de7d715a7 100644
> --- a/include/migration/colo.h
> +++ b/include/migration/colo.h
> @@ -44,12 +44,10 @@ void colo_do_failover(void);
>   void colo_checkpoint_delay_set(void);
>   
>   /*
> - * Starts COLO incoming process. Called from process_incoming_migration_co()
> + * Starts COLO incoming process. Called from migration_incoming_thread()
>    * after loading the state.
> - *
> - * Called with BQL locked, may temporary release BQL.
>    */
> -void coroutine_fn colo_incoming_co(void);
> +void colo_incoming_wait(void);
>   
>   void colo_shutdown(void);
>   #endif
> diff --git a/migration/migration.h b/migration/migration.h
> index e1c5029110..0d22dc8cc2 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -214,6 +214,10 @@ struct MigrationIncomingState {
>       bool           have_listen_thread;
>       QemuThread     listen_thread;
>   
> +    /* Migration main recv thread */
> +    bool           have_recv_thread;
> +    QemuThread     recv_thread;
> +
>       /* For the kernel to send us notifications */
>       int       userfault_fd;
>       /* To notify the fault_thread to wake, e.g., when need to quit */
> @@ -272,15 +276,7 @@ struct MigrationIncomingState {
>   
>       MigrationStatus state;
>   
> -    /*
> -     * The incoming migration coroutine, non-NULL during qemu_loadvm_state().
> -     * Used to wake the migration incoming coroutine from rdma code. How much is
> -     * it safe - it's a question.
> -     */
> -    Coroutine *loadvm_co;
> -
> -    /* The coroutine we should enter (back) after failover */
> -    Coroutine *colo_incoming_co;
> +    /* Notify secondary VM to move on */
>       QemuEvent colo_incoming_event;
>   
>       /* Optional load threads pool and its thread exit request flag */
> diff --git a/migration/colo-stubs.c b/migration/colo-stubs.c
> index e22ce65234..ef77d1ab4b 100644
> --- a/migration/colo-stubs.c
> +++ b/migration/colo-stubs.c
> @@ -9,7 +9,7 @@ void colo_shutdown(void)
>   {
>   }
>   
> -void coroutine_fn colo_incoming_co(void)
> +void colo_incoming_wait(void)
>   {
>   }
>   
> diff --git a/migration/colo.c b/migration/colo.c
> index 4fd586951a..81276a3e65 100644
> --- a/migration/colo.c
> +++ b/migration/colo.c
> @@ -147,11 +147,6 @@ static void secondary_vm_do_failover(void)
>       }
>       /* Notify COLO incoming thread that failover work is finished */
>       qemu_event_set(&mis->colo_incoming_event);
> -
> -    /* For Secondary VM, jump to incoming co */
> -    if (mis->colo_incoming_co) {
> -        qemu_coroutine_enter(mis->colo_incoming_co);
> -    }
>   }
>   
>   static void primary_vm_do_failover(void)
> @@ -848,10 +843,8 @@ static void *colo_process_incoming_thread(void *opaque)
>   
>       mis->to_src_file = qemu_file_get_return_path(mis->from_src_file);
>       /*
> -     * Note: the communication between Primary side and Secondary side
> -     * should be sequential, we set the fd to unblocked in migration incoming
> -     * coroutine, and here we are in the COLO incoming thread, so it is ok to
> -     * set the fd back to blocked.
> +     * Here we are in the COLO incoming thread, so it is ok to set the fd
> +     * to blocking.
>        */
>       if (!qemu_file_set_blocking(mis->from_src_file, true, &local_err)) {
>           error_report_err(local_err);
> @@ -927,27 +920,22 @@ out:
>       return NULL;
>   }
>   
> -void coroutine_fn colo_incoming_co(void)
> +/* Wait for failover */
> +void colo_incoming_wait(void)
>   {
>       MigrationIncomingState *mis = migration_incoming_get_current();
>       QemuThread th;
>   
> -    assert(bql_locked());
>       assert(migration_incoming_colo_enabled());
>   
>       qemu_thread_create(&th, MIGRATION_THREAD_DST_COLO,
>                          colo_process_incoming_thread,
>                          mis, QEMU_THREAD_JOINABLE);
>   
> -    mis->colo_incoming_co = qemu_coroutine_self();
> -    qemu_coroutine_yield();
> -    mis->colo_incoming_co = NULL;
> -
> -    bql_unlock();
>       /* Wait checkpoint incoming thread exit before free resource */
>       qemu_thread_join(&th);
> -    bql_lock();
>   
> -    /* We hold the global BQL, so it is safe here */
> +    bql_lock();
>       colo_release_ram_cache();
> +    bql_unlock();
>   }
> diff --git a/migration/migration.c b/migration/migration.c
> index 38a584afae..728d02dbee 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -491,6 +491,11 @@ void migration_incoming_state_destroy(void)
>           mis->postcopy_qemufile_dst = NULL;
>       }
>   
> +    if (mis->have_recv_thread) {
> +        qemu_thread_join(&mis->recv_thread);
> +        mis->have_recv_thread = false;
> +    }
> +
>       cpr_set_incoming_mode(MIG_MODE_NONE);
>       yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>   }
> @@ -861,30 +866,46 @@ static void process_incoming_migration_bh(void *opaque)
>       migration_incoming_state_destroy();
>   }
>   
> -static void coroutine_fn
> -process_incoming_migration_co(void *opaque)
> +static void migration_incoming_state_destroy_bh(void *opaque)
> +{
> +    struct MigrationIncomingState *mis = opaque;
> +
> +    migration_incoming_state_destroy();
> +
> +    if (mis->exit_on_error) {
> +        /*
> +         * NOTE: this exit() should better happen in the main thread, as
> +         * the exit notifier may require BQL which can deadlock.  See
> +         * commit e7bc0204e57836 for example.
> +         */
> +        exit(EXIT_FAILURE);
> +    }
> +}
> +
> +static void *migration_incoming_thread(void *opaque)
>   {
>       MigrationState *s = migrate_get_current();
> -    MigrationIncomingState *mis = migration_incoming_get_current();
> +    MigrationIncomingState *mis = opaque;
>       PostcopyState ps;
>       int ret;
>       Error *local_err = NULL;
>   
> +    rcu_register_thread();
> +
>       assert(mis->from_src_file);
> +    assert(!bql_locked());
>   
>       mis->largest_page_size = qemu_ram_pagesize_largest();
>       postcopy_state_set(POSTCOPY_INCOMING_NONE);
>       migrate_set_state(&mis->state, MIGRATION_STATUS_SETUP,
>                         MIGRATION_STATUS_ACTIVE);
>   
> -    mis->loadvm_co = qemu_coroutine_self();
> -    ret = qemu_loadvm_state(mis->from_src_file, true, &local_err);
> -    mis->loadvm_co = NULL;
> +    ret = qemu_loadvm_state(mis->from_src_file, false, &local_err);
>   
>       trace_vmstate_downtime_checkpoint("dst-precopy-loadvm-completed");
>   
>       ps = postcopy_state_get();
> -    trace_process_incoming_migration_co_end(ret, ps);
> +    trace_process_incoming_migration_end(ret, ps);
>       if (ps != POSTCOPY_INCOMING_NONE) {
>           if (ps == POSTCOPY_INCOMING_ADVISE) {
>               /*
> @@ -898,7 +919,7 @@ process_incoming_migration_co(void *opaque)
>                * Postcopy was started, cleanup should happen at the end of the
>                * postcopy thread.
>                */
> -            trace_process_incoming_migration_co_postcopy_end_main();
> +            trace_process_incoming_migration_postcopy_end_main();
>               goto out;
>           }
>           /* Else if something went wrong then just fall out of the normal exit */
> @@ -911,8 +932,8 @@ process_incoming_migration_co(void *opaque)
>       }
>   
>       if (migration_incoming_colo_enabled()) {
> -        /* yield until COLO exit */
> -        colo_incoming_co();
> +        /* wait until COLO exits */
> +        colo_incoming_wait();
>       }
>   
>       migration_bh_schedule(process_incoming_migration_bh, mis);
> @@ -924,28 +945,22 @@ fail:
>       migrate_set_error(s, local_err);
>       error_free(local_err);
>   
> -    migration_incoming_state_destroy();
> +    WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> +        error_report_err(s->error);
> +        s->error = NULL;
> +    }
>   
> -    if (mis->exit_on_error) {
> -        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> -            error_report_err(s->error);
> -            s->error = NULL;
> -        }
> +    /*
> +     * There's some step of the destroy process that will need to happen in
> +     * the main thread (e.g. joining this thread itself).  Leave to a BH.
> +     */
> +    migration_bh_schedule(migration_incoming_state_destroy_bh, (void *)mis);
>   
> -        exit(EXIT_FAILURE);
> -    } else {
> -        /*
> -         * Report the error here in case that QEMU abruptly exits
> -         * when postcopy is enabled.
> -         */
> -        WITH_QEMU_LOCK_GUARD(&s->error_mutex) {
> -            error_report_err(s->error);
> -            s->error = NULL;
> -        }
> -    }
>   out:
>       /* Pairs with the refcount taken in qmp_migrate_incoming() */
>       migrate_incoming_unref_outgoing_state();
> +    rcu_unregister_thread();
> +    return NULL;
>   }
>   
>   /**
> @@ -963,8 +978,12 @@ static void migration_incoming_setup(QEMUFile *f)
>   
>   void migration_incoming_process(void)
>   {
> -    Coroutine *co = qemu_coroutine_create(process_incoming_migration_co, NULL);
> -    qemu_coroutine_enter(co);
> +    MigrationIncomingState *mis = migration_incoming_get_current();
> +
> +    mis->have_recv_thread = true;
> +    qemu_thread_create(&mis->recv_thread, "mig/dst/main",
> +                       migration_incoming_thread, mis,
> +                       QEMU_THREAD_JOINABLE);
>   }
>   
>   /* Returns true if recovered from a paused migration, otherwise false */
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 0e5e02cdca..3389f6448b 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -3051,37 +3051,6 @@ int rdma_control_save_page(QEMUFile *f, ram_addr_t block_offset,
>   
>   static void rdma_accept_incoming_migration(void *opaque);
>   
> -static void rdma_cm_poll_handler(void *opaque)
> -{
> -    RDMAContext *rdma = opaque;
> -    struct rdma_cm_event *cm_event;
> -    MigrationIncomingState *mis = migration_incoming_get_current();
> -
> -    if (rdma_get_cm_event(rdma->channel, &cm_event) < 0) {
> -        error_report("get_cm_event failed %d", errno);
> -        return;
> -    }
> -
> -    if (cm_event->event == RDMA_CM_EVENT_DISCONNECTED ||
> -        cm_event->event == RDMA_CM_EVENT_DEVICE_REMOVAL) {
> -        if (!rdma->errored &&
> -            migration_incoming_get_current()->state !=
> -              MIGRATION_STATUS_COMPLETED) {
> -            error_report("receive cm event, cm event is %d", cm_event->event);
> -            rdma->errored = true;
> -            if (rdma->return_path) {
> -                rdma->return_path->errored = true;
> -            }
> -        }
> -        rdma_ack_cm_event(cm_event);
> -        if (mis->loadvm_co) {
> -            qemu_coroutine_enter(mis->loadvm_co);
> -        }
> -        return;
> -    }
> -    rdma_ack_cm_event(cm_event);
> -}
> -
>   static int qemu_rdma_accept(RDMAContext *rdma)
>   {
>       Error *err = NULL;
> @@ -3199,8 +3168,7 @@ static int qemu_rdma_accept(RDMAContext *rdma)
>                               NULL,
>                               (void *)(intptr_t)rdma->return_path);
>       } else {
> -        qemu_set_fd_handler(rdma->channel->fd, rdma_cm_poll_handler,
> -                            NULL, rdma);
> +        qemu_set_fd_handler(rdma->channel->fd, NULL, NULL, NULL);
>       }
>   
>       ret = rdma_accept(rdma->cm_id, &conn_param);
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 44aadc2f51..991f46593c 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2118,7 +2118,7 @@ static void *postcopy_ram_listen_thread(void *opaque)
>       qemu_file_set_blocking(f, true, &error_fatal);
>   
>       /* TODO: sanity check that only postcopiable data will be loaded here */
> -    load_res = qemu_loadvm_state_main(f, mis, true, &local_err);
> +    load_res = qemu_loadvm_state_main(f, mis, false, &local_err);
>   
>       /*
>        * This is tricky, but, mis->from_src_file can change after it
> @@ -2415,11 +2415,11 @@ static void loadvm_postcopy_handle_resume(MigrationIncomingState *mis)
>    * Immediately following this command is a blob of data containing an embedded
>    * chunk of migration stream; read it and load it.
>    *
> - * @mis: Incoming state
> - * @length: Length of packaged data to read
> + * @mis:      Incoming state
> + * @bql_held: Whether BQL is held already
> + * @errp:     The Error** to set on failure.
>    *
>    * Returns: Negative values on error
> - *
>    */
>   static int loadvm_handle_cmd_packaged(MigrationIncomingState *mis,
>                                         bool bql_held, Error **errp)
> diff --git a/migration/trace-events b/migration/trace-events
> index e8edd1fbba..2b7b522e73 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -193,8 +193,8 @@ source_return_path_thread_resume_ack(uint32_t v) "%"PRIu32
>   source_return_path_thread_switchover_acked(void) ""
>   migration_thread_low_pending(uint64_t pending) "%" PRIu64
>   migrate_transferred(uint64_t transferred, uint64_t time_spent, uint64_t bandwidth, uint64_t avail_bw, uint64_t size) "transferred %" PRIu64 " time_spent %" PRIu64 " bandwidth %" PRIu64 " switchover_bw %" PRIu64 " max_size %" PRId64
> -process_incoming_migration_co_end(int ret, int ps) "ret=%d postcopy-state=%d"
> -process_incoming_migration_co_postcopy_end_main(void) ""
> +process_incoming_migration_end(int ret, int ps) "ret=%d postcopy-state=%d"
> +process_incoming_migration_postcopy_end_main(void) ""
>   postcopy_preempt_enabled(bool value) "%d"
>   migration_precopy_complete(void) ""
>   
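To make the new control flow easier to follow, here is a rough sketch of
the shape the new entry point takes, pieced together from the hunks above.
The real body is the converted process_incoming_migration_co(), so anything
beyond what the diff shows is illustrative only:

    static void *migration_incoming_thread(void *opaque)
    {
        MigrationIncomingState *mis = opaque;

        /* This thread accesses RCU-protected data; it must register itself */
        rcu_register_thread();

        /*
         * The precopy load happens here, without the BQL held.  Teardown
         * is deferred to BHs (process_incoming_migration_bh on success,
         * migration_incoming_state_destroy_bh on failure) because parts
         * of it must run in the main thread.
         */

        rcu_unregister_thread();
        return NULL;
    }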
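migration_incoming_state_destroy_bh() itself is not visible in these hunks;
a minimal sketch of what it presumably does, assuming the have_recv_thread
and recv_thread fields added above (the actual helper may do more):

    static void migration_incoming_state_destroy_bh(void *opaque)
    {
        MigrationIncomingState *mis = opaque;

        /* BHs run in the main thread, so joining mig/dst/main is safe here */
        if (mis->have_recv_thread) {
            qemu_thread_join(&mis->recv_thread);
            mis->have_recv_thread = false;
        }

        migration_incoming_state_destroy();
    }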
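For the bql_held parameter visible in the savevm.c hunks above, callees use
explicit lock/unlock calls rather than querying lock state.  An illustrative
callee would follow this pattern (do_work_needing_bql() is a made-up
placeholder, not a real function):

    static int example_load_step(bool bql_held, Error **errp)
    {
        int ret;

        /* Take the BQL only if the caller does not hold it already */
        if (!bql_held) {
            bql_lock();
        }

        /* Hypothetical work that requires the BQL, e.g. a post_load() */
        ret = do_work_needing_bql(errp);

        if (!bql_held) {
            bql_unlock();
        }

        return ret;
    }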
Re: [PATCH 08/13] migration: Thread-ify precopy vmstate load process
Posted by Peter Xu 2 months ago
On Tue, Nov 04, 2025 at 02:40:53AM +0000, Zhijian Li (Fujitsu) wrote:
> Both COLO and RDMA changes look good to me
> 
> Reviewed-by: Li Zhijian <lizhijian@fujitsu.com> # COLO and RDMA
> 
> And with an additional fix[1] for COLO, the whole patchset:
> 
> Tested-by: Li Zhijian <lizhijian@fujitsu.com> # COLO and RDMA
> 
> [1] https://lore.kernel.org/qemu-devel/20251104013606.1937764-1-lizhijian@fujitsu.com/

This is helpful, thanks a lot Zhijian.

-- 
Peter Xu