[PATCH] migration: for snapshots, hold the BQL during setup callbacks
Posted by Fiona Ebner 12 months ago
In spirit, this is a partial revert of commit 9b09503752 ("migration:
run setup callbacks out of big lock"), but only for the snapshot case.

For snapshots, the bdrv_writev_vmstate() function is used during setup
(in the QIOChannelBlock backing the QEMUFile), and calling it without
holding the BQL can lead to an assertion failure. To understand how,
first note the following:

1. Generated coroutine wrappers for block layer functions spawn the
coroutine and use AIO_WAIT_WHILE()/aio_poll() to wait for it.
2. If the host OS switches threads at an inconvenient time, it can
happen that a bottom half scheduled for the main thread's AioContext
is executed as part of a vCPU thread's aio_poll().

An example leading to the assertion failure is as follows:

main thread:
1. A snapshot-save QMP command gets issued.
2. snapshot_save_job_bh() is scheduled.

vCPU thread:
3. aio_poll() for the main thread's AioContext is called (e.g. when
the guest writes to a pflash device, as part of blk_pwrite which is a
generated coroutine wrapper).
4. snapshot_save_job_bh() is executed as part of aio_poll().
5. qemu_savevm_state() is called.
6. qemu_mutex_unlock_iothread() is called. Now
qemu_get_current_aio_context() returns 0x0.
7. bdrv_writev_vmstate() is executed during the usual savevm setup.
But this function is a generated coroutine wrapper, so it uses
AIO_WAIT_WHILE. There, the assertion
assert(qemu_get_current_aio_context() == qemu_get_aio_context());
will fail.
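
The failure can be modeled outside of QEMU with a deliberately-aborting
toy program (plain pthreads; none of the names below are QEMU code, they
are stand-ins for the functions mentioned above). A thread-local
"current AioContext" is only valid while the lock is held, so dropping
the lock and then entering an AIO_WAIT_WHILE()-style slow path trips the
same kind of assertion:

/* Toy model, not QEMU code: all types and helpers are stand-ins. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

typedef struct AioContext { const char *name; } AioContext;

static AioContext main_ctx = { "main-loop" };
static pthread_mutex_t bql = PTHREAD_MUTEX_INITIALIZER;
/* Stand-in for qemu_get_current_aio_context(): only valid under the lock. */
static __thread AioContext *current_ctx;

static void bql_lock(void)
{
    pthread_mutex_lock(&bql);
    current_ctx = &main_ctx;
}

static void bql_unlock(void)
{
    current_ctx = NULL;
    pthread_mutex_unlock(&bql);
}

/* Stand-in for a generated coroutine wrapper's AIO_WAIT_WHILE() slow path. */
static void generated_wrapper(void)
{
    assert(current_ctx == &main_ctx);
    puts("polling the main context...");
}

static void *vcpu_thread(void *arg)
{
    (void)arg;
    bql_lock();          /* vCPU thread runs the snapshot BH under the BQL   */
    bql_unlock();        /* setup used to drop the lock, as in step 6 above  */
    generated_wrapper(); /* bdrv_writev_vmstate() analogue: assertion fails  */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, vcpu_thread, NULL);
    pthread_join(t, NULL);
    return 0;
}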

To fix it, ensure that the BQL is held during setup. To avoid changing
the behavior for migration too, introduce conditionals for the setup
callbacks that need the BQL and only take the lock if it's not already
held.

Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
 include/migration/register.h   |  2 +-
 migration/block-dirty-bitmap.c | 15 ++++++++++++---
 migration/block.c              | 15 ++++++++++++---
 migration/ram.c                | 16 +++++++++++++---
 migration/savevm.c             |  2 --
 5 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/include/migration/register.h b/include/migration/register.h
index a8dfd8fefd..fa9b0b0f10 100644
--- a/include/migration/register.h
+++ b/include/migration/register.h
@@ -43,9 +43,9 @@ typedef struct SaveVMHandlers {
      * by other locks.
      */
     int (*save_live_iterate)(QEMUFile *f, void *opaque);
+    int (*save_setup)(QEMUFile *f, void *opaque);
 
     /* This runs outside the iothread lock!  */
-    int (*save_setup)(QEMUFile *f, void *opaque);
     /* Note for save_live_pending:
      * must_precopy:
      * - must be migrated in precopy or in stopped state
diff --git a/migration/block-dirty-bitmap.c b/migration/block-dirty-bitmap.c
index 6624f39bc6..4943f4d644 100644
--- a/migration/block-dirty-bitmap.c
+++ b/migration/block-dirty-bitmap.c
@@ -1217,10 +1217,17 @@ static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque)
 {
     DBMSaveState *s = &((DBMState *)opaque)->save;
     SaveBitmapState *dbms = NULL;
+    bool release_lock = false;
 
-    qemu_mutex_lock_iothread();
+    /* For snapshots, the BQL is held during setup. */
+    if (!qemu_mutex_iothread_locked()) {
+        qemu_mutex_lock_iothread();
+        release_lock = true;
+    }
     if (init_dirty_bitmap_migration(s) < 0) {
-        qemu_mutex_unlock_iothread();
+        if (release_lock) {
+            qemu_mutex_unlock_iothread();
+        }
         return -1;
     }
 
@@ -1228,7 +1235,9 @@ static int dirty_bitmap_save_setup(QEMUFile *f, void *opaque)
         send_bitmap_start(f, s, dbms);
     }
     qemu_put_bitmap_flags(f, DIRTY_BITMAP_MIG_FLAG_EOS);
-    qemu_mutex_unlock_iothread();
+    if (release_lock) {
+        qemu_mutex_unlock_iothread();
+    }
     return 0;
 }
 
diff --git a/migration/block.c b/migration/block.c
index 6d532ac7a2..399cf86cb4 100644
--- a/migration/block.c
+++ b/migration/block.c
@@ -717,21 +717,30 @@ static void block_migration_cleanup(void *opaque)
 static int block_save_setup(QEMUFile *f, void *opaque)
 {
     int ret;
+    bool release_lock = false;
 
     trace_migration_block_save("setup", block_mig_state.submitted,
                                block_mig_state.transferred);
 
-    qemu_mutex_lock_iothread();
+    /* For snapshots, the BQL is held during setup. */
+    if (!qemu_mutex_iothread_locked()) {
+        qemu_mutex_lock_iothread();
+        release_lock = true;
+    }
     ret = init_blk_migration(f);
     if (ret < 0) {
-        qemu_mutex_unlock_iothread();
+        if (release_lock) {
+            qemu_mutex_unlock_iothread();
+        }
         return ret;
     }
 
     /* start track dirty blocks */
     ret = set_dirty_tracking();
 
-    qemu_mutex_unlock_iothread();
+    if (release_lock) {
+        qemu_mutex_unlock_iothread();
+    }
 
     if (ret) {
         return ret;
diff --git a/migration/ram.c b/migration/ram.c
index 7d81c4a39e..c05c2d9506 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -3106,8 +3106,16 @@ static void migration_bitmap_clear_discarded_pages(RAMState *rs)
 
 static void ram_init_bitmaps(RAMState *rs)
 {
-    /* For memory_global_dirty_log_start below.  */
-    qemu_mutex_lock_iothread();
+    bool release_lock = false;
+
+    /*
+     * For memory_global_dirty_log_start below.
+     * For snapshots, the BQL is held during setup.
+     */
+    if (!qemu_mutex_iothread_locked()) {
+        qemu_mutex_lock_iothread();
+        release_lock = true;
+    }
     qemu_mutex_lock_ramlist();
 
     WITH_RCU_READ_LOCK_GUARD() {
@@ -3119,7 +3127,9 @@ static void ram_init_bitmaps(RAMState *rs)
         }
     }
     qemu_mutex_unlock_ramlist();
-    qemu_mutex_unlock_iothread();
+    if (release_lock) {
+        qemu_mutex_unlock_iothread();
+    }
 
     /*
      * After an eventual first bitmap sync, fixup the initial bitmap
diff --git a/migration/savevm.c b/migration/savevm.c
index a9d0a88e62..d545b01516 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -1626,10 +1626,8 @@ static int qemu_savevm_state(QEMUFile *f, Error **errp)
     memset(&compression_counters, 0, sizeof(compression_counters));
     ms->to_dst_file = f;
 
-    qemu_mutex_unlock_iothread();
     qemu_savevm_state_header(f);
     qemu_savevm_state_setup(f);
-    qemu_mutex_lock_iothread();
 
     while (qemu_file_get_error(f) == 0) {
         if (qemu_savevm_state_iterate(f, false) > 0) {
-- 
2.30.2
Re: [PATCH] migration: for snapshots, hold the BQL during setup callbacks
Posted by Peter Xu 12 months ago
On Fri, May 05, 2023 at 03:46:52PM +0200, Fiona Ebner wrote:
> To fix it, ensure that the BQL is held during setup. To avoid changing
> the behavior for migration too, introduce conditionals for the setup
> callbacks that need the BQL and only take the lock if it's not already
> held.

The major complexity of this patch is the "conditionally taking" part.

Pure question: what is the benefit of not holding BQL always during
save_setup(), if after all we have this coroutine issue to be solved?

I can understand that it helps us to avoid taking BQL too long, but we'll
need to take it anyway during ramblock dirty track initializations, and so
far IIUC it's the major time to be consumed during setup().

Commit message of 9b0950375277467 says, "Only the migration_bitmap_sync()
call needs the iothread lock". Firstly I think it's also covering
enablement of dirty tracking:

+    qemu_mutex_lock_iothread();
+    qemu_mutex_lock_ramlist();
+    bytes_transferred = 0;
+    reset_ram_globals();
+
     memory_global_dirty_log_start();
     migration_bitmap_sync();
+    qemu_mutex_unlock_iothread();

And I think enablement itself can be slow too, maybe even slower than
migration_bitmap_sync() especially with KVM_DIRTY_LOG_INITIALLY_SET
supported in the kernel.

Meanwhile I always got confused on why we need to sync the dirty bitmap
at setup time at all.  Say, what if we drop migration_bitmap_sync() here?
After all, shouldn't all pages be dirty from the start
(ram_list_init_bitmaps())?
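
To make that reasoning concrete, here is a hypothetical standalone
sketch (not QEMU's actual ram.c code): the migration bitmap starts out
with every bit set, and a sync only ORs newly-dirtied bits into it, so a
sync against an all-ones bitmap cannot change anything:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define WORDS 16   /* toy bitmap: 16 * 64 = 1024 pages */

static unsigned long bmap[WORDS];

/* Stand-in for a bitmap sync: OR the newly-dirtied bits into bmap. */
static bool sync_dirty(const unsigned long *newly_dirty)
{
    bool changed = false;
    for (int i = 0; i < WORDS; i++) {
        unsigned long old = bmap[i];
        bmap[i] |= newly_dirty[i];
        changed |= (bmap[i] != old);
    }
    return changed;
}

int main(void)
{
    unsigned long newly_dirty[WORDS] = { 0x5555, 0xffff };

    memset(bmap, 0xff, sizeof(bmap));   /* "all dirty from the start" */
    printf("sync changed the bitmap: %s\n",
           sync_dirty(newly_dirty) ? "yes" : "no");   /* prints "no" */
    return 0;
}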

This is slightly off-topic, but I'd like to know if someone can help
answer.

My whole point is still questioning whether we can unconditionally take bql
during save_setup().

I could have missed something, though, that we want to do in setup()
while avoiding the BQL.  Help needed on figuring this out (and if there
is such a thing, IMHO it'll be worthwhile to put that into a comment on
the save_setup() hook).

Thanks,

-- 
Peter Xu
Re: [PATCH] migration: for snapshots, hold the BQL during setup callbacks
Posted by Juan Quintela 12 months ago
Peter Xu <peterx@redhat.com> wrote:

Hi

[Adding Kevin to the party]

> On Fri, May 05, 2023 at 03:46:52PM +0200, Fiona Ebner wrote:
>> To fix it, ensure that the BQL is held during setup. To avoid changing
>> the behavior for migration too, introduce conditionals for the setup
>> callbacks that need the BQL and only take the lock if it's not already
>> held.
>
> The major complexity of this patch is the "conditionally taking" part.

Yeap.

I don't want that bit.

This is another case of:
- I have a problem
- I will use recursive mutexes to solve it

Now you have two problems O:-)

> Pure question: what is the benefit of not holding BQL always during
> save_setup(), if after all we have this coroutine issue to be solved?

Dunno.

I would like that paolo commented on this.  I "reviewed the code" 10
years ago.  I don't remember at all why we wanted to change that.

> I can understand that it helps us to avoid taking BQL too long, but we'll
> need to take it anyway during ramblock dirty track initializations, and so
> far IIUC it's the major time to be consumed during setup().
>
> Commit message of 9b0950375277467 says, "Only the migration_bitmap_sync()
> call needs the iothread lock". Firstly I think it's also covering
> enablement of dirty tracking:
>
> +    qemu_mutex_lock_iothread();
> +    qemu_mutex_lock_ramlist();
> +    bytes_transferred = 0;
> +    reset_ram_globals();
> +
>      memory_global_dirty_log_start();
>      migration_bitmap_sync();
> +    qemu_mutex_unlock_iothread();
>
> And I think enablement itself can be slow too, maybe even slower than
> migration_bitmap_sync() especially with KVM_DIRTY_LOG_INITIALLY_SET
> supported in the kernel.
>
> Meanwhile I always got confused on why we need to sync the dirty bitmap
> at setup time at all.  Say, what if we drop migration_bitmap_sync() here?
> After all, shouldn't all pages be dirty from the start
> (ram_list_init_bitmaps())?

How do you convince KVM (or the other lists) to start doing dirty
tracking?  Doing a bitmap sync.

And yeap, probably there is a better way of doing it.

> This is slightly off-topic, but I'd like to know if someone can help
> answer.
>
> My whole point is still questioning whether we can unconditionally take bql
> during save_setup().

I agree with you.

> I could have missed something, though, that we want to do in setup()
> while avoiding the BQL.  Help needed on figuring this out (and if there
> is such a thing, IMHO it'll be worthwhile to put that into a comment on
> the save_setup() hook).

I am more towards revert completely
9b0950375277467fd74a9075624477ae43b9bb22

and call it a day.  On migration we don't use coroutines on the sending
side (I mean the migration code, the block layer uses coroutines for
everything/anything).

Paolo, Stefan, any clues on why setup should not run with the BQL?

Later, Juan.
Re: [PATCH] migration: for snapshots, hold the BQL during setup callbacks
Posted by Fiona Ebner 11 months, 2 weeks ago
On 10.05.23 at 08:31, Juan Quintela wrote:
> I am more towards revert completely
> 9b0950375277467fd74a9075624477ae43b9bb22
> 
> and call it a day.  On migration we don't use coroutines on the sending
> side (I mean the migration code, the block layer uses coroutines for
> everything/anything).
> 

So I was preparing v2 for this patch, also taking the BQL during setup
for migration to avoid the conditional locking. But it turns out that
the original issue (i.e. snapshot code running in the vCPU thread) is
not solved completely. (And this is not just the virtio-net issue I ran
into the other time [0].)

This time I ran it without the net device and ended up with a hang [1].
I think what happens is the following, but am not sure:

vCPU thread
calls vm_stop(RUN_STATE_SAVE_VM)

main thread
calls pause_all_vcpus() and starts waiting

vCPU thread
finishes doing the snapshot and calls vm_start()

Now the main thread will never progress, because all_vcpus_paused() was
never true.

Not sure what can be done about that, and not sure if there are other
code paths where a vCPU thread could trigger vm_stop() followed by
vm_start() without stopping itself in between?
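
To make the suspected sequence concrete, here is a deliberately-hanging
toy model (plain pthreads, not QEMU code; all names are stand-ins). The
waiter's predicate, the analogue of all_vcpus_paused(), never becomes
true, because the "vCPU" thread withdraws the pause request before ever
parking itself:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool pause_requested;
static bool vcpu_paused;

static void *vcpu_thread(void *arg)
{
    (void)arg;
    sleep(1);                         /* "finishes doing the snapshot" */
    pthread_mutex_lock(&lock);
    pause_requested = false;          /* vm_start() analogue */
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    /* Back in the run loop the pause request is already gone, so this
     * thread never parks itself and never sets vcpu_paused. */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, vcpu_thread, NULL);

    pthread_mutex_lock(&lock);
    pause_requested = true;           /* pause_all_vcpus() analogue */
    while (!vcpu_paused) {            /* all_vcpus_paused() analogue */
        puts("main: waiting for the vCPU to pause...");
        pthread_cond_wait(&cond, &lock);   /* hangs here forever */
    }
    pthread_mutex_unlock(&lock);
    puts("never reached");
    return 0;
}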

Best Regards,
Fiona

[0]: https://lists.nongnu.org/archive/html/qemu-devel/2023-05/msg00140.html

[1]:

> Thread 1 (Thread 0x7ffff359e2c0 (LWP 121263) "qemu-system-x86"):
> #0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x555556a1f868 <qemu_pause_cond+40>) at ./nptl/futex-internal.c:57
> #1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x555556a1f868 <qemu_pause_cond+40>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at ./nptl/futex-internal.c:87
> #2  0x00007ffff62ded9b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555556a1f868 <qemu_pause_cond+40>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
> #3  0x00007ffff62e13f8 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x555556a1f7a0 <qemu_global_mutex>, cond=0x555556a1f840 <qemu_pause_cond>) at ./nptl/pthread_cond_wait.c:503
> #4  ___pthread_cond_wait (cond=0x555556a1f840 <qemu_pause_cond>, mutex=0x555556a1f7a0 <qemu_global_mutex>) at ./nptl/pthread_cond_wait.c:618
> #5  0x000055555605ef9d in qemu_cond_wait_impl (cond=0x555556a1f840 <qemu_pause_cond>, mutex=0x555556a1f7a0 <qemu_global_mutex>, file=0x5555561d483b "../softmmu/cpus.c", line=573) at ../util/qemu-thread-posix.c:225
> #6  0x0000555555b36891 in pause_all_vcpus () at ../softmmu/cpus.c:573
> #7  0x0000555555b35ef3 in do_vm_stop (state=RUN_STATE_SAVE_VM, send_stop=true) at ../softmmu/cpus.c:262
> #8  0x0000555555b36cab in vm_stop (state=RUN_STATE_SAVE_VM) at ../softmmu/cpus.c:676
> #9  0x0000555555b443ed in main_loop_should_exit (status=0x7fffffffd944) at ../softmmu/runstate.c:723
> #10 0x0000555555b4443e in qemu_main_loop () at ../softmmu/runstate.c:735
> #11 0x0000555555e57ec3 in qemu_default_main () at ../softmmu/main.c:37
> #12 0x0000555555e57ef9 in main (argc=56, argv=0x7fffffffdaa8) at ../softmmu/main.c:48
> (gdb) bt
> #0  resume_all_vcpus () at ../softmmu/cpus.c:592
> #1  0x0000555555b36d85 in vm_start () at ../softmmu/cpus.c:724
> #2  0x0000555555b932ee in save_snapshot (name=0x7fff5c197340 "snap0", overwrite=false, vmstate=0x7fff5c302ca0 "sata0", has_devices=true, devices=0x7fff5c1968d0, errp=0x7fff5c39aec8) at ../migration/savevm.c:2992
> #3  0x0000555555b93c49 in snapshot_save_job_bh (opaque=0x7fff5c39ae00) at ../migration/savevm.c:3255
> #4  0x000055555607733f in aio_bh_call (bh=0x7fff5c1f9560) at ../util/async.c:169
> #5  0x000055555607745a in aio_bh_poll (ctx=0x555556b806b0) at ../util/async.c:216
> #6  0x000055555605a95c in aio_poll (ctx=0x555556b806b0, blocking=true) at ../util/aio-posix.c:732
> #7  0x0000555555ea7e70 in bdrv_poll_co (s=0x7ffff0ee7e50) at /home/febner/repos/qemu/block/block-gen.h:43
> #8  0x0000555555eaaa76 in blk_pwrite (blk=0x555556df79a0, offset=189440, bytes=512, buf=0x7fff6322e400, flags=0) at block/block-gen.c:1754
> #9  0x00005555559077c9 in pflash_update (pfl=0x555556dda2e0, offset=189440, size=1) at ../hw/block/pflash_cfi01.c:394
> #10 0x0000555555907ecd in pflash_write (pfl=0x555556dda2e0, offset=189507, value=114, width=1, be=0) at ../hw/block/pflash_cfi01.c:522
> #11 0x0000555555908479 in pflash_mem_write_with_attrs (opaque=0x555556dda2e0, addr=189507, value=114, len=1, attrs=...) at ../hw/block/pflash_cfi01.c:681
> #12 0x0000555555da3c17 in memory_region_write_with_attrs_accessor (mr=0x555556dda6a0, addr=189507, value=0x7ffff0ee8068, size=1, shift=0, mask=255, attrs=...) at ../softmmu/memory.c:514
> #13 0x0000555555da3e33 in access_with_adjusted_size (addr=189507, value=0x7ffff0ee8068, size=1, access_size_min=1, access_size_max=4, access_fn=0x555555da3b1b <memory_region_write_with_attrs_accessor>, mr=0x555556dda6a0, attrs=...) at ../softmmu/memory.c:569
> #14 0x0000555555da711b in memory_region_dispatch_write (mr=0x555556dda6a0, addr=189507, data=114, op=MO_8, attrs=...) at ../softmmu/memory.c:1540
> #15 0x0000555555db45dc in flatview_write_continue (fv=0x555557f53f80, addr=4290962499, attrs=..., ptr=0x7ffff7fbc028, len=1, addr1=189507, l=1, mr=0x555556dda6a0) at ../softmmu/physmem.c:2641
> #16 0x0000555555db473f in flatview_write (fv=0x555557f53f80, addr=4290962499, attrs=..., buf=0x7ffff7fbc028, len=1) at ../softmmu/physmem.c:2683
> #17 0x0000555555db4aef in address_space_write (as=0x555556a27780 <address_space_memory>, addr=4290962499, attrs=..., buf=0x7ffff7fbc028, len=1) at ../softmmu/physmem.c:2779
> #18 0x0000555555db4b5c in address_space_rw (as=0x555556a27780 <address_space_memory>, addr=4290962499, attrs=..., buf=0x7ffff7fbc028, len=1, is_write=true) at ../softmmu/physmem.c:2789
> #19 0x0000555555e4869b in kvm_cpu_exec (cpu=0x555556f2b050) at ../accel/kvm/kvm-all.c:3036
> #20 0x0000555555e4b40e in kvm_vcpu_thread_fn (arg=0x555556f2b050) at ../accel/kvm/kvm-accel-ops.c:51
> #21 0x000055555605f7e5 in qemu_thread_start (args=0x555556e0f0f0) at ../util/qemu-thread-posix.c:541
> #22 0x00007ffff62e1fd4 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
> #23 0x00007ffff63625bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
After continuing, the main thread still waits:

> (gdb) bt
> #0  __futex_abstimed_wait_common64 (private=0, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x555556a1f868 <qemu_pause_cond+40>) at ./nptl/futex-internal.c:57
> #1  __futex_abstimed_wait_common (futex_word=futex_word@entry=0x555556a1f868 <qemu_pause_cond+40>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0, cancel=cancel@entry=true) at ./nptl/futex-internal.c:87
> #2  0x00007ffff62ded9b in __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x555556a1f868 <qemu_pause_cond+40>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=0) at ./nptl/futex-internal.c:139
> #3  0x00007ffff62e13f8 in __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x555556a1f7a0 <qemu_global_mutex>, cond=0x555556a1f840 <qemu_pause_cond>) at ./nptl/pthread_cond_wait.c:503
> #4  ___pthread_cond_wait (cond=0x555556a1f840 <qemu_pause_cond>, mutex=0x555556a1f7a0 <qemu_global_mutex>) at ./nptl/pthread_cond_wait.c:618
> #5  0x000055555605ef9d in qemu_cond_wait_impl (cond=0x555556a1f840 <qemu_pause_cond>, mutex=0x555556a1f7a0 <qemu_global_mutex>, file=0x5555561d483b "../softmmu/cpus.c", line=573) at ../util/qemu-thread-posix.c:225
> #6  0x0000555555b36891 in pause_all_vcpus () at ../softmmu/cpus.c:573
> #7  0x0000555555b35ef3 in do_vm_stop (state=RUN_STATE_SAVE_VM, send_stop=true) at ../softmmu/cpus.c:262
> #8  0x0000555555b36cab in vm_stop (state=RUN_STATE_SAVE_VM) at ../softmmu/cpus.c:676
> #9  0x0000555555b443ed in main_loop_should_exit (status=0x7fffffffd944) at ../softmmu/runstate.c:723
> #10 0x0000555555b4443e in qemu_main_loop () at ../softmmu/runstate.c:735
> #11 0x0000555555e57ec3 in qemu_default_main () at ../softmmu/main.c:37
> #12 0x0000555555e57ef9 in main (argc=56, argv=0x7fffffffdaa8) at ../softmmu/main.c:48
Re: [PATCH] migration: for snapshots, hold the BQL during setup callbacks
Posted by Peter Xu 11 months, 3 weeks ago
On Wed, May 10, 2023 at 08:31:13AM +0200, Juan Quintela wrote:
> Peter Xu <peterx@redhat.com> wrote:
> 
> Hi
> 
> [Adding Kevin to the party]
> 
> > On Fri, May 05, 2023 at 03:46:52PM +0200, Fiona Ebner wrote:
> >> To fix it, ensure that the BQL is held during setup. To avoid changing
> >> the behavior for migration too, introduce conditionals for the setup
> >> callbacks that need the BQL and only take the lock if it's not already
> >> held.
> >
> > The major complexity of this patch is the "conditionally taking" part.
> 
> Yeap.
> 
> I don't want that bit.
> 
> This is another case of:
> - I have a problem
> - I will use recursive mutexes to solve it
> 
> Now you have two problems O:-)
> 
> > Pure question: what is the benefit of not holding BQL always during
> > save_setup(), if after all we have this coroutine issue to be solved?
> 
> Dunno.
> 
> I would like that paolo commented on this.  I "reviewed the code" 10
> years ago.  I don't remember at all why we wanted to change that.
> 
> > I can understand that it helps us to avoid taking BQL too long, but we'll
> > need to take it anyway during ramblock dirty track initializations, and so
> > far IIUC it's the major time to be consumed during setup().
> >
> > Commit message of 9b0950375277467 says, "Only the migration_bitmap_sync()
> > call needs the iothread lock". Firstly I think it's also covering
> > enablement of dirty tracking:
> >
> > +    qemu_mutex_lock_iothread();
> > +    qemu_mutex_lock_ramlist();
> > +    bytes_transferred = 0;
> > +    reset_ram_globals();
> > +
> >      memory_global_dirty_log_start();
> >      migration_bitmap_sync();
> > +    qemu_mutex_unlock_iothread();
> >
> > And I think enablement itself can be slow too, maybe even slower than
> > migration_bitmap_sync() especially with KVM_DIRTY_LOG_INITIALLY_SET
> > supported in the kernel.
> >
> > Meanwhile I always got confused on why we need to sync the dirty bitmap
> > at setup time at all.  Say, what if we drop migration_bitmap_sync() here?
> > After all, shouldn't all pages be dirty from the start
> > (ram_list_init_bitmaps())?
> 
> How do you convince KVM (or the other lists) to start doing dirty
> tracking?  Doing a bitmap sync.

I think memory_global_dirty_log_start() kicks off the tracking already.

Take KVM as an example: normally KVM_MEM_LOG_DIRTY_PAGES is set there,
and then ioctl(KVM_SET_USER_MEMORY_REGION) will start dirty tracking for
all of the guest memory slots necessary (including write-protecting all
pages).

KVM_DIRTY_LOG_INITIALLY_SET is slightly special: it skips that procedure
during ioctl(KVM_SET_USER_MEMORY_REGION), but that also means the kernel
assumes the userspace app (QEMU) marked all pages dirty initially (always
the case for QEMU, I think).  Hence in this case the sync doesn't help
either, because we'll simply get no new dirty bits in this shot.
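
For reference, here is a standalone sketch (illustrative only, minimal
error handling, not QEMU's memory listener code) of the raw KVM ioctl
sequence that corresponds to turning dirty tracking on for a single
memory slot:

#include <fcntl.h>
#include <linux/kvm.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void)
{
    int kvm = open("/dev/kvm", O_RDWR);
    int vm = ioctl(kvm, KVM_CREATE_VM, 0);
    size_t size = 2 << 20;   /* 2 MiB of "guest" memory */
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (kvm < 0 || vm < 0 || mem == MAP_FAILED) {
        perror("setup");
        return 1;
    }

    struct kvm_userspace_memory_region region = {
        .slot = 0,
        .guest_phys_addr = 0,
        .memory_size = size,
        .userspace_addr = (unsigned long)mem,
    };
    /* Register the slot without dirty logging (normal runtime state). */
    if (ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region) < 0) {
        perror("KVM_SET_USER_MEMORY_REGION");
        return 1;
    }

    /*
     * Enable dirty tracking: re-register the same slot with
     * KVM_MEM_LOG_DIRTY_PAGES set.  Without KVM_DIRTY_LOG_INITIALLY_SET
     * (a mode of KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2), this is where the
     * kernel write-protects every page of the slot, which is why the
     * enablement itself can be slow for large guests.
     */
    region.flags = KVM_MEM_LOG_DIRTY_PAGES;
    if (ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region) < 0) {
        perror("enable dirty logging");
        return 1;
    }

    puts("dirty logging enabled for slot 0");
    return 0;
}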

> 
> And yeap, probably there is a better way of doing it.
> 
> > This is slightly off-topic, but I'd like to know if someone can help
> > answer.
> >
> > My whole point is still questioning whether we can unconditionally take bql
> > during save_setup().
> 
> I agree with you.
> 
> > I could have missed something, though, that we want to do in setup()
> > while avoiding the BQL.  Help needed on figuring this out (and if there
> > is such a thing, IMHO it'll be worthwhile to put that into a comment on
> > the save_setup() hook).
> 
> I am more towards revert completely
> 9b0950375277467fd74a9075624477ae43b9bb22
> 
> and call it a day.  On migration we don't use coroutines on the sending
> side (I mean the migration code, the block layer uses coroutines for
> everything/anything).
> 
> Paolo, Stefan, any clues on why setup should not run with the BQL?
> 
> Later, Juan.
> 

-- 
Peter Xu