[RFC PATCH] migration: reduce time of loading non-iterable vmstate

Chuang Xu posted 1 patch 1 year, 5 months ago
git fetch https://github.com/patchew-project/qemu tags/patchew/20221118083648.2399615-1-xuchuangxclwt@bytedance.com
Maintainers: Juan Quintela <quintela@redhat.com>, "Dr. David Alan Gilbert" <dgilbert@redhat.com>
There is a newer version of this series
[RFC PATCH] migration: reduce time of loading non-iterable vmstate
Posted by Chuang Xu 1 year, 5 months ago
The duration of loading non-iterable vmstate accounts for a significant
portion of downtime (starting with the timestamp of source qemu stop and
ending with the timestamp of target qemu start). Most of the time is spent
committing memory region changes repeatedly.

This patch packs all the changes to memory regions during the period of
loading non-iterable vmstate into a single memory transaction. As the number
of devices grows, the improvement from this patch grows with it.

Here are the test results:
test vm info:
- 32 CPUs 128GB RAM
- 8 16-queue vhost-net devices
- 16 4-queue vhost-user-blk devices.

	time of loading non-iterable vmstate
before		about 210 ms
after		about 40 ms

Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>
---
 migration/migration.c | 1 +
 migration/migration.h | 2 ++
 migration/savevm.c    | 8 ++++++++
 3 files changed, 11 insertions(+)

diff --git a/migration/migration.c b/migration/migration.c
index e6f8bc2478..ed20704552 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -224,6 +224,7 @@ void migration_object_init(void)
     qemu_sem_init(&current_incoming->postcopy_pause_sem_fast_load, 0);
     qemu_mutex_init(&current_incoming->page_request_mutex);
     current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
+    current_incoming->start_pack_mr_change = false;
 
     migration_object_check(current_migration, &error_fatal);
 
diff --git a/migration/migration.h b/migration/migration.h
index 58b245b138..86597f5feb 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -186,6 +186,8 @@ struct MigrationIncomingState {
      * contains valid information.
      */
     QemuMutex page_request_mutex;
+
+    bool start_pack_mr_change;
 };
 
 MigrationIncomingState *migration_incoming_get_current(void);
diff --git a/migration/savevm.c b/migration/savevm.c
index 48e85c052c..a073009a74 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -2630,6 +2630,12 @@ retry:
         switch (section_type) {
         case QEMU_VM_SECTION_START:
         case QEMU_VM_SECTION_FULL:
+            /* call memory_region_transaction_begin() before loading non-iterable vmstate */
+            if (section_type == QEMU_VM_SECTION_FULL && !mis->start_pack_mr_change) {
+                memory_region_transaction_begin();
+                mis->start_pack_mr_change = true;
+            }
+
             ret = qemu_loadvm_section_start_full(f, mis);
             if (ret < 0) {
                 goto out;
@@ -2650,6 +2656,8 @@ retry:
             }
             break;
         case QEMU_VM_EOF:
+            /* call memory_region_transaction_commit() after loading non-iterable vmstate */
+            memory_region_transaction_commit();
             /* This is the end of migration */
             goto out;
         default:
-- 
2.20.1
Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
Posted by Peter Xu 1 year, 5 months ago
On Fri, Nov 18, 2022 at 04:36:48PM +0800, Chuang Xu wrote:
> The duration of loading non-iterable vmstate accounts for a significant
> portion of downtime (starting with the timestamp of source qemu stop and
> ending with the timestamp of target qemu start). Most of the time is spent
> committing memory region changes repeatedly.
> 
> This patch packs all the changes to memory region during the period of
> loading non-iterable vmstate in a single memory transaction. With the
> increase of devices, this patch will greatly improve the performance.
> 
> Here are the test results:
> test vm info:
> - 32 CPUs 128GB RAM
> - 8 16-queue vhost-net device
> - 16 4-queue vhost-user-blk device.
> 
> 	time of loading non-iterable vmstate
> before		about 210 ms
> after		about 40 ms
> 
> Signed-off-by: Chuang Xu <xuchuangxclwt@bytedance.com>

This is an interesting idea..  I think it means at least the address space
operations will all be messed up if happening during the precopy loading
progress, but I don't directly see its happening either.  For example, in
most post_load()s of vmsd I think the devices should just write directly to
its buffers, accessing MRs directly, even if they want DMAs or just update
fields to correct states.  Even so, I'm not super confident that holds
true, not to mention any other side effects (e.g., would we release bql
during precopy for any reason?).

Copy Paolo and PeterM for some extra eyes.

> ---
>  migration/migration.c | 1 +
>  migration/migration.h | 2 ++
>  migration/savevm.c    | 8 ++++++++
>  3 files changed, 11 insertions(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index e6f8bc2478..ed20704552 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -224,6 +224,7 @@ void migration_object_init(void)
>      qemu_sem_init(&current_incoming->postcopy_pause_sem_fast_load, 0);
>      qemu_mutex_init(&current_incoming->page_request_mutex);
>      current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
> +    current_incoming->start_pack_mr_change = false;
>  
>      migration_object_check(current_migration, &error_fatal);
>  
> diff --git a/migration/migration.h b/migration/migration.h
> index 58b245b138..86597f5feb 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -186,6 +186,8 @@ struct MigrationIncomingState {
>       * contains valid information.
>       */
>      QemuMutex page_request_mutex;
> +
> +    bool start_pack_mr_change;
>  };
>  
>  MigrationIncomingState *migration_incoming_get_current(void);
> diff --git a/migration/savevm.c b/migration/savevm.c
> index 48e85c052c..a073009a74 100644
> --- a/migration/savevm.c
> +++ b/migration/savevm.c
> @@ -2630,6 +2630,12 @@ retry:
>          switch (section_type) {
>          case QEMU_VM_SECTION_START:
>          case QEMU_VM_SECTION_FULL:
> +            /* call memory_region_transaction_begin() before loading non-iterable vmstate */
> +            if (section_type == QEMU_VM_SECTION_FULL && !mis->start_pack_mr_change) {
> +                memory_region_transaction_begin();
> +                mis->start_pack_mr_change = true;

This is slightly hacky to me.  Can we just wrap the begin/commit inside the
whole qemu_loadvm_state_main() call?
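
Something like this, perhaps (a rough sketch only; the exact placement and
error handling are still to be worked out, and qemu_loadvm_state() as the
caller is just one option):

int qemu_loadvm_state(QEMUFile *f)
{
    MigrationIncomingState *mis = migration_incoming_get_current();
    int ret;

    /* ... header checks and setup elided ... */

    /* Batch every memory region update made while device state loads. */
    memory_region_transaction_begin();

    ret = qemu_loadvm_state_main(f, mis);

    /* Rebuild the flat views once, applying all accumulated changes. */
    memory_region_transaction_commit();

    /* ... cleanup and post-load handling elided ... */
    return ret;
}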

> +            }
> +
>              ret = qemu_loadvm_section_start_full(f, mis);
>              if (ret < 0) {
>                  goto out;
> @@ -2650,6 +2656,8 @@ retry:
>              }
>              break;
>          case QEMU_VM_EOF:
> +            /* call memory_region_transaction_commit() after loading non-iterable vmstate */
> +            memory_region_transaction_commit();
>              /* This is the end of migration */
>              goto out;
>          default:
> -- 
> 2.20.1
> 

-- 
Peter Xu
Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
Posted by Chuang Xu 1 year, 5 months ago
On 2022/11/25 12:40 AM, Peter Xu wrote:
> On Fri, Nov 18, 2022 at 04:36:48PM +0800, Chuang Xu wrote:
>> The duration of loading non-iterable vmstate accounts for a significant
>> portion of downtime (starting with the timestamp of source qemu stop and
>> ending with the timestamp of target qemu start). Most of the time is spent
>> committing memory region changes repeatedly.
>>
>> This patch packs all the changes to memory region during the period of
>> loading non-iterable vmstate in a single memory transaction. With the
>> increase of devices, this patch will greatly improve the performance.
>>
>> Here are the test results:
>> test vm info:
>> - 32 CPUs 128GB RAM
>> - 8 16-queue vhost-net device
>> - 16 4-queue vhost-user-blk device.
>>
>> 	time of loading non-iterable vmstate
>> before		about 210 ms
>> after		about 40 ms
>>
>> Signed-off-by: Chuang Xu<xuchuangxclwt@bytedance.com>
> This is an interesting idea..  I think it means at least the address space
> operations will all be messed up if happening during the precopy loading

Sorry, I don't quite understand the meaning of "messed up" here.. Maybe I need
more information about how the address space operations will be messed up.

> progress, but I don't directly see its happening either.  For example, in
> most post_load()s of vmsd I think the devices should just write directly to
> its buffers, accessing MRs directly, even if they want DMAs or just update
> fields to correct states.  Even so, I'm not super confident that holds

And I'm not sure whether the "its happening" means "begin/commit happening"
or "messed up happening"? If it's the former, Here are what I observe:
the stage of loading iterable vmstate doesn't call begin/commit, but the
stage of loading noniterable vmstate calls a large amount of begin/commit
in field->info->get() operation. For example:

#0  memory_region_transaction_commit () at ../softmmu/memory.c:1085
#1  0x0000559b6f683523 in pci_update_mappings (d=d@entry=0x7f5cd8682010) at ../hw/pci/pci.c:1361
#2  0x0000559b6f683a1f in get_pci_config_device (f=<optimized out>, pv=0x7f5cd86820a0, size=256, field=<optimized out>) at ../hw/pci/pci.c:545
#3  0x0000559b6f5fcd86 in vmstate_load_state (f=f@entry=0x559b757eb4b0, vmsd=vmsd@entry=0x559b70909d40 <vmstate_pci_device>, opaque=opaque@entry=0x7f5cd8682010, version_id=2)
     at ../migration/vmstate.c:143
#4  0x0000559b6f68466f in pci_device_load (s=s@entry=0x7f5cd8682010, f=f@entry=0x559b757eb4b0) at ../hw/pci/pci.c:664
#5  0x0000559b6f6ad38a in virtio_pci_load_config (d=0x7f5cd8682010, f=0x559b757eb4b0) at ../hw/virtio/virtio-pci.c:181
#6  0x0000559b6f7dfe91 in virtio_load (vdev=0x7f5cd868a1a0, f=0x559b757eb4b0, version_id=1) at ../hw/virtio/virtio.c:3071
#7  0x0000559b6f5fcd86 in vmstate_load_state (f=f@entry=0x559b757eb4b0, vmsd=0x559b709ae260 <vmstate_vhost_user_blk>, opaque=0x7f5cd868a1a0, version_id=1) at ../migration/vmstate.c:143
#8  0x0000559b6f62da48 in vmstate_load (f=0x559b757eb4b0, se=0x559b7591c010) at ../migration/savevm.c:913
#9  0x0000559b6f631334 in qemu_loadvm_section_start_full (mis=0x559b73f1a580, f=0x559b757eb4b0) at ../migration/savevm.c:2741
#10 qemu_loadvm_state_main (f=f@entry=0x559b757eb4b0, mis=mis@entry=0x559b73f1a580) at ../migration/savevm.c:2937
#11 0x0000559b6f632faa in qemu_loadvm_state (f=0x559b757eb4b0) at ../migration/savevm.c:3018
#12 0x0000559b6f6d2ece in process_incoming_migration_co (opaque=<optimized out>) at ../migration/migration.c:574
#13 0x0000559b6f9f9f0b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173
#14 0x00007f5cfeecf000 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x00007fff04a2e8f0 in ?? ()
#16 0x0000000000000000 in ?? ()

> true, not to mention any other side effects (e.g., would we release bql
> during precopy for any reason?).
>
> Copy Paolo and PeterM for some extra eyes.
>
What I observe is that during the loading process, the migration thread will
cond-wait for the vcpu threads to complete tasks such as kvm_apic_post_load,
and the rcu thread will acquire the bql to do the flatview_destroy operation.
So far, I haven't seen side effects from either of these two situations.

>> ---
>>   migration/migration.c | 1 +
>>   migration/migration.h | 2 ++
>>   migration/savevm.c    | 8 ++++++++
>>   3 files changed, 11 insertions(+)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index e6f8bc2478..ed20704552 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -224,6 +224,7 @@ void migration_object_init(void)
>>       qemu_sem_init(&current_incoming->postcopy_pause_sem_fast_load, 0);
>>       qemu_mutex_init(&current_incoming->page_request_mutex);
>>       current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
>> +    current_incoming->start_pack_mr_change = false;
>>   
>>       migration_object_check(current_migration, &error_fatal);
>>   
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 58b245b138..86597f5feb 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -186,6 +186,8 @@ struct MigrationIncomingState {
>>        * contains valid information.
>>        */
>>       QemuMutex page_request_mutex;
>> +
>> +    bool start_pack_mr_change;
>>   };
>>   
>>   MigrationIncomingState *migration_incoming_get_current(void);
>> diff --git a/migration/savevm.c b/migration/savevm.c
>> index 48e85c052c..a073009a74 100644
>> --- a/migration/savevm.c
>> +++ b/migration/savevm.c
>> @@ -2630,6 +2630,12 @@ retry:
>>           switch (section_type) {
>>           case QEMU_VM_SECTION_START:
>>           case QEMU_VM_SECTION_FULL:
>> +            /* call memory_region_transaction_begin() before loading non-iterable vmstate */
>> +            if (section_type == QEMU_VM_SECTION_FULL && !mis->start_pack_mr_change) {
>> +                memory_region_transaction_begin();
>> +                mis->start_pack_mr_change = true;
> This is slightly hacky to me.  Can we just wrap the begin/commit inside the
> whole qemu_loadvm_state_main() call?

The iterative copy phase doesn't call begin/commit, so there seems to be no
essential difference between the two approaches. I did try wrapping the
begin/commit around the whole qemu_loadvm_state_main() call, and that also
worked well. But calling begin/commit only before/after the period of loading
non-iterable vmstate may have fewer unknown side effects?

>
>> +            }
>> +
>>               ret = qemu_loadvm_section_start_full(f, mis);
>>               if (ret < 0) {
>>                   goto out;
>> @@ -2650,6 +2656,8 @@ retry:
>>               }
>>               break;
>>           case QEMU_VM_EOF:
>> +            /* call memory_region_transaction_commit() after loading non-iterable vmstate */
>> +            memory_region_transaction_commit();
>>               /* This is the end of migration */
>>               goto out;
>>           default:
>> -- 
>> 2.20.1
>>
Peter, thanks a lot for your advice! Hoping for more suggestions from you!
Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
Posted by Peter Xu 1 year, 5 months ago
On Mon, Nov 28, 2022 at 05:42:43PM +0800, Chuang Xu wrote:
> 
> On 2022/11/25 12:40 AM, Peter Xu wrote:
> > On Fri, Nov 18, 2022 at 04:36:48PM +0800, Chuang Xu wrote:
> > > The duration of loading non-iterable vmstate accounts for a significant
> > > portion of downtime (starting with the timestamp of source qemu stop and
> > > ending with the timestamp of target qemu start). Most of the time is spent
> > > committing memory region changes repeatedly.
> > > 
> > > This patch packs all the changes to memory region during the period of
> > > loading non-iterable vmstate in a single memory transaction. With the
> > > increase of devices, this patch will greatly improve the performance.
> > > 
> > > Here are the test results:
> > > test vm info:
> > > - 32 CPUs 128GB RAM
> > > - 8 16-queue vhost-net device
> > > - 16 4-queue vhost-user-blk device.
> > > 
> > > 	time of loading non-iterable vmstate
> > > before		about 210 ms
> > > after		about 40 ms
> > > 
> > > Signed-off-by: Chuang Xu<xuchuangxclwt@bytedance.com>
> > This is an interesting idea..  I think it means at least the address space
> > operations will all be messed up if happening during the precopy loading
> 
> Sorry, I don't quite understand the meaning of "messed up" here.. Maybe I need
> more information about how the address space operations will be messed up.

AFAIK the major thing we do during commit of memory regions is to apply the
memory region changes to the rest (flatviews, or ioeventfds); basically it
makes everything match the new memory region layout.

If we allow memory region commit to be postponed for the whole loading
process, it means at least from flat view pov any further things like:

  address_space_write(&address_space_memory, ...)

could write to the wrong places, because the flat views are not updated.
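
As a contrived example (hypothetical device and fields, nothing from this
patch), a post_load like the following would resolve its DMA address through
that stale flat view:

/* Hypothetical post_load: address_space_write() translates the address
 * via the current flat view, so if the batched-but-uncommitted MR
 * changes moved anything, the data can land in the wrong place. */
static int my_device_post_load(void *opaque, int version_id)
{
    MyDeviceState *s = opaque;

    address_space_write(&address_space_memory, s->dma_addr,
                        MEMTXATTRS_UNSPECIFIED, s->buf, sizeof(s->buf));
    return 0;
}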

> 
> > progress, but I don't directly see its happening either.  For example, in
> > most post_load()s of vmsd I think the devices should just write directly to
> > its buffers, accessing MRs directly, even if they want DMAs or just update
> > fields to correct states.  Even so, I'm not super confident that holds
> 
> And I'm not sure whether the "its happening" means "begin/commit happening"
> or "messed up happening"? If it's the former, Here are what I observe:
> the stage of loading iterable vmstate doesn't call begin/commit, but the
> stage of loading noniterable vmstate calls a large amount of begin/commit
> in field->info->get() operation. For example:
> 
> #0  memory_region_transaction_commit () at ../softmmu/memory.c:1085
> #1  0x0000559b6f683523 in pci_update_mappings (d=d@entry=0x7f5cd8682010) at ../hw/pci/pci.c:1361
> #2  0x0000559b6f683a1f in get_pci_config_device (f=<optimized out>, pv=0x7f5cd86820a0, size=256, field=<optimized out>) at ../hw/pci/pci.c:545
> #3  0x0000559b6f5fcd86 in vmstate_load_state (f=f@entry=0x559b757eb4b0, vmsd=vmsd@entry=0x559b70909d40 <vmstate_pci_device>, opaque=opaque@entry=0x7f5cd8682010, version_id=2)
>     at ../migration/vmstate.c:143
> #4  0x0000559b6f68466f in pci_device_load (s=s@entry=0x7f5cd8682010, f=f@entry=0x559b757eb4b0) at ../hw/pci/pci.c:664
> #5  0x0000559b6f6ad38a in virtio_pci_load_config (d=0x7f5cd8682010, f=0x559b757eb4b0) at ../hw/virtio/virtio-pci.c:181
> #6  0x0000559b6f7dfe91 in virtio_load (vdev=0x7f5cd868a1a0, f=0x559b757eb4b0, version_id=1) at ../hw/virtio/virtio.c:3071
> #7  0x0000559b6f5fcd86 in vmstate_load_state (f=f@entry=0x559b757eb4b0, vmsd=0x559b709ae260 <vmstate_vhost_user_blk>, opaque=0x7f5cd868a1a0, version_id=1) at ../migration/vmstate.c:143
> #8  0x0000559b6f62da48 in vmstate_load (f=0x559b757eb4b0, se=0x559b7591c010) at ../migration/savevm.c:913
> #9  0x0000559b6f631334 in qemu_loadvm_section_start_full (mis=0x559b73f1a580, f=0x559b757eb4b0) at ../migration/savevm.c:2741
> #10 qemu_loadvm_state_main (f=f@entry=0x559b757eb4b0, mis=mis@entry=0x559b73f1a580) at ../migration/savevm.c:2937
> #11 0x0000559b6f632faa in qemu_loadvm_state (f=0x559b757eb4b0) at ../migration/savevm.c:3018
> #12 0x0000559b6f6d2ece in process_incoming_migration_co (opaque=<optimized out>) at ../migration/migration.c:574
> #13 0x0000559b6f9f9f0b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173
> #14 0x00007f5cfeecf000 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #15 0x00007fff04a2e8f0 in ?? ()
> #16 0x0000000000000000 in ?? ()
> 
> > true, not to mention any other side effects (e.g., would we release bql
> > during precopy for any reason?).
> > 
> > Copy Paolo and PeterM for some extra eyes.
> > 
> What I observe is that during the loading process, migration thread will call Condwait to
> wait for the vcpu threads to complete tasks, such as kvm_apic_post_load, and rcu thread
> will acquire the bql to do the flatview_destroy operation. So far, I haven't seen the
> side effects of these two situations.

Yes that's something I'd worry about.

The current memory API should be defined as: when we release the bql, we
should guarantee the memory layout is persistent and there are no pending
transactions.  I used to have a patchset just for that, because violating
that rule is prone to very weird bugs:

https://lore.kernel.org/all/20210728183151.195139-8-peterx@redhat.com/

One example report that was caused by wrongly releasing bql and you can
have a feeling of it by the stack dumped (after above patchset applied):

https://lore.kernel.org/qemu-devel/CH0PR02MB7898BBD73D0F3F7D5003BB178BE19@CH0PR02MB7898.namprd02.prod.outlook.com/

That said, it's not exactly the case here, since the bql is not released
during a memory commit phase, so there is probably no immediate problem, as
the rcu thread will just ignore any updates yet to be committed.  It might be
safe to do it like that (while making sure no vcpu is running), but it's
worth serious thought.

As a start, maybe you can try to poison address_space_to_flatview() (by
e.g. checking the start_pack_mr_change flag and asserting it is not set)
during this process, to see whether any call stack can even try to
dereference a flatview.

It's just that I haven't figured out a good way to "prove" its validity,
even though I think this is an interesting idea worth pursuing to shrink the
downtime.

> 
> > > ---
> > >   migration/migration.c | 1 +
> > >   migration/migration.h | 2 ++
> > >   migration/savevm.c    | 8 ++++++++
> > >   3 files changed, 11 insertions(+)
> > > 
> > > diff --git a/migration/migration.c b/migration/migration.c
> > > index e6f8bc2478..ed20704552 100644
> > > --- a/migration/migration.c
> > > +++ b/migration/migration.c
> > > @@ -224,6 +224,7 @@ void migration_object_init(void)
> > >       qemu_sem_init(&current_incoming->postcopy_pause_sem_fast_load, 0);
> > >       qemu_mutex_init(&current_incoming->page_request_mutex);
> > >       current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
> > > +    current_incoming->start_pack_mr_change = false;
> > >       migration_object_check(current_migration, &error_fatal);
> > > diff --git a/migration/migration.h b/migration/migration.h
> > > index 58b245b138..86597f5feb 100644
> > > --- a/migration/migration.h
> > > +++ b/migration/migration.h
> > > @@ -186,6 +186,8 @@ struct MigrationIncomingState {
> > >        * contains valid information.
> > >        */
> > >       QemuMutex page_request_mutex;
> > > +
> > > +    bool start_pack_mr_change;
> > >   };
> > >   MigrationIncomingState *migration_incoming_get_current(void);
> > > diff --git a/migration/savevm.c b/migration/savevm.c
> > > index 48e85c052c..a073009a74 100644
> > > --- a/migration/savevm.c
> > > +++ b/migration/savevm.c
> > > @@ -2630,6 +2630,12 @@ retry:
> > >           switch (section_type) {
> > >           case QEMU_VM_SECTION_START:
> > >           case QEMU_VM_SECTION_FULL:
> > > +            /* call memory_region_transaction_begin() before loading non-iterable vmstate */
> > > +            if (section_type == QEMU_VM_SECTION_FULL && !mis->start_pack_mr_change) {
> > > +                memory_region_transaction_begin();
> > > +                mis->start_pack_mr_change = true;
> > This is slightly hacky to me.  Can we just wrap the begin/commit inside the
> > whole qemu_loadvm_state_main() call?
> 
> The iterative copy phase doesn't call begin/commit, so There seems to be no essential
> difference between these two codes. I did try to wrap the begin/commit inside the whole
> qemu_loadvm_state_main() call, this way also worked well.
> But only calling begin/commit before/after the period of loading non-iterable vmstate may
> have less unkown side effect?

I don't worry much about the iterative migration phase, because it should be
mostly pure data movement, unless I'm missing something important.  Wrapping
qemu_loadvm_state_main() can avoid the flag completely, and avoid the special
treatment of these migration-internal flags, which is hacky, imo.

> 
> > 
> > > +            }
> > > +
> > >               ret = qemu_loadvm_section_start_full(f, mis);
> > >               if (ret < 0) {
> > >                   goto out;
> > > @@ -2650,6 +2656,8 @@ retry:
> > >               }
> > >               break;
> > >           case QEMU_VM_EOF:
> > > +            /* call memory_region_transaction_commit() after loading non-iterable vmstate */
> > > +            memory_region_transaction_commit();
> > >               /* This is the end of migration */
> > >               goto out;
> > >           default:
> > > -- 
> > > 2.20.1
> > > 
> Peter, Thanks a lot for your advice! Hope for more suggestions from you!

-- 
Peter Xu


Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
Posted by Chuang Xu 1 year, 4 months ago
Peter, I'm sorry I didn't reply to your email in time, because I was busy with
other work last week. Here is my latest progress.

On 2022/11/29 1:41 AM, Peter Xu wrote:
> On Mon, Nov 28, 2022 at 05:42:43PM +0800, Chuang Xu wrote:
>> On 2022/11/25 12:40 AM, Peter Xu wrote:
>>> On Fri, Nov 18, 2022 at 04:36:48PM +0800, Chuang Xu wrote:
>>>> The duration of loading non-iterable vmstate accounts for a significant
>>>> portion of downtime (starting with the timestamp of source qemu stop and
>>>> ending with the timestamp of target qemu start). Most of the time is spent
>>>> committing memory region changes repeatedly.
>>>>
>>>> This patch packs all the changes to memory region during the period of
>>>> loading non-iterable vmstate in a single memory transaction. With the
>>>> increase of devices, this patch will greatly improve the performance.
>>>>
>>>> Here are the test results:
>>>> test vm info:
>>>> - 32 CPUs 128GB RAM
>>>> - 8 16-queue vhost-net device
>>>> - 16 4-queue vhost-user-blk device.
>>>>
>>>> 	time of loading non-iterable vmstate
>>>> before		about 210 ms
>>>> after		about 40 ms
>>>>
>>>> Signed-off-by: Chuang Xu<xuchuangxclwt@bytedance.com>
>>> This is an interesting idea..  I think it means at least the address space
>>> operations will all be messed up if happening during the precopy loading
>> Sorry, I don't quite understand the meaning of "messed up" here.. Maybe I need
>> more information about how the address space operations will be messed up.
> AFAIK the major thing we do during commit of memory regions is to apply the
> memory region changes to the rest (flatviews, or ioeventfds), basically it
> makes everything matching with the new memory region layout.
>
> If we allow memory region commit to be postponed for the whole loading
> process, it means at least from flat view pov any further things like:
>
>    address_space_write(&address_space_memory, ...)
>
> Could write to wrong places because the flat views are not updated.

I have tested migration on normal qemu and optimized qemu repeatedly, and so
far I haven't traced any other operation on the target qemu's MRs (such as
address_space_write...) happening.

>>> progress, but I don't directly see its happening either.  For example, in
>>> most post_load()s of vmsd I think the devices should just write directly to
>>> its buffers, accessing MRs directly, even if they want DMAs or just update
>>> fields to correct states.  Even so, I'm not super confident that holds
>> And I'm not sure whether the "its happening" means "begin/commit happening"
>> or "messed up happening"? If it's the former, Here are what I observe:
>> the stage of loading iterable vmstate doesn't call begin/commit, but the
>> stage of loading noniterable vmstate calls a large amount of begin/commit
>> in field->info->get() operation. For example:
>>
>> #0  memory_region_transaction_commit () at ../softmmu/memory.c:1085
>> #1  0x0000559b6f683523 in pci_update_mappings (d=d@entry=0x7f5cd8682010) at ../hw/pci/pci.c:1361
>> #2  0x0000559b6f683a1f in get_pci_config_device (f=<optimized out>, pv=0x7f5cd86820a0, size=256, field=<optimized out>) at ../hw/pci/pci.c:545
>> #3  0x0000559b6f5fcd86 in vmstate_load_state (f=f@entry=0x559b757eb4b0, vmsd=vmsd@entry=0x559b70909d40 <vmstate_pci_device>, opaque=opaque@entry=0x7f5cd8682010, version_id=2)
>>      at ../migration/vmstate.c:143
>> #4  0x0000559b6f68466f in pci_device_load (s=s@entry=0x7f5cd8682010, f=f@entry=0x559b757eb4b0) at ../hw/pci/pci.c:664
>> #5  0x0000559b6f6ad38a in virtio_pci_load_config (d=0x7f5cd8682010, f=0x559b757eb4b0) at ../hw/virtio/virtio-pci.c:181
>> #6  0x0000559b6f7dfe91 in virtio_load (vdev=0x7f5cd868a1a0, f=0x559b757eb4b0, version_id=1) at ../hw/virtio/virtio.c:3071
>> #7  0x0000559b6f5fcd86 in vmstate_load_state (f=f@entry=0x559b757eb4b0, vmsd=0x559b709ae260 <vmstate_vhost_user_blk>, opaque=0x7f5cd868a1a0, version_id=1) at ../migration/vmstate.c:143
>> #8  0x0000559b6f62da48 in vmstate_load (f=0x559b757eb4b0, se=0x559b7591c010) at ../migration/savevm.c:913
>> #9  0x0000559b6f631334 in qemu_loadvm_section_start_full (mis=0x559b73f1a580, f=0x559b757eb4b0) at ../migration/savevm.c:2741
>> #10 qemu_loadvm_state_main (f=f@entry=0x559b757eb4b0, mis=mis@entry=0x559b73f1a580) at ../migration/savevm.c:2937
>> #11 0x0000559b6f632faa in qemu_loadvm_state (f=0x559b757eb4b0) at ../migration/savevm.c:3018
>> #12 0x0000559b6f6d2ece in process_incoming_migration_co (opaque=<optimized out>) at ../migration/migration.c:574
>> #13 0x0000559b6f9f9f0b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173
>> #14 0x00007f5cfeecf000 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #15 0x00007fff04a2e8f0 in ?? ()
>> #16 0x0000000000000000 in ?? ()
>>
>>> true, not to mention any other side effects (e.g., would we release bql
>>> during precopy for any reason?).
>>>
>>> Copy Paolo and PeterM for some extra eyes.
>>>
>> What I observe is that during the loading process, migration thread will call Condwait to
>> wait for the vcpu threads to complete tasks, such as kvm_apic_post_load, and rcu thread
>> will acquire the bql to do the flatview_destroy operation. So far, I haven't seen the
>> side effects of these two situations.
> Yes that's something I'd worry about.
>
> The current memory API should be defined as: when we release the bql we
> should guarantee the memory layout is persistent and no pending
> transactions.  I used to have a patchset just for that because when
> violating that rule it's prone to very weird bugs:
>
> https://lore.kernel.org/all/20210728183151.195139-8-peterx@redhat.com/
>
> One example report that was caused by wrongly releasing bql and you can
> have a feeling of it by the stack dumped (after above patchset applied):
>
> https://lore.kernel.org/qemu-devel/CH0PR02MB7898BBD73D0F3F7D5003BB178BE19@CH0PR02MB7898.namprd02.prod.outlook.com/
>
> Said that, it's not exact the case here since it's not releasing bql during
> a memory commit phase, so probably no immediate problem as rcu thread will
> just ignore any updates to be committed.  It might be safe to do it like
> that (and making sure no vcpu is running), but worth serious thoughts.
>
> As a start, maybe you can try with poison address_space_to_flatview() (by
> e.g. checking the start_pack_mr_change flag and assert it is not set)
> during this process to see whether any call stack can even try to
> dereference a flatview.
>
> It's just that I didn't figure a good way to "prove" its validity, even if
> I think this is an interesting idea worth thinking to shrink the downtime.

Thanks for your suggestions!
I used a thread-local variable to identify whether the current thread is the
migration thread (the main thread of the target qemu), and I modified the code
of qemu_coroutine_switch to make sure the thread-local variable is true only
in the process_incoming_migration_co call stack. If the target qemu detects
that start_pack_mr_change is set and address_space_to_flatview() is called in
a non-migrating thread or non-migrating coroutine, it will crash. I tested
migration lots of times, and there was no crash. Does this prove the validity
to some extent?
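
Roughly, the check looked like this (a sketch of my test instrumentation;
migration_loading_vmstate is an illustrative name for the thread-local
variable):

/* True only inside the process_incoming_migration_co call stack;
 * maintained across qemu_coroutine_switch. */
static __thread bool migration_loading_vmstate;

static inline FlatView *address_space_to_flatview(AddressSpace *as)
{
    MigrationIncomingState *mis = migration_incoming_get_current();

    /* While MR commits are being batched, only the loading coroutine
     * may look at a flat view; anyone else could see a stale map. */
    assert(!mis->start_pack_mr_change || migration_loading_vmstate);
    return qatomic_rcu_read(&as->current_map);
}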

>>>> ---
>>>>    migration/migration.c | 1 +
>>>>    migration/migration.h | 2 ++
>>>>    migration/savevm.c    | 8 ++++++++
>>>>    3 files changed, 11 insertions(+)
>>>>
>>>> diff --git a/migration/migration.c b/migration/migration.c
>>>> index e6f8bc2478..ed20704552 100644
>>>> --- a/migration/migration.c
>>>> +++ b/migration/migration.c
>>>> @@ -224,6 +224,7 @@ void migration_object_init(void)
>>>>        qemu_sem_init(&current_incoming->postcopy_pause_sem_fast_load, 0);
>>>>        qemu_mutex_init(&current_incoming->page_request_mutex);
>>>>        current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
>>>> +    current_incoming->start_pack_mr_change = false;
>>>>        migration_object_check(current_migration, &error_fatal);
>>>> diff --git a/migration/migration.h b/migration/migration.h
>>>> index 58b245b138..86597f5feb 100644
>>>> --- a/migration/migration.h
>>>> +++ b/migration/migration.h
>>>> @@ -186,6 +186,8 @@ struct MigrationIncomingState {
>>>>         * contains valid information.
>>>>         */
>>>>        QemuMutex page_request_mutex;
>>>> +
>>>> +    bool start_pack_mr_change;
>>>>    };
>>>>    MigrationIncomingState *migration_incoming_get_current(void);
>>>> diff --git a/migration/savevm.c b/migration/savevm.c
>>>> index 48e85c052c..a073009a74 100644
>>>> --- a/migration/savevm.c
>>>> +++ b/migration/savevm.c
>>>> @@ -2630,6 +2630,12 @@ retry:
>>>>            switch (section_type) {
>>>>            case QEMU_VM_SECTION_START:
>>>>            case QEMU_VM_SECTION_FULL:
>>>> +            /* call memory_region_transaction_begin() before loading non-iterable vmstate */
>>>> +            if (section_type == QEMU_VM_SECTION_FULL && !mis->start_pack_mr_change) {
>>>> +                memory_region_transaction_begin();
>>>> +                mis->start_pack_mr_change = true;
>>> This is slightly hacky to me.  Can we just wrap the begin/commit inside the
>>> whole qemu_loadvm_state_main() call?
>> The iterative copy phase doesn't call begin/commit, so There seems to be no essential
>> difference between these two codes. I did try to wrap the begin/commit inside the whole
>> qemu_loadvm_state_main() call, this way also worked well.
>> But only calling begin/commit before/after the period of loading non-iterable vmstate may
>> have less unkown side effect?
> I don't worry much on the iterative migration phase, because they should be
> mostly pure data movements unless I miss something important.  Having them
> wrap qemu_loadvm_state_main() can avoid the flag completely and avoid the
> special treatment within these migration internal flags which is hacky, imo.
>
In my latest patch for testing, I wrap the begin/commit inside the whole
qemu_loadvm_state_main() call as you suggested. So far everything works well.

>>>> +            }
>>>> +
>>>>                ret = qemu_loadvm_section_start_full(f, mis);
>>>>                if (ret < 0) {
>>>>                    goto out;
>>>> @@ -2650,6 +2656,8 @@ retry:
>>>>                }
>>>>                break;
>>>>            case QEMU_VM_EOF:
>>>> +            /* call memory_region_transaction_commit() after loading non-iterable vmstate */
>>>> +            memory_region_transaction_commit();
>>>>                /* This is the end of migration */
>>>>                goto out;
>>>>            default:
>>>> -- 
>>>> 2.20.1
>>>>
>> Peter, Thanks a lot for your advice! Hope for more suggestions from you!
Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
Posted by Peter Xu 1 year, 4 months ago
Chuang,

No worry on the delay; you're faster than when I read yours. :)

On Mon, Dec 05, 2022 at 02:56:15PM +0800, Chuang Xu wrote:
> > As a start, maybe you can try with poison address_space_to_flatview() (by
> > e.g. checking the start_pack_mr_change flag and assert it is not set)
> > during this process to see whether any call stack can even try to
> > dereference a flatview.
> > 
> > It's just that I didn't figure a good way to "prove" its validity, even if
> > I think this is an interesting idea worth thinking to shrink the downtime.
> 
> Thanks for your sugguestions!
> I used a thread local variable to identify whether the current thread is a
> migration thread(main thread of target qemu) and I modified the code of
> qemu_coroutine_switch to make sure the thread local variable true only in
> process_incoming_migration_co call stack. If the target qemu detects that
> start_pack_mr_change is set and address_space_to_flatview() is called in
> non-migrating threads or non-migrating coroutine, it will crash.

Are you using the thread var just to avoid the assert triggering in the
migration thread when committing memory changes?

I think _maybe_ another cleaner way to sanity check this is directly upon
the depth:

static inline FlatView *address_space_to_flatview(AddressSpace *as)
{
    /*
     * Before using any flatview, sanity check we're not during a memory
     * region transaction or the map can be invalid.  Note that this can
     * also be called during commit phase of memory transaction, but that
     * should also only happen when the depth decreases to 0 first.
     */
    assert(memory_region_transaction_depth == 0);
    return qatomic_rcu_read(&as->current_map);
}

That should also cover the safe cases of memory transaction commits during
migration.

> I tested migration for lots of times, there was no crash. Does this prove
> the validity to some extent?

Yes, I think so. It's just that if we cannot 100% prove it's safe (e.g. you
cannot cover all the code paths in qemu that migration can trigger), then we
may need a sanity check like the above along with the solution, to make sure
that even if something goes wrong, it won't go wrong as weirdly.

And if we want to try this out, it had better be at the start of a dev cycle,
so we can fix things or revert before the next rc0 releases.

I'm not sure whether that assert might be too strong; we could use an error
instead, but so far I don't see how that can happen, and if it does happen I
feel it's bad enough anyway, so maybe not.  Then AFAICT we can completely
drop start_pack_mr_change with that stronger check.

If you agree with the above, feel free to have two patches in the new
version, making the depth assert a separate patch.  In the meantime, let's
see whether you can get some comments from others.

-- 
Peter Xu
Re: [External] Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
Posted by Chuang Xu 1 year, 4 months ago
On 2022/12/6 12:28 AM, Peter Xu wrote:
> Chuang,
>
> No worry on the delay; you're faster than when I read yours. :)
>
> On Mon, Dec 05, 2022 at 02:56:15PM +0800, Chuang Xu wrote:
>>> As a start, maybe you can try with poison address_space_to_flatview() (by
>>> e.g. checking the start_pack_mr_change flag and assert it is not set)
>>> during this process to see whether any call stack can even try to
>>> dereference a flatview.
>>>
>>> It's just that I didn't figure a good way to "prove" its validity, even if
>>> I think this is an interesting idea worth thinking to shrink the downtime.
>> Thanks for your sugguestions!
>> I used a thread local variable to identify whether the current thread is a
>> migration thread(main thread of target qemu) and I modified the code of
>> qemu_coroutine_switch to make sure the thread local variable true only in
>> process_incoming_migration_co call stack. If the target qemu detects that
>> start_pack_mr_change is set and address_space_to_flatview() is called in
>> non-migrating threads or non-migrating coroutine, it will crash.
> Are you using the thread var just to avoid the assert triggering in the
> migration thread when commiting memory changes?
>
> I think _maybe_ another cleaner way to sanity check this is directly upon
> the depth:
>
> static inline FlatView *address_space_to_flatview(AddressSpace *as)
> {
>      /*
>       * Before using any flatview, sanity check we're not during a memory
>       * region transaction or the map can be invalid.  Note that this can
>       * also be called during commit phase of memory transaction, but that
>       * should also only happen when the depth decreases to 0 first.
>       */
>      assert(memory_region_transaction_depth == 0);
>      return qatomic_rcu_read(&as->current_map);
> }
>
> That should also cover the safe cases of memory transaction commits during
> migration.
>
Peter, I tried it this way and found that the target qemu crashes.

Here is the gdb backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ff2929d851a in __GI_abort () at abort.c:118
#2  0x00007ff2929cfe67 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h",
     line=line@entry=766, function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:92
#3  0x00007ff2929cff12 in __GI___assert_fail (assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h", line=line@entry=766,
     function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:101
#4  0x000055a324b2ed5e in address_space_to_flatview (as=0x55a326132580 <address_space_memory>) at /data00/migration/qemu-5.2.0/include/exec/memory.h:766
#5  0x000055a324e79559 in address_space_to_flatview (as=0x55a326132580 <address_space_memory>) at ../softmmu/memory.c:811
#6  address_space_get_flatview (as=0x55a326132580 <address_space_memory>) at ../softmmu/memory.c:805
#7  0x000055a324e96474 in address_space_cache_init (cache=cache@entry=0x55a32a4fb000, as=<optimized out>, addr=addr@entry=68404985856, len=len@entry=4096, is_write=false) at ../softmmu/physmem.c:3307
#8  0x000055a324ea9cba in virtio_init_region_cache (vdev=0x55a32985d9a0, n=0) at ../hw/virtio/virtio.c:185
#9  0x000055a324eaa615 in virtio_load (vdev=0x55a32985d9a0, f=<optimized out>, version_id=<optimized out>) at ../hw/virtio/virtio.c:3203
#10 0x000055a324c6ab96 in vmstate_load_state (f=f@entry=0x55a329dc0c00, vmsd=0x55a325fc1a60 <vmstate_virtio_scsi>, opaque=0x55a32985d9a0, version_id=1) at ../migration/vmstate.c:143
#11 0x000055a324cda138 in vmstate_load (f=0x55a329dc0c00, se=0x55a329941c90) at ../migration/savevm.c:913
#12 0x000055a324cdda34 in qemu_loadvm_section_start_full (mis=0x55a3284ef9e0, f=0x55a329dc0c00) at ../migration/savevm.c:2741
#13 qemu_loadvm_state_main (f=f@entry=0x55a329dc0c00, mis=mis@entry=0x55a3284ef9e0) at ../migration/savevm.c:2939
#14 0x000055a324cdf66a in qemu_loadvm_state (f=0x55a329dc0c00) at ../migration/savevm.c:3021
#15 0x000055a324d14b4e in process_incoming_migration_co (opaque=<optimized out>) at ../migration/migration.c:574
#16 0x000055a32501ae3b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173
#17 0x00007ff2929e8000 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#18 0x00007ffed80dc2a0 in ?? ()
#19 0x0000000000000000 in ?? ()

address_space_cache_init() is the only caller of address_space_to_flatview()
I can find in the vmstate_load call stack so far. Although I think the MR used
by address_space_cache_init() won't be affected by the delayed
memory_region_transaction_commit(), we really need a mechanism to prevent
modified MRs from being used.

Maybe we can build a stale list, as sketched below:
If a subregion is added, add its parent to the stale list (considering that
the new subregion's priority has uncertain effects on the flatviews).
If a subregion is deleted, add the subregion itself to the stale list.
When memory_region_transaction_commit() regenerates the flatviews, clear the
stale list.
When address_space_translate_internal() is called, check whether the MR
looked up matches one of the MRs (or their children) in the stale list. If it
does, trigger a crash.
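
A rough sketch of the mechanism (hypothetical helpers; locking and the exact
hook points are left out):

static GSList *mr_stale_list;

/* On memory_region_add_subregion(): the parent's rendering may change,
 * so mark the parent stale. */
static void mr_stale_mark_parent(MemoryRegion *parent)
{
    mr_stale_list = g_slist_prepend(mr_stale_list, parent);
}

/* On memory_region_del_subregion(): mark the removed subregion stale. */
static void mr_stale_mark(MemoryRegion *mr)
{
    mr_stale_list = g_slist_prepend(mr_stale_list, mr);
}

/* In address_space_translate_internal(): crash if the MR we resolved
 * to is stale, i.e. it or one of its ancestors is on the list. */
static void mr_assert_not_stale(MemoryRegion *mr)
{
    for (; mr; mr = mr->container) {
        assert(!g_slist_find(mr_stale_list, mr));
    }
}

/* In memory_region_transaction_commit(): the flatviews were just
 * regenerated, so the whole list becomes clean again. */
static void mr_stale_clear(void)
{
    g_slist_free(mr_stale_list);
    mr_stale_list = NULL;
}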

There may be many details to consider in this mechanism. I hope you can give
some suggestions on its feasibility.

>> I tested migration for lots of times, there was no crash. Does this prove
>> the validity to some extent?
> Yes I think so, it's just that if we cannot 100% prove it's safe (e.g. you
> cannot cover all the code paths in qemu that migration can trigger) then we
> may need some sanity check like above along with the solution to make sure
> even if something wrong it won't go wrong as weird.
>
> And if we want to try this out, it'll better be at the start of a dev cycle
> and we fix things or revert before the next rc0 releases.
>
> I'm not sure whether that assert might be too strong, we can use an error
> instead, but so far I don't see how that can happen and if that happens I
> feel like it's bad enough, so maybe not so much.  Then AFAICT we can
> completely drop start_pack_mr_change with that stronger check.
> If you agree with above, feel free to have two patches in the new version,
> making the depth assert a separate patch.  At the meantime, let's see
> whether you can get some other comments from others.
>
Yes, start_pack_mr_change isn't needed any more. I'll drop it in the new patches.
Re: [External] Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
Posted by Peter Xu 1 year, 4 months ago
On Thu, Dec 08, 2022 at 12:07:03AM +0800, Chuang Xu wrote:
> 
> On 2022/12/6 12:28 AM, Peter Xu wrote:
> > Chuang,
> > 
> > No worry on the delay; you're faster than when I read yours. :)
> > 
> > On Mon, Dec 05, 2022 at 02:56:15PM +0800, Chuang Xu wrote:
> > > > As a start, maybe you can try with poison address_space_to_flatview() (by
> > > > e.g. checking the start_pack_mr_change flag and assert it is not set)
> > > > during this process to see whether any call stack can even try to
> > > > dereference a flatview.
> > > > 
> > > > It's just that I didn't figure a good way to "prove" its validity, even if
> > > > I think this is an interesting idea worth thinking to shrink the downtime.
> > > Thanks for your sugguestions!
> > > I used a thread local variable to identify whether the current thread is a
> > > migration thread(main thread of target qemu) and I modified the code of
> > > qemu_coroutine_switch to make sure the thread local variable true only in
> > > process_incoming_migration_co call stack. If the target qemu detects that
> > > start_pack_mr_change is set and address_space_to_flatview() is called in
> > > non-migrating threads or non-migrating coroutine, it will crash.
> > Are you using the thread var just to avoid the assert triggering in the
> > migration thread when commiting memory changes?
> > 
> > I think _maybe_ another cleaner way to sanity check this is directly upon
> > the depth:
> > 
> > static inline FlatView *address_space_to_flatview(AddressSpace *as)
> > {
> >      /*
> >       * Before using any flatview, sanity check we're not during a memory
> >       * region transaction or the map can be invalid.  Note that this can
> >       * also be called during commit phase of memory transaction, but that
> >       * should also only happen when the depth decreases to 0 first.
> >       */
> >      assert(memory_region_transaction_depth == 0);
> >      return qatomic_rcu_read(&as->current_map);
> > }
> > 
> > That should also cover the safe cases of memory transaction commits during
> > migration.
> > 
> Peter, I tried this way and found that the target qemu will crash.
> 
> Here is the gdb backtrace:
> 
> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> #1  0x00007ff2929d851a in __GI_abort () at abort.c:118
> #2  0x00007ff2929cfe67 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h",
>     line=line@entry=766, function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:92
> #3  0x00007ff2929cff12 in __GI___assert_fail (assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h", line=line@entry=766,
>     function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:101
> #4  0x000055a324b2ed5e in address_space_to_flatview (as=0x55a326132580 <address_space_memory>) at /data00/migration/qemu-5.2.0/include/exec/memory.h:766
> #5  0x000055a324e79559 in address_space_to_flatview (as=0x55a326132580 <address_space_memory>) at ../softmmu/memory.c:811
> #6  address_space_get_flatview (as=0x55a326132580 <address_space_memory>) at ../softmmu/memory.c:805
> #7  0x000055a324e96474 in address_space_cache_init (cache=cache@entry=0x55a32a4fb000, as=<optimized out>, addr=addr@entry=68404985856, len=len@entry=4096, is_write=false) at ../softmmu/physmem.c:3307
> #8  0x000055a324ea9cba in virtio_init_region_cache (vdev=0x55a32985d9a0, n=0) at ../hw/virtio/virtio.c:185
> #9  0x000055a324eaa615 in virtio_load (vdev=0x55a32985d9a0, f=<optimized out>, version_id=<optimized out>) at ../hw/virtio/virtio.c:3203
> #10 0x000055a324c6ab96 in vmstate_load_state (f=f@entry=0x55a329dc0c00, vmsd=0x55a325fc1a60 <vmstate_virtio_scsi>, opaque=0x55a32985d9a0, version_id=1) at ../migration/vmstate.c:143
> #11 0x000055a324cda138 in vmstate_load (f=0x55a329dc0c00, se=0x55a329941c90) at ../migration/savevm.c:913
> #12 0x000055a324cdda34 in qemu_loadvm_section_start_full (mis=0x55a3284ef9e0, f=0x55a329dc0c00) at ../migration/savevm.c:2741
> #13 qemu_loadvm_state_main (f=f@entry=0x55a329dc0c00, mis=mis@entry=0x55a3284ef9e0) at ../migration/savevm.c:2939
> #14 0x000055a324cdf66a in qemu_loadvm_state (f=0x55a329dc0c00) at ../migration/savevm.c:3021
> #15 0x000055a324d14b4e in process_incoming_migration_co (opaque=<optimized out>) at ../migration/migration.c:574
> #16 0x000055a32501ae3b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173
> #17 0x00007ff2929e8000 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> #18 0x00007ffed80dc2a0 in ?? ()
> #19 0x0000000000000000 in ?? ()
> 
> address_space_cache_init() is the only caller of address_space_to_flatview
> I can find in vmstate_load call stack so far. Although I think the mr used
> by address_space_cache_init() won't be affected by the delay of
> memory_region_transaction_commit(), we really need a mechanism to prevent
> the modified mr from being used.
> 
> Maybe we can build a stale list:
> If a subregion is added, add its parent to the stale list(considering that
> new subregion's priority has uncertain effects on flatviews).
> If a subregion is deleted, add itself to the stale list.
> When memory_region_transaction_commit() regenerates flatviews, clear the
> stale list.
> when address_space_translate_internal() is called, check whether the mr
> looked up matches one of mrs(or its child)in the stale list. If yes, a
> crash will be triggered.

I'm not sure that'll work, though.  Consider this graph:

                            A
                           / \
                          B   C
                       (p=1) (p=0)

A, B, C are MRs; B & C are subregions of A.  When B's priority is higher
(p=1), any access to A will be routed to B, so far so good.

Then, let's assume D comes under C with even higher priority:

                            A
                           / \
                          B   C
                       (p=1) (p=0)
                              |
                              D
                             (p=2)


Adding C into the stale list won't work, because with the old flatview the
access will resolve to B instead, while B is not in the stale list.  The AS
operation will carry on without noticing it's already wrong.

> 
> There may be many details to consider in this mechanism. Hope you can give
> some suggestions on its feasibility.

For this specific case, I'm wildly thinking whether we can just postpone
the init of the vring cache until migration completes.

One thing to mention from what I read: we'll need to update all the caches
in virtio_memory_listener_commit() anyway, when the batched commit() happens
as migration completes with your approach, so we'd rebuild the vring caches
once and for all, which also looks nice if possible.

There are some details to consider. E.g. the commit() happens only when
memory_region_update_pending == true.  We may want to make sure the cache is
initialized unconditionally, at least.  Not sure whether that's doable,
though.
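
Something along these lines, maybe (a sketch only; the
memory_region_transaction_in_progress() predicate is an assumed helper, e.g.
one exposing memory_region_transaction_depth):

/* In virtio_load(), instead of filling the caches immediately: */
static void virtio_load_init_region_caches(VirtIODevice *vdev)
{
    int i;

    if (memory_region_transaction_in_progress()) { /* assumed helper */
        /*
         * Flat views are stale while commits are batched during the
         * incoming migration; let virtio_memory_listener_commit()
         * (re)build the caches when the final commit runs.
         */
        return;
    }

    for (i = 0; i < VIRTIO_QUEUE_MAX; i++) {
        if (vdev->vq[i].vring.num) {
            virtio_init_region_cache(vdev, i);
        }
    }
}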

Thanks,

-- 
Peter Xu


Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
Posted by Chuang Xu 1 year, 4 months ago
On 2022/12/8 6:08 AM, Peter Xu wrote:
> On Thu, Dec 08, 2022 at 12:07:03AM +0800, Chuang Xu wrote:
>> On 2022/12/6 12:28 AM, Peter Xu wrote:
>>> Chuang,
>>>
>>> No worry on the delay; you're faster than when I read yours. :)
>>>
>>> On Mon, Dec 05, 2022 at 02:56:15PM +0800, Chuang Xu wrote:
>>>>> As a start, maybe you can try with poison address_space_to_flatview() (by
>>>>> e.g. checking the start_pack_mr_change flag and assert it is not set)
>>>>> during this process to see whether any call stack can even try to
>>>>> dereference a flatview.
>>>>>
>>>>> It's just that I didn't figure a good way to "prove" its validity, even if
>>>>> I think this is an interesting idea worth thinking to shrink the downtime.
>>>> Thanks for your sugguestions!
>>>> I used a thread local variable to identify whether the current thread is a
>>>> migration thread(main thread of target qemu) and I modified the code of
>>>> qemu_coroutine_switch to make sure the thread local variable true only in
>>>> process_incoming_migration_co call stack. If the target qemu detects that
>>>> start_pack_mr_change is set and address_space_to_flatview() is called in
>>>> non-migrating threads or non-migrating coroutine, it will crash.
>>> Are you using the thread var just to avoid the assert triggering in the
>>> migration thread when commiting memory changes?
>>>
>>> I think _maybe_ another cleaner way to sanity check this is directly upon
>>> the depth:
>>>
>>> static inline FlatView *address_space_to_flatview(AddressSpace *as)
>>> {
>>>       /*
>>>        * Before using any flatview, sanity check we're not during a memory
>>>        * region transaction or the map can be invalid.  Note that this can
>>>        * also be called during commit phase of memory transaction, but that
>>>        * should also only happen when the depth decreases to 0 first.
>>>        */
>>>       assert(memory_region_transaction_depth == 0);
>>>       return qatomic_rcu_read(&as->current_map);
>>> }
>>>
>>> That should also cover the safe cases of memory transaction commits during
>>> migration.
>>>
>> Peter, I tried this way and found that the target qemu will crash.
>>
>> Here is the gdb backtrace:
>>
>> #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
>> #1  0x00007ff2929d851a in __GI_abort () at abort.c:118
>> #2  0x00007ff2929cfe67 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h",
>>      line=line@entry=766, function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:92
>> #3  0x00007ff2929cff12 in __GI___assert_fail (assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h", line=line@entry=766,
>>      function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:101
>> #4  0x000055a324b2ed5e in address_space_to_flatview (as=0x55a326132580 <address_space_memory>) at /data00/migration/qemu-5.2.0/include/exec/memory.h:766
>> #5  0x000055a324e79559 in address_space_to_flatview (as=0x55a326132580 <address_space_memory>) at ../softmmu/memory.c:811
>> #6  address_space_get_flatview (as=0x55a326132580 <address_space_memory>) at ../softmmu/memory.c:805
>> #7  0x000055a324e96474 in address_space_cache_init (cache=cache@entry=0x55a32a4fb000, as=<optimized out>, addr=addr@entry=68404985856, len=len@entry=4096, is_write=false) at ../softmmu/physmem.c:3307
>> #8  0x000055a324ea9cba in virtio_init_region_cache (vdev=0x55a32985d9a0, n=0) at ../hw/virtio/virtio.c:185
>> #9  0x000055a324eaa615 in virtio_load (vdev=0x55a32985d9a0, f=<optimized out>, version_id=<optimized out>) at ../hw/virtio/virtio.c:3203
>> #10 0x000055a324c6ab96 in vmstate_load_state (f=f@entry=0x55a329dc0c00, vmsd=0x55a325fc1a60 <vmstate_virtio_scsi>, opaque=0x55a32985d9a0, version_id=1) at ../migration/vmstate.c:143
>> #11 0x000055a324cda138 in vmstate_load (f=0x55a329dc0c00, se=0x55a329941c90) at ../migration/savevm.c:913
>> #12 0x000055a324cdda34 in qemu_loadvm_section_start_full (mis=0x55a3284ef9e0, f=0x55a329dc0c00) at ../migration/savevm.c:2741
>> #13 qemu_loadvm_state_main (f=f@entry=0x55a329dc0c00, mis=mis@entry=0x55a3284ef9e0) at ../migration/savevm.c:2939
>> #14 0x000055a324cdf66a in qemu_loadvm_state (f=0x55a329dc0c00) at ../migration/savevm.c:3021
>> #15 0x000055a324d14b4e in process_incoming_migration_co (opaque=<optimized out>) at ../migration/migration.c:574
>> #16 0x000055a32501ae3b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173
>> #17 0x00007ff2929e8000 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
>> #18 0x00007ffed80dc2a0 in ?? ()
>> #19 0x0000000000000000 in ?? ()
>>
>> address_space_cache_init() is the only caller of address_space_to_flatview
>> I can find in vmstate_load call stack so far. Although I think the mr used
>> by address_space_cache_init() won't be affected by the delay of
>> memory_region_transaction_commit(), we really need a mechanism to prevent
>> the modified mr from being used.
>>
>> Maybe we can build a stale list:
>> If a subregion is added, add its parent to the stale list (considering that
>> the new subregion's priority has uncertain effects on the flatviews).
>> If a subregion is deleted, add the subregion itself to the stale list.
>> When memory_region_transaction_commit() regenerates the flatviews, clear the
>> stale list.
>> When address_space_translate_internal() is called, check whether the mr
>> looked up matches one of the mrs (or their children) in the stale list. If
>> yes, a crash will be triggered.
> I'm not sure that'll work, though.  Consider this graph:
>
>                              A
>                             / \
>                            B   C
>                         (p=1) (p=0)
>
> A,B,C are MRs, B&C are subregions to A.  When B's priority is higher (p=1),
> any access to A will go upon B, so far so good.
>
> Then, let's assume D comes under C with even higher priority:
>
>                              A
>                             / \
>                            B   C
>                         (p=1) (p=0)
>                                |
>                                D
>                               (p=2)
>
>
> Adding C into stale list won't work because when with the old flatview
> it'll point to B instead, while B is not in the stale list. The AS
> operation will carry out without noticing it's already wrong.

Peter, I think our understanding of priority is different.

In the qemu docs
(https://qemu.readthedocs.io/en/stable-6.1/devel/memory.html#overlapping-regions-and-priority),
it says 'Priority values are local to a container, because the priorities of
two regions are only compared when they are both children of the same container.'
  
And as I read the code, when render_memory_region() operates on A, QEMU will
first insert B's FlatRanges and its children's FlatRanges recursively,
because B's priority is higher than C's. After B's FlatRanges and its
children's FlatRanges are all inserted into the flatview, C's FlatRanges and
its children's FlatRanges are inserted into the gaps left by B where B and C
overlap.
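
To make my reading concrete, here is a rough sketch of the order I believe
the rendering follows (simplified pseudocode, not the real
render_memory_region(); insert_own_ranges_into_gaps() is a made-up name for
the gap-filling step):

/* Simplified sketch of my understanding of flatview rendering; the real
 * logic lives in render_memory_region() in softmmu/memory.c. */
static void render_mr_sketch(FlatView *view, MemoryRegion *mr)
{
    MemoryRegion *subregion;

    if (!mr->enabled) {
        return;
    }

    /* Subregions are kept sorted by descending priority, so within A the
     * whole subtree of B (p=1) is rendered before C (p=0). */
    QTAILQ_FOREACH(subregion, &mr->subregions, subregions_link) {
        render_mr_sketch(view, subregion);
    }

    /* Only a terminal (leaf) region contributes FlatRanges of its own,
     * and only into the gaps left by higher-priority siblings. */
    if (mr->terminates) {
        insert_own_ranges_into_gaps(view, mr); /* made-up helper */
    }
}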

So I think adding D as C's subregion has no effect on B in your second case.
The old FlatRange pointing to B is still effective. C and C's children with
lower priority than D will be affected, but we have flagged them as stale.

I hope I have no misunderstanding of the flatview construction code. If I
understand it wrong, please forgive my ignorance..😭

>
>> There may be many details to consider in this mechanism. Hope you can give
>> some suggestions on its feasibility.
> For this specific case, I'm wildly thinking whether we can just postpone
> the init of the vring cache until migration completes.
>
> One thing to mention from what I've read: we'll need to update all the
> caches in virtio_memory_listener_commit() anyway, when the batched commit()
> happens when migration completes with your approach, so we'll rebuild the
> vring cache once and for all which looks also nice if possible.
>
> There's some details to consider. E.g. the commit() happens only when
> memory_region_update_pending==true.  We may want to make sure the cache is
> initialized unconditionally, at least.  Not sure whether that's doable,
> though.
>
> Thanks,
>
Good idea! We can try it in the new patches! And with the delay of
virtio_init_region_cache(), we can still use the assert in address_space_to_flatview().
However, I think the stale list is worth keeping for further discussion in
the future, because it may adapt to more complex scenarios.
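
A rough sketch of what I have in mind for the delay (untested;
migration_incoming_in_progress() and region_cache_dirty are made-up names):

/* Hypothetical sketch: skip the cache init while the batched transaction
 * is still open, and force one rebuild when it finally commits, even if
 * memory_region_update_pending is false at that point. */
static void virtio_load_region_cache_sketch(VirtIODevice *vdev, int n)
{
    if (migration_incoming_in_progress()) {    /* made-up predicate */
        vdev->vq[n].region_cache_dirty = true; /* made-up field */
        return;
    }
    virtio_init_region_cache(vdev, n);
}

Then virtio_memory_listener_commit() would rebuild any queue whose cache is
still marked dirty, so every cache gets initialized exactly once at the
final commit.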

Thanks.
Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
Posted by Peter Xu 1 year, 4 months ago
On Thu, Dec 08, 2022 at 10:39:11PM +0800, Chuang Xu wrote:
> 
> On 2022/12/8 6:08 AM, Peter Xu wrote:
> > On Thu, Dec 08, 2022 at 12:07:03AM +0800, Chuang Xu wrote:
> > > On 2022/12/6 12:28 AM, Peter Xu wrote:
> > > > Chuang,
> > > > 
> > > > No worry on the delay; you're faster than when I read yours. :)
> > > > 
> > > > On Mon, Dec 05, 2022 at 02:56:15PM +0800, Chuang Xu wrote:
> > > > > > As a start, maybe you can try to poison address_space_to_flatview() (by
> > > > > > e.g. checking the start_pack_mr_change flag and assert it is not set)
> > > > > > during this process to see whether any call stack can even try to
> > > > > > dereference a flatview.
> > > > > > 
> > > > > > It's just that I didn't figure a good way to "prove" its validity, even if
> > > > > > I think this is an interesting idea worth exploring to shrink the downtime.
> > > > > Thanks for your suggestions!
> > > > > I used a thread local variable to identify whether the current thread is a
> > > > > migration thread (the main thread of the target qemu) and I modified the code of
> > > > > qemu_coroutine_switch to make sure the thread local variable is true only in
> > > > > process_incoming_migration_co call stack. If the target qemu detects that
> > > > > start_pack_mr_change is set and address_space_to_flatview() is called in
> > > > > non-migrating threads or non-migrating coroutine, it will crash.
> > > > Are you using the thread var just to avoid the assert triggering in the
> > > > migration thread when committing memory changes?
> > > > 
> > > > I think _maybe_ another cleaner way to sanity check this is directly upon
> > > > the depth:
> > > > 
> > > > static inline FlatView *address_space_to_flatview(AddressSpace *as)
> > > > {
> > > >       /*
> > > >        * Before using any flatview, sanity check we're not during a memory
> > > >        * region transaction or the map can be invalid.  Note that this can
> > > >        * also be called during commit phase of memory transaction, but that
> > > >        * should also only happen when the depth decreases to 0 first.
> > > >        */
> > > >       assert(memory_region_transaction_depth == 0);
> > > >       return qatomic_rcu_read(&as->current_map);
> > > > }
> > > > 
> > > > That should also cover the safe cases of memory transaction commits during
> > > > migration.
> > > > 
> > > Peter, I tried this way and found that the target qemu will crash.
> > > 
> > > Here is the gdb backtrace:
> > > 
> > > #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
> > > #1  0x00007ff2929d851a in __GI_abort () at abort.c:118
> > > #2  0x00007ff2929cfe67 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h",
> > >      line=line@entry=766, function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:92
> > > #3  0x00007ff2929cff12 in __GI___assert_fail (assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h", line=line@entry=766,
> > >      function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:101
> > > #4  0x000055a324b2ed5e in address_space_to_flatview (as=0x55a326132580 <address_space_memory>) at /data00/migration/qemu-5.2.0/include/exec/memory.h:766
> > > #5  0x000055a324e79559 in address_space_to_flatview (as=0x55a326132580 <address_space_memory>) at ../softmmu/memory.c:811
> > > #6  address_space_get_flatview (as=0x55a326132580 <address_space_memory>) at ../softmmu/memory.c:805
> > > #7  0x000055a324e96474 in address_space_cache_init (cache=cache@entry=0x55a32a4fb000, as=<optimized out>, addr=addr@entry=68404985856, len=len@entry=4096, is_write=false) at ../softmmu/physmem.c:3307
> > > #8  0x000055a324ea9cba in virtio_init_region_cache (vdev=0x55a32985d9a0, n=0) at ../hw/virtio/virtio.c:185
> > > #9  0x000055a324eaa615 in virtio_load (vdev=0x55a32985d9a0, f=<optimized out>, version_id=<optimized out>) at ../hw/virtio/virtio.c:3203
> > > #10 0x000055a324c6ab96 in vmstate_load_state (f=f@entry=0x55a329dc0c00, vmsd=0x55a325fc1a60 <vmstate_virtio_scsi>, opaque=0x55a32985d9a0, version_id=1) at ../migration/vmstate.c:143
> > > #11 0x000055a324cda138 in vmstate_load (f=0x55a329dc0c00, se=0x55a329941c90) at ../migration/savevm.c:913
> > > #12 0x000055a324cdda34 in qemu_loadvm_section_start_full (mis=0x55a3284ef9e0, f=0x55a329dc0c00) at ../migration/savevm.c:2741
> > > #13 qemu_loadvm_state_main (f=f@entry=0x55a329dc0c00, mis=mis@entry=0x55a3284ef9e0) at ../migration/savevm.c:2939
> > > #14 0x000055a324cdf66a in qemu_loadvm_state (f=0x55a329dc0c00) at ../migration/savevm.c:3021
> > > #15 0x000055a324d14b4e in process_incoming_migration_co (opaque=<optimized out>) at ../migration/migration.c:574
> > > #16 0x000055a32501ae3b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173
> > > #17 0x00007ff2929e8000 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
> > > #18 0x00007ffed80dc2a0 in ?? ()
> > > #19 0x0000000000000000 in ?? ()
> > > 
> > > address_space_cache_init() is the only caller of address_space_to_flatview
> > > I can find in vmstate_load call stack so far. Although I think the mr used
> > > by address_space_cache_init() won't be affected by the delay of
> > > memory_region_transaction_commit(), we really need a mechanism to prevent
> > > the modified mr from being used.
> > > 
> > > Maybe we can build a stale list:
> > > If a subregion is added, add its parent to the stale list (considering that
> > > the new subregion's priority has uncertain effects on the flatviews).
> > > If a subregion is deleted, add the subregion itself to the stale list.
> > > When memory_region_transaction_commit() regenerates the flatviews, clear the
> > > stale list.
> > > When address_space_translate_internal() is called, check whether the mr
> > > looked up matches one of the mrs (or their children) in the stale list. If
> > > yes, a crash will be triggered.
> > I'm not sure that'll work, though.  Consider this graph:
> > 
> >                              A
> >                             / \
> >                            B   C
> >                         (p=1) (p=0)
> > 
> > A,B,C are MRs, B&C are subregions to A.  When B's priority is higher (p=1),
> > any access to A will go upon B, so far so good.
> > 
> > Then, let's assume D comes under C with even higher priority:
> > 
> >                              A
> >                             / \
> >                            B   C
> >                         (p=1) (p=0)
> >                                |
> >                                D
> >                               (p=2)
> > 
> > 
> > Adding C into stale list won't work because when with the old flatview
> > it'll point to B instead, while B is not in the stale list. The AS
> > operation will carry out without noticing it's already wrong.
> 
> Peter, I think our understanding of priority is different.
> 
> In the qemu docs
> (https://qemu.readthedocs.io/en/stable-6.1/devel/memory.html#overlapping-regions-and-priority),
> it says 'Priority values are local to a container, because the priorities of
> two regions are only compared when they are both children of the same container.'
> And as I read the code, when render_memory_region() operates on A, QEMU will
> first insert B's FlatRanges and its children's FlatRanges recursively,
> because B's priority is higher than C's. After B's FlatRanges and its
> children's FlatRanges are all inserted into the flatview, C's FlatRanges and
> its children's FlatRanges are inserted into the gaps left by B where B and C
> overlap.
> 
> So I think adding D as C's subregion has no effect on B in your second case.
> The old FlatRange pointing to B is still effective. C and C's children with
> lower priority than D will be affected, but we have flagged them as stale.
> 
> I hope I have no misunderstanding of the flatview construction code. If I
> understand it wrong, please forgive my ignorance..😭

No I think you're right.. thanks, I should read the code/doc first rather
than trusting myself. :)

But still, the whole point is that the parent may not even be visible to
the flatview, so I still don't know how it could work.

My 2nd attempt:

                                  A
                                  |
                                  B
                                (p=1)

Adding C with p=2:

                                  A
                                 / \
                                B   C
                             (p=1) (p=2)

IIUC the flatview entry covering the offset where A resides should point to
B, and after C is plugged we'll still look up and find B.  Even if A is in
the stale list, B is not?

The other thing I didn't mention is that I don't think the address space
translation is the solo consumer of the flat view.  Some examples:

common_semi_find_bases() walks the flatview without translations.

memory_region_update_coalesced_range() (calls address_space_get_flatview()
first) notifies kvm coalesced mmio regions without translations.

So at least hooking up address_space_translate_internal() itself may not be
enough either.

> 
> > 
> > > There may be many details to consider in this mechanism. Hope you can give
> > > some suggestions on its feasibility.
> > For this specific case, I'm wildly thinking whether we can just postpone
> > the init of the vring cache until migration completes.
> > 
> > One thing to mention from what I've read: we'll need to update all the
> > caches in virtio_memory_listener_commit() anyway, when the batched commit()
> > happens when migration completes with your approach, so we'll rebuild the
> > vring cache once and for all which looks also nice if possible.
> > 
> > There's some details to consider. E.g. the commit() happens only when
> > memory_region_update_pending==true.  We may want to make sure the cache is
> > initialized unconditionally, at least.  Not sure whether that's doable,
> > though.
> > 
> > Thanks,
> > 
> Good idea! We can try it in the new patches! And with the delay of
> virtio_init_region_cache(), we can still use the assert in address_space_to_flatview().
> However, I think the stale list is worth keeping for further discussion in
> the future, because it may adapt to more complex scenarios.

If the assert will work that'll be even better.  I'm actually worried it
can trigger like what you mentioned in the virtio path; I didn't expect it
to come that soon.  So if there's a minimal set of cases and we can fix them
up easily, that'll be great.  Hopefully there aren't too many, or we'll need
to revisit the whole idea.

Thanks,

-- 
Peter Xu


Re: [External] Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
Posted by Chuang Xu 1 year, 4 months ago
Hi, Peter!

This email is a supplement to my previous email from 7 hours ago.

On 2022/12/9 12:00 AM, Peter Xu wrote:

If the assert will work that'll be even better.  I'm actually worried it
can trigger like what you mentioned in the virtio path; I didn't expect it
to come that soon.  So if there's a minimal set of cases and we can fix them
up easily, that'll be great.  Hopefully there aren't too many, or we'll need
to revisit the whole idea.

Thanks,


Delaying only virtio_init_region_cache() will break the checks and cache
updates that follow the original virtio_init_region_cache() call.

Here are the patches related to these checks and cache operations:
https://gitlab.com/qemu-project/qemu/-/commit/1abeb5a65d515f8a8a9cfc4a82342f731bd9321f
https://gitlab.com/qemu-project/qemu/-/commit/be1fea9bc286f64c6c995bb0d7145a0b738aeddb
https://gitlab.com/qemu-project/qemu/-/commit/b796fcd1bf2978aed15748db04e054f34789e9eb
https://gitlab.com/qemu-project/qemu/-/commit/bccdef6b1a204db0f41ffb6e24ce373e4d7890d4

I think I should try to postpone these checks and cache updates too..
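
To be concrete, something like the sketch below is what I mean by
postponing (virtio_load_late_vq_sketch() is a made-up helper; the checks
are the ones added by the commits above):

/* Hypothetical sketch: gather the per-queue tail of virtio_load() -- the
 * cache init plus the checks and index reads that depend on a valid
 * flatview -- into one helper, and call it only after the batched memory
 * transaction has committed. */
static int virtio_load_late_vq_sketch(VirtIODevice *vdev, int n)
{
    virtio_init_region_cache(vdev, n);
    /* ... then redo the used_idx/avail_idx reads and the consistency
     * checks that currently sit right after the cache init ... */
    return 0;
}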

Thanks!
Re: [RFC PATCH] migration: reduce time of loading non-iterable vmstate
Posted by Chuang Xu 1 year, 4 months ago
On 2022/12/9 12:00 AM, Peter Xu wrote:

On Thu, Dec 08, 2022 at 10:39:11PM +0800, Chuang Xu wrote:

On 2022/12/8 6:08 AM, Peter Xu wrote:

On Thu, Dec 08, 2022 at 12:07:03AM +0800, Chuang Xu wrote:

On 2022/12/6 12:28 AM, Peter Xu wrote:

Chuang,

No worry on the delay; you're faster than when I read yours. :)

On Mon, Dec 05, 2022 at 02:56:15PM +0800, Chuang Xu wrote:

As a start, maybe you can try to poison address_space_to_flatview() (by
e.g. checking the start_pack_mr_change flag and assert it is not set)
during this process to see whether any call stack can even try to
dereference a flatview.

It's just that I didn't figure a good way to "prove" its validity, even if
I think this is an interesting idea worth exploring to shrink the downtime.

Thanks for your suggestions!
I used a thread local variable to identify whether the current thread is a
migration thread (the main thread of the target qemu) and I modified the code of
qemu_coroutine_switch to make sure the thread local variable is true only in
process_incoming_migration_co call stack. If the target qemu detects that
start_pack_mr_change is set and address_space_to_flatview() is called in
non-migrating threads or non-migrating coroutine, it will crash.

Are you using the thread var just to avoid the assert triggering in the
migration thread when committing memory changes?

I think _maybe_ another cleaner way to sanity check this is directly upon
the depth:

static inline FlatView *address_space_to_flatview(AddressSpace *as)
{
      /*
       * Before using any flatview, sanity check we're not during a memory
       * region transaction or the map can be invalid.  Note that this can
       * also be called during commit phase of memory transaction, but that
       * should also only happen when the depth decreases to 0 first.
       */
      assert(memory_region_transaction_depth == 0);
      return qatomic_rcu_read(&as->current_map);
}

That should also cover the safe cases of memory transaction commits during
migration.


Peter, I tried this way and found that the target qemu will crash.

Here is the gdb backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007ff2929d851a in __GI_abort () at abort.c:118
#2  0x00007ff2929cfe67 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h",
     line=line@entry=766, function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:92
#3  0x00007ff2929cff12 in __GI___assert_fail (assertion=assertion@entry=0x55a32578cdc0 "memory_region_transaction_depth == 0", file=file@entry=0x55a32575d9b0 "/data00/migration/qemu-5.2.0/include/exec/memory.h", line=line@entry=766,
     function=function@entry=0x55a32578d6e0 <__PRETTY_FUNCTION__.20463> "address_space_to_flatview") at assert.c:101
#4  0x000055a324b2ed5e in address_space_to_flatview (as=0x55a326132580 <address_space_memory>) at /data00/migration/qemu-5.2.0/include/exec/memory.h:766
#5  0x000055a324e79559 in address_space_to_flatview (as=0x55a326132580 <address_space_memory>) at ../softmmu/memory.c:811
#6  address_space_get_flatview (as=0x55a326132580 <address_space_memory>) at ../softmmu/memory.c:805
#7  0x000055a324e96474 in address_space_cache_init (cache=cache@entry=0x55a32a4fb000, as=<optimized out>, addr=addr@entry=68404985856, len=len@entry=4096, is_write=false) at ../softmmu/physmem.c:3307
#8  0x000055a324ea9cba in virtio_init_region_cache (vdev=0x55a32985d9a0, n=0) at ../hw/virtio/virtio.c:185
#9  0x000055a324eaa615 in virtio_load (vdev=0x55a32985d9a0, f=<optimized out>, version_id=<optimized out>) at ../hw/virtio/virtio.c:3203
#10 0x000055a324c6ab96 in vmstate_load_state (f=f@entry=0x55a329dc0c00, vmsd=0x55a325fc1a60 <vmstate_virtio_scsi>, opaque=0x55a32985d9a0, version_id=1) at ../migration/vmstate.c:143
#11 0x000055a324cda138 in vmstate_load (f=0x55a329dc0c00, se=0x55a329941c90) at ../migration/savevm.c:913
#12 0x000055a324cdda34 in qemu_loadvm_section_start_full (mis=0x55a3284ef9e0, f=0x55a329dc0c00) at ../migration/savevm.c:2741
#13 qemu_loadvm_state_main (f=f@entry=0x55a329dc0c00, mis=mis@entry=0x55a3284ef9e0) at ../migration/savevm.c:2939
#14 0x000055a324cdf66a in qemu_loadvm_state (f=0x55a329dc0c00) at ../migration/savevm.c:3021
#15 0x000055a324d14b4e in process_incoming_migration_co (opaque=<optimized out>) at ../migration/migration.c:574
#16 0x000055a32501ae3b in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at ../util/coroutine-ucontext.c:173
#17 0x00007ff2929e8000 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#18 0x00007ffed80dc2a0 in ?? ()
#19 0x0000000000000000 in ?? ()

address_space_cache_init() is the only caller of address_space_to_flatview
I can find in vmstate_load call stack so far. Although I think the mr used
by address_space_cache_init() won't be affected by the delay of
memory_region_transaction_commit(), we really need a mechanism to prevent
the modified mr from being used.

Maybe we can build a stale list:
If a subregion is added, add its parent to the stale list (considering that
the new subregion's priority has uncertain effects on the flatviews).
If a subregion is deleted, add the subregion itself to the stale list.
When memory_region_transaction_commit() regenerates the flatviews, clear the
stale list.
When address_space_translate_internal() is called, check whether the mr
looked up matches one of the mrs (or their children) in the stale list. If
yes, a crash will be triggered.

I'm not sure that'll work, though.  Consider this graph:

                             A
                            / \
                           B   C
                        (p=1) (p=0)

A,B,C are MRs, B&C are subregions to A.  When B's priority is higher (p=1),
any access to A will go upon B, so far so good.

Then, let's assume D comes under C with even higher priority:

                             A
                            / \
                           B   C
                        (p=1) (p=0)
                               |
                               D
                              (p=2)


Adding C into stale list won't work because when with the old flatview
it'll point to B instead, while B is not in the stale list. The AS
operation will carry out without noticing it's already wrong.

Peter, I think our understanding of priority is different.

In the qemu docs
(https://qemu.readthedocs.io/en/stable-6.1/devel/memory.html#overlapping-regions-and-priority),
it says 'Priority values are local to a container, because the priorities of
two regions are only compared when they are both children of the same
container.'
And as I read the code, when render_memory_region() operates on A, QEMU will
first insert B's FlatRanges and its children's FlatRanges recursively,
because B's priority is higher than C's. After B's FlatRanges and its
children's FlatRanges are all inserted into the flatview, C's FlatRanges and
its children's FlatRanges are inserted into the gaps left by B where B and C
overlap.

So I think adding D as C's subregion has no effect on B in your second case.
The old FlatRange pointing to B is still effective. C and C's children with
lower priority than D will be affected, but we have flagged them as stale.

I hope I have no misunderstanding of the flatview construction code. If I
understand it wrong, please forgive my ignorance..😭

No I think you're right.. thanks, I should read the code/doc first rather
than trusting myself. :)

But still, the whole point is that the parent may not even be visible to
the flatview, so I still don't know how it could work.

My 2nd attempt:

                                  A
                                  |
                                  B
                                (p=1)

Adding C with p=2:

                                  A
                                 / \
                                B   C
                             (p=1) (p=2)

IIUC the flatview entry covering the offset where A resides should point to
B, and after C is plugged we'll still look up and find B.  Even if A is in
the stale list, B is not?

Sorry, I forgot to describe my latest ideas about this mechanism in detail.

We can add A, B and B's children to the stale list (if there are D, E and
other mrs with lower priority than C, we can also add them and their children
to the stale list recursively). Or we can add a stale flag to the mr
structure to avoid the cost of searching for the mr in the stale list..
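
A rough sketch of the stale-flag variant ('stale' is a made-up new field on
MemoryRegion):

/* Hypothetical sketch: mark a region and all of its children stale when
 * a batched transaction may have invalidated them. */
static void memory_region_mark_stale(MemoryRegion *mr)
{
    MemoryRegion *subregion;

    mr->stale = true; /* made-up new field */
    QTAILQ_FOREACH(subregion, &mr->subregions, subregions_link) {
        memory_region_mark_stale(subregion);
    }
}

Consumers would then assert(!mr->stale) after each lookup (e.g. in
address_space_translate_internal()), and memory_region_transaction_commit()
would clear the flags once the flatviews are regenerated.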

The other thing I didn't mention is that I don't think the address space
translation is the solo consumer of the flat view.  Some examples:

common_semi_find_bases() walks the flatview without translations.

memory_region_update_coalesced_range() (calls address_space_get_flatview()
first) notifies kvm coalesced mmio regions without translations.

So at least hooking up address_space_translate_internal() itself may not be
enough either.

This is really a problem. Maybe we should check whether the mr is stale on
all critical paths, or find other, more general ways..

There may be many details to consider in this mechanism. Hope you can give
some suggestions on its feasibility.

For this specific case, I'm wildly thinking whether we can just postpone
the init of the vring cache until migration completes.

One thing to mention from what I've read: we'll need to update all the
caches in virtio_memory_listener_commit() anyway, when the batched commit()
happens when migration completes with your approach, so we'll rebuild the
vring cache once and for all which looks also nice if possible.

There's some details to consider. E.g. the commit() happens only when
memory_region_update_pending==true.  We may want to make sure the cache is
initialized unconditionally, at least.  Not sure whether that's doable,
though.

Thanks,


Good idea! We can try it in the new patches! And with the delay of
virtio_init_region_cache(), we can still use the assert in
address_space_to_flatview().
However, I think the stale list is worth keeping for further discussion
in the future, because it may adapt to more complex scenarios.

If the assert will work that'll be even better.  I'm actually worried it
can trigger like what you mentioned in the virtio path; I didn't expect it
to come that soon.  So if there's a minimal set of cases and we can fix them
up easily, that'll be great.  Hopefully there aren't too many, or we'll need
to revisit the whole idea.

Thanks,

+1.. Hope this is the only case that will trigger the crash.

I'll upload the second version as soon as possible.

Thanks.