[Qemu-devel] [PATCH] migration: add capability to bypass the shared memory

Lai Jiangshan posted 1 patch 6 years ago
Failed to apply to current master (apply log)
There is a newer version of this series
migration/migration.c | 13 +++++++++++++
migration/migration.h |  1 +
migration/ram.c       | 26 +++++++++++++++++---------
qapi/migration.json   |  8 +++++++-
4 files changed, 38 insertions(+), 10 deletions(-)
[Qemu-devel] [PATCH] migration: add capability to bypass the shared memory
Posted by Lai Jiangshan 6 years ago
1) What's this

When the migration capability 'bypass-shared-memory'
is set, shared memory is bypassed during migration.

It is the key feature that enables several powerful capabilities in
qemu, such as qemu-local-migration, qemu-live-update,
extremely-fast-save-restore, vm-template, vm-fast-live-clone,
yet-another-post-copy-migration, etc.

The philosophy behind this key feature, and the advanced features
built on top of it, is that a part of the memory management is
separated out from qemu, so that other toolkits such as libvirt,
kata-containers (https://github.com/kata-containers),
runv (https://github.com/hyperhq/runv/), or several cooperating
qemu processes can directly access it, manage it, and provide
features on top of it.

2) Status in the real world

hyperhq (http://hyper.sh  http://hypercontainer.io/)
introduced the vm-template (vm-fast-live-clone) feature
to the hyper container several years ago, and it works well
(see https://github.com/hyperhq/runv/pull/297).

The vm-template feature lets containers (VMs) be
started in 130ms and saves 80M of memory for every
container (VM), so hyper containers are as fast
and as high-density as normal containers.

The kata-containers project (https://github.com/kata-containers),
which was launched by hyper, intel and friends and which descends
from runv (and clear-container), should have this feature enabled.
Unfortunately, due to a code conflict between runv and cc,
the feature was temporarily disabled; it is being brought
back by the hyper and intel teams.

3) How to use it and bring up the advanced features.

On the current qemu command line, shared memory has
to be configured via a memory backend object.

a) feature: qemu-local-migration, qemu-live-update
Put the mem-path on tmpfs and set share=on for it when
starting the vm, for example:
-object \
memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
-numa node,nodeid=0,cpus=0-7,memdev=mem

When you want to migrate the vm locally (after fixing a security bug
in the qemu binary, or for another reason), you can start a new qemu
with the same command line plus -incoming, then migrate the
vm from the old qemu to the new qemu with the migration capability
'bypass-shared-memory' set. The migration transfers the device state
*ONLY*; the memory remains the original memory, backed by the tmpfs file.
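
For illustration, a minimal sketch of this local-migration flow, driving
qemu through an HMP monitor socket (the socket paths and the use of socat
are assumptions for the example, not part of this patch):

# old VM: shared, file-backed RAM plus a monitor socket
qemu-system-x86_64 \
    -object memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
    -numa node,nodeid=0,cpus=0-7,memdev=mem \
    -monitor unix:/tmp/old.mon,server,nowait &

# new VM: same command line, same memory file, waiting for the device state
qemu-system-x86_64 \
    -object memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
    -numa node,nodeid=0,cpus=0-7,memdev=mem \
    -incoming unix:/tmp/mig.sock &

# on the old VM: skip the shared RAM and send only the device state
echo 'migrate_set_capability bypass-shared-memory on' | socat - UNIX-CONNECT:/tmp/old.mon
echo 'migrate unix:/tmp/mig.sock' | socat - UNIX-CONNECT:/tmp/old.mon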

b) feature: extremely-fast-save-restore
The same as above, but the mem-path is on a persistent file system.
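
A hedged sketch of this save/restore flow, assuming the memory backend now
lives on a persistent filesystem (all paths and the monitor socket are
invented):

# save: stop the guest, then write the device state to a small file;
# the RAM itself stays in the memory file
echo 'stop' | socat - UNIX-CONNECT:/tmp/vm.mon
echo 'migrate_set_capability bypass-shared-memory on' | socat - UNIX-CONNECT:/tmp/vm.mon
echo 'migrate "exec:cat > /var/lib/vm/state"' | socat - UNIX-CONNECT:/tmp/vm.mon

# restore: point a fresh qemu at the same memory file and the saved state
qemu-system-x86_64 \
    -object memory-backend-file,id=mem,size=128M,mem-path=/var/lib/vm/memory,share=on \
    -numa node,nodeid=0,cpus=0-7,memdev=mem \
    -incoming 'exec:cat /var/lib/vm/state'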

c) feature: vm-template, vm-fast-live-clone
The template vm is started as in a), and paused when the guest reaches
the template point (example: the guest app is ready); then the template
vm is saved. (The qemu process of the template can be killed now, because
we need only the memory file and the device-state file (in tmpfs).)

Then we can launch one or multiple VMs based on the template vm state.
The new VMs are started without "share=on"; all the new VMs share
the initial memory from the memory file, which saves a lot of memory.
All the new VMs start from the template point, so the guest app can get
to work quickly.
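
Putting the steps of c) together, a sketch under the same assumptions as
the examples above (monitor socket and file names are invented):

# pause the template VM at the template point and save its device state
echo 'stop' | socat - UNIX-CONNECT:/tmp/tpl.mon
echo 'migrate_set_capability bypass-shared-memory on' | socat - UNIX-CONNECT:/tmp/tpl.mon
echo 'migrate "exec:cat > /dev/shm/state"' | socat - UNIX-CONNECT:/tmp/tpl.mon
# the template qemu can be killed now; only /dev/shm/memory and
# /dev/shm/state are needed from here on

# launch a clone: same memory file but WITHOUT share=on, so the mapping is
# copy-on-write and each clone's writes stay private to that clone
qemu-system-x86_64 \
    -object memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory \
    -numa node,nodeid=0,cpus=0-7,memdev=mem \
    -incoming 'exec:cat /dev/shm/state'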

A new VM booted from the template vm can't become a template again;
if you need this unusual chained-template feature, you can write
a cloneable-tmpfs kernel module for it.

The libvirt toolkit can't manage vm-template currently; in
hyperhq/runv we use a qemu wrapper script to do it. I hope someone adds
a "libvirt-managed template" feature to libvirt.

d) feature: yet-another-post-copy-migration
This is a possible feature; no toolkit can do it well yet.
Using an nbd server/client on the memory file is just about workable but
inconvenient. A special feature for tmpfs might be needed to
fully realize it.
No one needs yet another post-copy migration method yet,
but it is possible if someone ever really needs it.

Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
Cc: James O. D. Hunt <james.o.hunt@intel.com>
Cc: Xu Wang <gnawux@gmail.com>
Cc: Peng Tao <bergwolf@gmail.com>
Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
Cc: Xiao Guangrong <xiaoguangrong.eric@gmail.com>
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
---
 migration/migration.c | 13 +++++++++++++
 migration/migration.h |  1 +
 migration/ram.c       | 26 +++++++++++++++++---------
 qapi/migration.json   |  8 +++++++-
 4 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index d780601f0c..880d58889f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1476,6 +1476,19 @@ bool migrate_release_ram(void)
     return s->enabled_capabilities[MIGRATION_CAPABILITY_RELEASE_RAM];
 }
 
+bool migrate_bypass_shared_memory(void)
+{
+    MigrationState *s;
+
+    /* it is not workable with postcopy yet. */
+    if (migrate_postcopy_ram())
+        return false;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY];
+}
+
 bool migrate_postcopy_ram(void)
 {
     MigrationState *s;
diff --git a/migration/migration.h b/migration/migration.h
index 663415fe48..54f0f541de 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -178,6 +178,7 @@ MigrationState *migrate_get_current(void);
 
 bool migrate_postcopy(void);
 
+bool migrate_bypass_shared_memory(void);
 bool migrate_release_ram(void);
 bool migrate_postcopy_ram(void);
 bool migrate_zero_blocks(void);
diff --git a/migration/ram.c b/migration/ram.c
index 021d583b9b..75990dd2ba 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -772,6 +772,10 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
     unsigned long *bitmap = rb->bmap;
     unsigned long next;
 
+    /* the bitmap is NULL when this ramblock is being bypassed */
+    if (!bitmap)
+        return size;
+
     if (rs->ram_bulk_stage && start > 0) {
         next = start + 1;
     } else {
@@ -842,7 +846,9 @@ static void migration_bitmap_sync(RAMState *rs)
     qemu_mutex_lock(&rs->bitmap_mutex);
     rcu_read_lock();
     RAMBLOCK_FOREACH(block) {
-        migration_bitmap_sync_range(rs, block, 0, block->used_length);
+        if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) {
+            migration_bitmap_sync_range(rs, block, 0, block->used_length);
+        }
     }
     rcu_read_unlock();
     qemu_mutex_unlock(&rs->bitmap_mutex);
@@ -2123,18 +2129,12 @@ static int ram_state_init(RAMState **rsp)
     qemu_mutex_init(&(*rsp)->src_page_req_mutex);
     QSIMPLEQ_INIT(&(*rsp)->src_page_requests);
 
-    /*
-     * Count the total number of pages used by ram blocks not including any
-     * gaps due to alignment or unplugs.
-     */
-    (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
-
     ram_state_reset(*rsp);
 
     return 0;
 }
 
-static void ram_list_init_bitmaps(void)
+static void ram_list_init_bitmaps(RAMState *rs)
 {
     RAMBlock *block;
     unsigned long pages;
@@ -2142,9 +2142,17 @@ static void ram_list_init_bitmaps(void)
     /* Skip setting bitmap if there is no RAM */
     if (ram_bytes_total()) {
         QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+            if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
+                continue;
+            }
             pages = block->max_length >> TARGET_PAGE_BITS;
             block->bmap = bitmap_new(pages);
             bitmap_set(block->bmap, 0, pages);
+            /*
+             * Count the total number of pages used by ram blocks not
+             * including any gaps due to alignment or unplugs.
+             */
+            rs->migration_dirty_pages += pages;
             if (migrate_postcopy_ram()) {
                 block->unsentmap = bitmap_new(pages);
                 bitmap_set(block->unsentmap, 0, pages);
@@ -2160,7 +2168,7 @@ static void ram_init_bitmaps(RAMState *rs)
     qemu_mutex_lock_ramlist();
     rcu_read_lock();
 
-    ram_list_init_bitmaps();
+    ram_list_init_bitmaps(rs);
     memory_global_dirty_log_start();
     migration_bitmap_sync(rs);
 
diff --git a/qapi/migration.json b/qapi/migration.json
index 03f57c9616..f18ee1bcc5 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -352,12 +352,17 @@
 #
 # @x-multifd: Use more than one fd for migration (since 2.11)
 #
+# @bypass-shared-memory: the shared memory region will be bypassed on migration.
+#          This feature allows the memory region to be reused by new qemu(s)
+#          or be migrated separately. (since 2.12)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
   'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
            'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
-           'block', 'return-path', 'pause-before-switchover', 'x-multifd' ] }
+           'block', 'return-path', 'pause-before-switchover', 'x-multifd',
+           'bypass-shared-memory'] }
 
 ##
 # @MigrationCapabilityStatus:
@@ -412,6 +417,7 @@
 #       {"state": true, "capability": "events"},
 #       {"state": false, "capability": "postcopy-ram"},
 #       {"state": false, "capability": "x-colo"}
+#       {"state": false, "capability": "bypass-shared-memory"}
 #    ]}
 #
 ##
-- 
2.14.3 (Apple Git-98)


Re: [Qemu-devel] [PATCH] migration: add capability to bypass the shared memory
Posted by Lai Jiangshan 6 years ago
The attached patch is based on v2.11.1.  It was pushed on
https://github.com/hyperhq/qemu v2.11.1-template

The updated patch for upstream qemu is on
https://github.com/hyperhq/qemu upstream-template


On Sat, Mar 31, 2018 at 4:45 PM, Lai Jiangshan <jiangshanlai@gmail.com> wrote:

> ---
>  migration/migration.c | 13 +++++++++++++
>  migration/migration.h |  1 +
>  migration/ram.c       | 26 +++++++++++++++++---------
>  qapi/migration.json   |  8 +++++++-
>  4 files changed, 38 insertions(+), 10 deletions(-)
>

Re: [Qemu-devel] [PATCH] migration: add capability to bypass the shared memory
Posted by Eric Blake 6 years ago
On 03/31/2018 03:45 AM, Lai Jiangshan wrote:
> 1) What's this
> 
> When the migration capability 'bypass-shared-memory'
> is set, the shared memory will be bypassed when migration.
> 
> It is the key feature to enable several excellent features for
> the qemu, such as qemu-local-migration, qemu-live-update,
> extremely-fast-save-restore, vm-template, vm-fast-live-clone,
> yet-another-post-copy-migration, etc..
> 

> 
> Cc: Samuel Ortiz <sameo@linux.intel.com>
> Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Cc: James O. D. Hunt <james.o.hunt@intel.com>
> Cc: Xu Wang <gnawux@gmail.com>
> Cc: Peng Tao <bergwolf@gmail.com>
> Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
> Cc: Xiao Guangrong <xiaoguangrong.eric@gmail.com>
> Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
> ---
>   migration/migration.c | 13 +++++++++++++
>   migration/migration.h |  1 +
>   migration/ram.c       | 26 +++++++++++++++++---------
>   qapi/migration.json   |  8 +++++++-
>   4 files changed, 38 insertions(+), 10 deletions(-)
> 

> +++ b/qapi/migration.json
> @@ -352,12 +352,17 @@
>   #
>   # @x-multifd: Use more than one fd for migration (since 2.11)
>   #
> +# @bypass-shared-memory: the shared memory region will be bypassed on migration.
> +#          This feature allows the memory region to be reused by new qemu(s)
> +#          or be migrated separately. (since 2.12)

This is a new feature and you've missed the freeze for 2.12; this will 
need to be updated to '(since 2.13)' if the patch itself receives 
favorable review.

> +#
>   # Since: 1.2
>   ##
>   { 'enum': 'MigrationCapability',
>     'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
>              'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
> -           'block', 'return-path', 'pause-before-switchover', 'x-multifd' ] }
> +           'block', 'return-path', 'pause-before-switchover', 'x-multifd',
> +           'bypass-shared-memory'] }
>   
>   ##
>   # @MigrationCapabilityStatus:
> @@ -412,6 +417,7 @@
>   #       {"state": true, "capability": "events"},
>   #       {"state": false, "capability": "postcopy-ram"},
>   #       {"state": false, "capability": "x-colo"}
> +#       {"state": false, "capability": "bypass-shared-memory"}

Invalid JSON - you forgot to add the comma on the previous line.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-devel] [PATCH] migration: add capability to bypass the shared memory
Posted by Stefan Hajnoczi 5 years, 9 months ago
On Sat, Mar 31, 2018 at 04:45:00PM +0800, Lai Jiangshan wrote:
> a) feature: qemu-local-migration, qemu-live-update
> Set the mem-path on the tmpfs and set share=on for it when
> start the vm. example:
> -object \
> memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
> -numa node,nodeid=0,cpus=0-7,memdev=mem
> 
> when you want to migrate the vm locally (after fixed a security bug
> of the qemu-binary, or other reason), you can start a new qemu with
> the same command line and -incoming, then you can migrate the
> vm from the old qemu to the new qemu with the migration capability
> 'bypass-shared-memory' set. The migration will migrate the device-state
> *ONLY*, the memory is the origin memory backed by tmpfs file.

Marcelo, Andrea, Paolo: There was a more complex local migration
approach in 2013 with fd passing and vmsplice.  They specifically
avoided the approach proposed in this patch, but I don't remember why.

The closest to an explanation I've found is this message from Marcelo:

  Another possibility is to use memory that is not anonymous for guest
  RAM, such as hugetlbfs or tmpfs.

  IIRC ksm and thp have limitations wrt tmpfs.

https://www.spinics.net/lists/linux-mm/msg67437.html

Have the limitations been solved since then?

> c)  feature: vm-template, vm-fast-live-clone
> the template vm is started as 1), and paused when the guest reaches
> the template point(example: the guest app is ready), then the template
> vm is saved. (the qemu process of the template can be killed now, because
> we need only the memory and the device state files (in tmpfs)).
> 
> Then we can launch one or multiple VMs base on the template vm states,
> the new VMs are started without the “share=on”, all the new VMs share
> the initial memory from the memory file, they save a lot of memory.
> all the new VMs start from the template point, the guest app can go to
> work quickly.
> 
> The new VM booted from template vm can’t become template again,
> if you need this unusual chained-template feature, you can write
> a cloneable-tmpfs kernel module for it.
> 
> The libvirt toolkit can’t manage vm-template currently, in the
> hyperhq/runv, we use qemu wrapper script to do it. I hope someone add
> “libvrit managed template” feature to libvirt.

This feature has been discussed multiple times in the past, and probably
the reason why it's not in libvirt yet is that no one wants it badly
enough to have solved the security issues.

RAM and disk contain secrets like address-space layout randomization,
random number generator state, cryptographic keys, etc.  Both the kernel
and userspace handle secrets, making it hard to isolate all secrets and
wipe them when cloning.

Risks:
1. If one cloned VM is exploited then all other VMs are more likely to
   be exploitable (e.g. kernel address space layout randomization).
2. If you give VMs cloned from the same template to untrusted users,
   they may be able to determine the secrets of other users' VMs.

How are you wiping secrets and re-randomizing cloned VMs?  Security is a
major factor for using Kata, so it's important not to leak secrets
between cloned VMs.

Stefan
Re: [Qemu-devel] [PATCH] migration: add capability to bypass the shared memory
Posted by Peng Tao 5 years, 9 months ago
On Mon, Jul 2, 2018 at 9:10 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Sat, Mar 31, 2018 at 04:45:00PM +0800, Lai Jiangshan wrote:
>> a) feature: qemu-local-migration, qemu-live-update
>> Set the mem-path on the tmpfs and set share=on for it when
>> start the vm. example:
>> -object \
>> memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
>> -numa node,nodeid=0,cpus=0-7,memdev=mem
>>
>> when you want to migrate the vm locally (after fixed a security bug
>> of the qemu-binary, or other reason), you can start a new qemu with
>> the same command line and -incoming, then you can migrate the
>> vm from the old qemu to the new qemu with the migration capability
>> 'bypass-shared-memory' set. The migration will migrate the device-state
>> *ONLY*, the memory is the origin memory backed by tmpfs file.
>
> Marcelo, Andrea, Paolo: There was a more complex local migration
> approach in 2013 with fd passing and vmsplice.  They specifically
> avoided the approach proposed in this patch, but I don't remember why.
>
> The closest to an explanation I've found is this message from Marcelo:
>
>   Another possibility is to use memory that is not anonymous for guest
>   RAM, such as hugetlbfs or tmpfs.
>
>   IIRC ksm and thp have limitations wrt tmpfs.
>
> https://www.spinics.net/lists/linux-mm/msg67437.html
>
> Have the limitations been been solved since then?
>
>> c)  feature: vm-template, vm-fast-live-clone
>> the template vm is started as 1), and paused when the guest reaches
>> the template point(example: the guest app is ready), then the template
>> vm is saved. (the qemu process of the template can be killed now, because
>> we need only the memory and the device state files (in tmpfs)).
>>
>> Then we can launch one or multiple VMs base on the template vm states,
>> the new VMs are started without the “share=on”, all the new VMs share
>> the initial memory from the memory file, they save a lot of memory.
>> all the new VMs start from the template point, the guest app can go to
>> work quickly.
>>
>> The new VM booted from template vm can’t become template again,
>> if you need this unusual chained-template feature, you can write
>> a cloneable-tmpfs kernel module for it.
>>
>> The libvirt toolkit can’t manage vm-template currently, in the
>> hyperhq/runv, we use qemu wrapper script to do it. I hope someone add
>> “libvrit managed template” feature to libvirt.
>
> This feature has been discussed multiple times in the past and probably
> the reason why it's not in libvirt yet is that no one wants it badly
> enough that they have solved the security issues.
>
> RAM and disk contain secrets like address-space layout randomization,
> random number generator state, cryptographic keys, etc.  Both the kernel
> and userspace handle secrets, making it hard to isolate all secrets and
> wipe them when cloning.
>
Hi Stefan,

> Risks:
> 1. If one cloned VM is exploited then all other VMs are more likely to
>    be exploitable (e.g. kernel address space layout randomization).
w.r.t. KASLR, any memory deduplication technology would expose it. I
remember there are CVEs (e.g., CVE-2015-2877) specific to this kind of
attack against KSM, and it was stated that "Basically if you care about
this attack vector, disable deduplication." Share-until-written
approaches for memory conservation among mutually untrusting tenants
are inherently detectable for information disclosure, and can be
classified as potentially misunderstood behaviors rather than
vulnerabilities. [1]

I think the same applies to vm templating as well. Actually VM
templating is more useful (than KSM) in this regard, since we can
create a template for each trusted tenant, whereas with KSM all VMs on
a host are treated equally.

[1] https://access.redhat.com/security/cve/cve-2015-2877

> 2. If you give VMs cloned from the same template to untrusted users,
>    they may be able to determine the secrets other users' VMs.
In kata and runv, vm templating is used carefully so that we do not
use or save any secret keys before creating the template VM. IOW, the
feature is not meant to be used to create arbitrary template
VMs at arbitrary stages.

>
> How are you wiping secrets and re-randomizing cloned VMs?
I think we can write some host-generated random seeds to the guest's
urandom device when cloning VMs from the same template, before handing
them to users. Is that enough, or do you think there is more to do w.r.t.
re-randomizing?
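
For concreteness, a guest-side sketch of that reseeding (the seed
transport and path are invented here; kata would presumably deliver the
seed through an agent request):

# inside the cloned guest, before it is handed to the user:
# writing mixes the bytes into the kernel's entropy pool; actually
# crediting entropy would additionally require the RNDADDENTROPY ioctl
head -c 64 /run/host-seed > /dev/urandom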

>  Security is a
> major factor for using Kata, so it's important not to leak secrets
> between cloned VMs.
>
Yes, indeed! And it is all about trade-offs, whether VM templating or
KSM. If we want security above anything else, we should just disable
all sharing. But there is actually no ceiling (think about physical
isolation!). So it's more about trade-offs. With Kata, VM templating
and KSM give users options to achieve better performance and a lower
memory footprint with little sacrifice. The security advantage of
running VM-based containers is still there.

Cheers,
Tao

Re: [Qemu-devel] [PATCH] migration: add capability to bypass the shared memory
Posted by Andrea Arcangeli 5 years, 9 months ago
Hello,

On Mon, Jul 02, 2018 at 09:52:08PM +0800, Peng Tao wrote:
> I think we can write some host generated random seeds to guest's
> urandom device, when cloning VMs from the same template before handing
> it to users. Is it enough or do you think there are more to do w/
> re-randomizing?

That may be enough, but it's critically important to get
right. Reusing the same /dev/urandom number just twice in two
different operations can lead to a leak of the entire private key, even
if the reused random number itself is not predictable.

You may want to look into the upstream random number generator, which
can be configured at build time to printk() a warning if it is being
used at boot before its "shutdown" state has been restored. It would
be safer if you could re-trigger such a warning post-vmrestore of a
cloned image, in case userland uses a random number before the generator
has been re-seeded post-vmrestore. With a fully loaded userland running
immediately post-vmrestore, a userland race condition would otherwise
risk going unnoticed.

Thanks,
Andrea

Re: [Qemu-devel] [PATCH] migration: add capability to bypass the shared memory
Posted by Peng Tao 5 years, 9 months ago
On Tue, Jul 3, 2018 at 6:15 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hello,
>
> On Mon, Jul 02, 2018 at 09:52:08PM +0800, Peng Tao wrote:
>> I think we can write some host generated random seeds to guest's
>> urandom device, when cloning VMs from the same template before handing
>> it to users. Is it enough or do you think there are more to do w/
>> re-randomizing?
>
> That may be enough, but it's critically important to get
> right. Reusing the same /dev/urandom number just twice on two
> different operations, can lead to leak of the entire private key even
> if the reused random number itself is not predictable.
>
> You may want to look into the upstream random number generator that
> can be configured at build time to printk() a warning if it's being
> used at boot before it had its "shutdown" state restored. It would
> sound safer if you could re-trigger such warning post vmrestore of a
> cloned image if userland uses random number before the random number
> has been re-seeded post vmrestore. With a full loaded userland running
> immediately post vmrestore, an userland race condition would otherwise
> risk to go unnoticed.
>
Good point! Thanks a lot!

Cheers,
Tao

Re: [Qemu-devel] [PATCH] migration: add capability to bypass the shared memory
Posted by Stefan Hajnoczi 5 years, 9 months ago
On Mon, Jul 02, 2018 at 09:52:08PM +0800, Peng Tao wrote:
> On Mon, Jul 2, 2018 at 9:10 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Sat, Mar 31, 2018 at 04:45:00PM +0800, Lai Jiangshan wrote:
> > Risks:
> > 1. If one cloned VM is exploited then all other VMs are more likely to
> >    be exploitable (e.g. kernel address space layout randomization).
> w.r.t. KASLR, any memory duplication technology would expose it. I
> remember there are CVEs (e.g., CVE-2015-2877) specific to this kind
> attack against KSM and it was stated that "Basically if you care about
> this attack vector, disable deduplication.". Share-until-written
> approaches for memory conservation among mutually untrusting tenants
> are inherently detectable for information disclosure, and can be
> classified as potentially misunderstood behaviors rather than
> vulnerabilities. [1]
> 
> I think the same applies to vm templating as well. Actually VM
> templating is more useful (than KSM) in this regard since we can
> create a template for each trusted tenant where as with KSM all VMs on
> a host are treated equally.
> 
> [1] https://access.redhat.com/security/cve/cve-2015-2877

That solves the problem between untrusted users but a breach in one
clone may reveal secrets of all other clones belonging to the same
tenant.  As a user, I would be uncomfortable knowing that if one of my
machines is breached then secrets used by all of my machines might be
exposed.

> > 2. If you give VMs cloned from the same template to untrusted users,
> >    they may be able to determine the secrets other users' VMs.
> In kata and runv, vm templating is used carefully so that we do not
> use or save any secret keys before creating the template VM. IOW, the
> feature is not supposed to be used generally to create any template
> VMs at any stage.

At what point are templates captured to avoid these problems?  Is there
code that shows how to do this?

> >  Security is a
> > major factor for using Kata, so it's important not to leak secrets
> > between cloned VMs.
> >
> Yes, indeed! And it is all about trade-offs, VM templating or KSM. If
> we want security above anything, we should just disable all the
> sharing. But there is actually no ceiling (think about physical
> isolation!). So it's more about trade-offs. With Kata, VM templating
> and KSM give users options to achieve better performance and lower
> memory footprint with little sacrifice. The security advantage of
> running VM-based containers is still there.

Adding options to enable/disable features leads to confusion among
users, makes performance comparisons harder, and increases support
overhead.

Technical solutions to the security problems are possible.  I'm
interested in progress in this area because it means users don't need to
make a choice, they can benefit from the feature without sacrificing
security.

Stefan
Re: [Qemu-devel] [PATCH] migration: add capability to bypass the shared memory
Posted by Peng Tao 5 years, 9 months ago
On Tue, Jul 3, 2018 at 6:05 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Mon, Jul 02, 2018 at 09:52:08PM +0800, Peng Tao wrote:
>> On Mon, Jul 2, 2018 at 9:10 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
>> > On Sat, Mar 31, 2018 at 04:45:00PM +0800, Lai Jiangshan wrote:
>> > Risks:
>> > 1. If one cloned VM is exploited then all other VMs are more likely to
>> >    be exploitable (e.g. kernel address space layout randomization).
>> w.r.t. KASLR, any memory duplication technology would expose it. I
>> remember there are CVEs (e.g., CVE-2015-2877) specific to this kind
>> attack against KSM and it was stated that "Basically if you care about
>> this attack vector, disable deduplication.". Share-until-written
>> approaches for memory conservation among mutually untrusting tenants
>> are inherently detectable for information disclosure, and can be
>> classified as potentially misunderstood behaviors rather than
>> vulnerabilities. [1]
>>
>> I think the same applies to vm templating as well. Actually VM
>> templating is more useful (than KSM) in this regard since we can
>> create a template for each trusted tenant where as with KSM all VMs on
>> a host are treated equally.
>>
>> [1] https://access.redhat.com/security/cve/cve-2015-2877
>
> That solves the problem between untrusted users but a breach in one
> clone may reveal secrets of all other clones belonging to the same
> tenant.  As a user, I would be uncomfortable knowing that if one of my
> machines is breached then secrets used by all of my machines might be
> exposed.
>
Secrets are really point 2 in your list and I'll answer it below :)

>> > 2. If you give VMs cloned from the same template to untrusted users,
>> >    they may be able to determine the secrets other users' VMs.
>> In kata and runv, vm templating is used carefully so that we do not
>> use or save any secret keys before creating the template VM. IOW, the
>> feature is not supposed to be used generally to create any template
>> VMs at any stage.
>
> At what point are templates captured to avoid these problems?  Is there
> code that shows how to do this?
>
Both runv and kata pause the VM right after the agent inside the guest is
up and running, which, in the initramfs case, translates into the
point where the kernel has booted and the init process has started. If you are
interested in seeing the actual code, you can look at
https://github.com/hyperhq/hyperstart/ and
https://github.com/kata-containers/agent for what is done in the guest
at that point. If you see any secrets being saved there, I'll be more
than happy to fix it. :)

>> >  Security is a
>> > major factor for using Kata, so it's important not to leak secrets
>> > between cloned VMs.
>> >
>> Yes, indeed! And it is all about trade-offs, VM templating or KSM. If
>> we want security above anything, we should just disable all the
>> sharing. But there is actually no ceiling (think about physical
>> isolation!). So it's more about trade-offs. With Kata, VM templating
>> and KSM give users options to achieve better performance and lower
>> memory footprint with little sacrifice. The security advantage of
>> running VM-based containers is still there.
>
> Adding options to enable/disable features leads to confusion among
> users, makes performance comparisons harder, and increases support
> overhead.
>
> Technical solutions to the security problems are possible.  I'm
> interested in progress in this area because it means users don't need to
> make a choice, they can benefit from the feature without sacrificing
> security.
>
Well, that is really beyond the scope of reviewing this
particular QEMU patch. But as a Kata developer, I can answer it
anyway.

For one thing, Kata already has quite a few configuration options that
let users choose different features. For another, Kata already
ships with KSM support by default, and VM templating in Kata is better
than KSM in many respects (e.g., it provides a similar level of
memory conservation w/o affecting host and guest performance). So from
the Kata containers point of view, it makes sense to have VM templating
support and let users decide which one they want to use.

PS: CCing kata-dev since the discussion is starting to concern
Kata-specific usage.

Cheers,
Tao

Re: [Qemu-devel] [PATCH] migration: add capability to bypass the shared memory
Posted by Stefan Hajnoczi 5 years, 9 months ago
On Tue, Jul 03, 2018 at 11:10:12PM +0800, Peng Tao wrote:
> On Tue, Jul 3, 2018 at 6:05 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > On Mon, Jul 02, 2018 at 09:52:08PM +0800, Peng Tao wrote:
> >> On Mon, Jul 2, 2018 at 9:10 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> >> > On Sat, Mar 31, 2018 at 04:45:00PM +0800, Lai Jiangshan wrote:
> >> > Risks:
> >> > 1. If one cloned VM is exploited then all other VMs are more likely to
> >> >    be exploitable (e.g. kernel address space layout randomization).
> >> w.r.t. KASLR, any memory duplication technology would expose it. I
> >> remember there are CVEs (e.g., CVE-2015-2877) specific to this kind
> >> attack against KSM and it was stated that "Basically if you care about
> >> this attack vector, disable deduplication.". Share-until-written
> >> approaches for memory conservation among mutually untrusting tenants
> >> are inherently detectable for information disclosure, and can be
> >> classified as potentially misunderstood behaviors rather than
> >> vulnerabilities. [1]
> >>
> >> I think the same applies to vm templating as well. Actually VM
> >> templating is more useful (than KSM) in this regard since we can
> >> create a template for each trusted tenant where as with KSM all VMs on
> >> a host are treated equally.
> >>
> >> [1] https://access.redhat.com/security/cve/cve-2015-2877
> >
> > That solves the problem between untrusted users but a breach in one
> > clone may reveal secrets of all other clones belonging to the same
> > tenant.  As a user, I would be uncomfortable knowing that if one of my
> > machines is breached then secrets used by all of my machines might be
> > exposed.
> >
> Secrets are really point 2 in your list and I'll answer it below :)
> 
> >> > 2. If you give VMs cloned from the same template to untrusted users,
> >> >    they may be able to determine the secrets other users' VMs.
> >> In kata and runv, vm templating is used carefully so that we do not
> >> use or save any secret keys before creating the template VM. IOW, the
> >> feature is not supposed to be used generally to create any template
> >> VMs at any stage.
> >
> > At what point are templates captured to avoid these problems?  Is there
> > code that shows how to do this?
> >
> Both runv and kata pauses the VM right after the agent inside guest is
> up and running, which, in the initramfs case, translates into the
> point that kernel boots and the init process starts. If you are
> interested in seeing the actual code, you can look at
> https://github.com/hyperhq/hyperstart/ and
> https://github.com/kata-containers/agent for what is done in the guest
> at that point. If you see any secrets being saved there, I'll be more
> than happy to fix it. :)

Two things come to mind:

At that point both guest kernel and agent address-space layout
randomization (ASLR) is finished.  ASLR makes it harder for memory
corruption bugs to lead to real exploits because the attacker does not
know the full memory layout of the process.  Cloned VMs will not benefit
from ASLR because much of the memory layout of the guest kernel and
agent will be identical across all clones.

Software random number generators have probably been initialized at this
point.  This doesn't mean that all cloned VMs will produce the same
sequence of random numbers since they should incorporate entropy sources
or use hardware random number generators, but the quality of random
numbers might be reduced.  Someone who knows random number generators
should take a look at this.

> >> >  Security is a
> >> > major factor for using Kata, so it's important not to leak secrets
> >> > between cloned VMs.
> >> >
> >> Yes, indeed! And it is all about trade-offs, VM templating or KSM. If
> >> we want security above anything, we should just disable all the
> >> sharing. But there is actually no ceiling (think about physical
> >> isolation!). So it's more about trade-offs. With Kata, VM templating
> >> and KSM give users options to achieve better performance and lower
> >> memory footprint with little sacrifice. The security advantage of
> >> running VM-based containers is still there.
> >
> > Adding options to enable/disable features leads to confusion among
> > users, makes performance comparisons harder, and increases support
> > overhead.
> >
> > Technical solutions to the security problems are possible.  I'm
> > interested in progress in this area because it means users don't need to
> > make a choice, they can benefit from the feature without sacrificing
> > security.
> >
> Well, that is really beyond the scope of the reviewing of this
> particular QEMU patch. But as a Kata developer, I can answer it
> anyway.
> 
> For one thing, Kata already has quite a few configuration options that
> let users choose different features. For another thing, Kata already
> ships with KSM support by default and VM templating in Kata is better
> off than KSM in many aspects (e.g., by providing similar level of
> memory conservation w/o affecting host and guest performance). So from
> Kata containers point of view, it makes sense to have VM templating
> support and let users decide which one they want to use.
> 
> PS: CCing kata-dev since the discussion starts to concern about Kata
> specific usage.

Thanks for doing that.  I think discussing the security implications of
vm templates (clones) is important specifically for Kata.

Stefan
Re: [Qemu-devel] [PATCH] migration: add capability to bypass the shared memory
Posted by Peng Tao 5 years, 9 months ago
Hi Stefan,

On Tue, Jul 10, 2018 at 9:40 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> Two things come to mind:
>
> At that point both guest kernel and agent address-space layout
> randomization (ASLR) is finished.  ALSR makes it harder for memory
> corruption bugs to lead to real exploits because the attacker does not
> know the full memory layout of the process.  Cloned VMs will not benefit
> from ASLR because much of the memory layout of the guest kernel and
> agent will be identical across all clones.
>
Yes, indeed. I am not arguing that ASLR is retained with VM
templating, just that ASLR is also compromised if one wants to use KSM
to save memory by sharing pages among different guests. Kata is already
shipping with KSM components, and we are adding VM templating as a
better alternative.

> Software random number generators have probably been initialized at this
> point.  This doesn't mean that all cloned VMs will produce the same
> sequence of random numbers since they should incorporate entropy sources
> or use hardware random number generators, but the quality of random
> numbers might be reduced.  Someone who knows random number generators
> should take a look at this.
>
As Andrea pointed out earlier in his comments, we can configure the
random number generator to printk a warning if it is used at boot
before its "shutdown" state has been restored. Then we can add a new
kata-agent request to set the entropy, and check for such a warning after
a new VM is cloned and before it is given to the user. This way, we are
guaranteed that the random numbers generated by each guest are created
with a different seed. Do you have any other concern with this method?

Cheers,
Tao

Re: [Qemu-devel] [PATCH] migration: add capability to bypass the shared memory
Posted by Stefan Hajnoczi 5 years, 9 months ago
On Thu, Jul 12, 2018 at 11:02:08PM +0800, Peng Tao wrote:
> On Tue, Jul 10, 2018 at 9:40 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> > Two things come to mind:
> >
> > At that point both guest kernel and agent address-space layout
> > randomization (ASLR) is finished.  ALSR makes it harder for memory
> > corruption bugs to lead to real exploits because the attacker does not
> > know the full memory layout of the process.  Cloned VMs will not benefit
> > from ASLR because much of the memory layout of the guest kernel and
> > agent will be identical across all clones.
> >
> Yes, indeed. I am not arguing that ASLR is retained with VM
> templating. Just that ASLR is also compromised if one wants to use KSM
> to save memory by sharing among different guests. Kata is already
> shipping with KSM components and we are adding VM templating as a
> better alternative.

Hang on, ASLR is *not* compromised by KSM.  The address space layout is
still unique for each guest, even if KSM deduplicates physical pages on
the host.  Remember ASLR is about virtual addresses while KSM is about
sharing the physical pages.  Therefore KSM does not affect ASLR.

The KSM issue you referred to earlier is a timing side-channel attack.
Being vulnerable to timing side-channel attacks through KSM does not
reduce the effectiveness of ASLR.

> > Software random number generators have probably been initialized at this
> > point.  This doesn't mean that all cloned VMs will produce the same
> > sequence of random numbers since they should incorporate entropy sources
> > or use hardware random number generators, but the quality of random
> > numbers might be reduced.  Someone who knows random number generators
> > should take a look at this.
> >
> As Andrea pointed out earlier in his comments, we can configure the
> random number generator to printk a warning if it's being used at boot
> before it had its "shutdown" state restored. Then we can add a new
> kata-agent request set the entropy and check for such warning after a
> new VM is cloned and before it is given to the user. This way, we are
> guaranteed that random numbers generated by each guest is created with
> a different seed. Do you have other concern with this method?

Sounds good.

Stefan
Re: [Qemu-devel] [PATCH] migration: add capability to bypass the shared memory
Posted by Andrea Arcangeli 5 years, 9 months ago
Hello everyone,

On Mon, Jul 02, 2018 at 02:10:54PM +0100, Stefan Hajnoczi wrote:
> Marcelo, Andrea, Paolo: There was a more complex local migration
> approach in 2013 with fd passing and vmsplice.  They specifically
> avoided the approach proposed in this patch, but I don't remember why.
> 
> The closest to an explanation I've found is this message from Marcelo:
> 
>   Another possibility is to use memory that is not anonymous for guest
>   RAM, such as hugetlbfs or tmpfs.
> 
>   IIRC ksm and thp have limitations wrt tmpfs.
> 
> https://www.spinics.net/lists/linux-mm/msg67437.html
> 
> Have the limitations been been solved since then?

tmpfs supports THP upstream nowadays, but it's not the default.

As for KSM on non-anonymous memory, it's not happening any time soon.
I'm not aware of anybody working on it, and it's a major change that
would not easily stay contained within KSM, so it would take time. There
have been multiple requests for this feature (but usually for bare-metal
containers).

The higher-priority items for KSM are xxhash and/or
accelerated crc32, and perhaps multithreading the scanner to use more
cores. Those are fully self-contained issues and they won't make the
rest of the kernel more complex.

This kind of setup simply inherits some of the limitations that
vhost-user/DPDK also has.

Thanks,
Andrea

[Qemu-devel] [PATCH V3] migration: add capability to bypass the shared memory
Posted by Lai Jiangshan 6 years ago
1) What's this

When the migration capability 'bypass-shared-memory'
is set, shared memory is bypassed during migration.

It is the key feature that enables several powerful capabilities in
qemu, such as qemu-local-migration, qemu-live-update,
extremely-fast-save-restore, vm-template, vm-fast-live-clone,
yet-another-post-copy-migration, etc.

The philosophy behind this key feature, and the advanced features
built on top of it, is that a part of the memory management is
separated out from qemu, so that other toolkits such as libvirt,
kata-containers (https://github.com/kata-containers),
runv (https://github.com/hyperhq/runv/), or several cooperating
qemu processes can directly access it, manage it, and provide
features on top of it.

2) Status in the real world

hyperhq (http://hyper.sh  http://hypercontainer.io/)
introduced the vm-template (vm-fast-live-clone) feature
to the hyper container several years ago, and it works well
(see https://github.com/hyperhq/runv/pull/297).

The vm-template feature lets containers (VMs) be
started in 130ms and saves 80M of memory for every
container (VM), so hyper containers are as fast
and as high-density as normal containers.

The kata-containers project (https://github.com/kata-containers),
which was launched by hyper, intel and friends and which descends
from runv (and clear-container), should have this feature enabled.
Unfortunately, due to a code conflict between runv and cc,
the feature was temporarily disabled; it is being brought
back by the hyper and intel teams.

3) How to use it and bring up the advanced features.

On the current qemu command line, shared memory has
to be configured via a memory backend object.

a) feature: qemu-local-migration, qemu-live-update
Put the mem-path on tmpfs and set share=on for it when
starting the vm, for example:
-object \
memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
-numa node,nodeid=0,cpus=0-7,memdev=mem

When you want to migrate the vm locally (after fixing a security bug
in the qemu binary, or for another reason), you can start a new qemu
with the same command line plus -incoming, then migrate the
vm from the old qemu to the new qemu with the migration capability
'bypass-shared-memory' set. The migration transfers the device state
*ONLY*; the memory remains the original memory, backed by the tmpfs file.
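
The same capability can also be driven over QMP; a hedged sketch (the
-qmp socket path is an assumption, and in practice each command's
response should be read before sending the next):

socat - UNIX-CONNECT:/tmp/old.qmp <<'EOF'
{ "execute": "qmp_capabilities" }
{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities": [
    { "capability": "bypass-shared-memory", "state": true } ] } }
{ "execute": "migrate", "arguments": { "uri": "unix:/tmp/mig.sock" } }
EOF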

b) feature: extremely-fast-save-restore
The same as above, but the mem-path is on a persistent file system.

c) feature: vm-template, vm-fast-live-clone
The template vm is started as in a), and paused when the guest reaches
the template point (example: the guest app is ready); then the template
vm is saved. (The qemu process of the template can be killed now, because
we need only the memory file and the device-state file (in tmpfs).)

Then we can launch one or multiple VMs based on the template vm state.
The new VMs are started without "share=on"; all the new VMs share
the initial memory from the memory file, which saves a lot of memory.
All the new VMs start from the template point, so the guest app can get
to work quickly.

A new VM booted from the template vm can't become a template again;
if you need this unusual chained-template feature, you can write
a cloneable-tmpfs kernel module for it.

The libvirt toolkit can't manage vm-template currently; in
hyperhq/runv we use a qemu wrapper script to do it. I hope someone adds
a "libvirt-managed template" feature to libvirt.

d) feature: yet-another-post-copy-migration
This is a possible feature; no toolkit can do it well yet.
Using an nbd server/client on the memory file is just about workable but
inconvenient. A special feature for tmpfs might be needed to
fully realize it.
No one needs yet another post-copy migration method yet,
but it is possible if someone ever really needs it.

Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
Cc: James O. D. Hunt <james.o.hunt@intel.com>
Cc: Xu Wang <gnawux@gmail.com>
Cc: Peng Tao <bergwolf@gmail.com>
Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
Cc: Xiao Guangrong <xiaoguangrong.eric@gmail.com>
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
---

Changes in V3:
 rebased on upstream master
 update the available version of the capability to
 v2.13

Changes in V2:
 rebased on 2.11.1

 migration/migration.c | 13 +++++++++++++
 migration/migration.h |  1 +
 migration/ram.c       | 26 +++++++++++++++++---------
 qapi/migration.json   |  6 +++++-
 4 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 52a5092add..c5a3591bc7 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1509,6 +1509,19 @@ bool migrate_release_ram(void)
     return s->enabled_capabilities[MIGRATION_CAPABILITY_RELEASE_RAM];
 }
 
+bool migrate_bypass_shared_memory(void)
+{
+    MigrationState *s;
+
+    /* it is not workable with postcopy yet. */
+    if (migrate_postcopy_ram())
+        return false;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY];
+}
+
 bool migrate_postcopy_ram(void)
 {
     MigrationState *s;
diff --git a/migration/migration.h b/migration/migration.h
index 8d2f320c48..cfd2513ef0 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -206,6 +206,7 @@ MigrationState *migrate_get_current(void);
 
 bool migrate_postcopy(void);
 
+bool migrate_bypass_shared_memory(void);
 bool migrate_release_ram(void);
 bool migrate_postcopy_ram(void);
 bool migrate_zero_blocks(void);
diff --git a/migration/ram.c b/migration/ram.c
index 0e90efa092..6881ec1d80 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -780,6 +780,10 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
     unsigned long *bitmap = rb->bmap;
     unsigned long next;
 
+    /* the bitmap is NULL when this ramblock is being bypassed */
+    if (!bitmap)
+        return size;
+
     if (rs->ram_bulk_stage && start > 0) {
         next = start + 1;
     } else {
@@ -850,7 +854,9 @@ static void migration_bitmap_sync(RAMState *rs)
     qemu_mutex_lock(&rs->bitmap_mutex);
     rcu_read_lock();
     RAMBLOCK_FOREACH(block) {
-        migration_bitmap_sync_range(rs, block, 0, block->used_length);
+        if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) {
+            migration_bitmap_sync_range(rs, block, 0, block->used_length);
+        }
     }
     rcu_read_unlock();
     qemu_mutex_unlock(&rs->bitmap_mutex);
@@ -2132,18 +2138,12 @@ static int ram_state_init(RAMState **rsp)
     qemu_mutex_init(&(*rsp)->src_page_req_mutex);
     QSIMPLEQ_INIT(&(*rsp)->src_page_requests);
 
-    /*
-     * Count the total number of pages used by ram blocks not including any
-     * gaps due to alignment or unplugs.
-     */
-    (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
-
     ram_state_reset(*rsp);
 
     return 0;
 }
 
-static void ram_list_init_bitmaps(void)
+static void ram_list_init_bitmaps(RAMState *rs)
 {
     RAMBlock *block;
     unsigned long pages;
@@ -2151,9 +2151,17 @@ static void ram_list_init_bitmaps(void)
     /* Skip setting bitmap if there is no RAM */
     if (ram_bytes_total()) {
         QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+            if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
+                continue;
+            }
             pages = block->max_length >> TARGET_PAGE_BITS;
             block->bmap = bitmap_new(pages);
             bitmap_set(block->bmap, 0, pages);
+            /*
+             * Count the total number of pages used by ram blocks not
+             * including any gaps due to alignment or unplugs.
+             */
+            rs->migration_dirty_pages += pages;
             if (migrate_postcopy_ram()) {
                 block->unsentmap = bitmap_new(pages);
                 bitmap_set(block->unsentmap, 0, pages);
@@ -2169,7 +2177,7 @@ static void ram_init_bitmaps(RAMState *rs)
     qemu_mutex_lock_ramlist();
     rcu_read_lock();
 
-    ram_list_init_bitmaps();
+    ram_list_init_bitmaps(rs);
     memory_global_dirty_log_start();
     migration_bitmap_sync(rs);
 
diff --git a/qapi/migration.json b/qapi/migration.json
index 9d0bf82cf4..45326480bd 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -357,13 +357,17 @@
 # @dirty-bitmaps: If enabled, QEMU will migrate named dirty bitmaps.
 #                 (since 2.12)
 #
+# @bypass-shared-memory: the shared memory region will be bypassed on migration.
+#          This feature allows the memory region to be reused by new qemu(s)
+#          or be migrated separately. (since 2.13)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
   'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
            'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
            'block', 'return-path', 'pause-before-switchover', 'x-multifd',
-           'dirty-bitmaps' ] }
+           'dirty-bitmaps', 'bypass-shared-memory' ] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.14.3 (Apple Git-98)


Re: [Qemu-devel] [PATCH V3] migration: add capability to bypass the shared memory
Posted by no-reply@patchew.org 6 years ago
Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20180401084848.36725-1-jiangshanlai@gmail.com
Subject: [Qemu-devel] [PATCH V3] migration: add capability to bypass the shared memory

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]               patchew/20180401084848.36725-1-jiangshanlai@gmail.com -> patchew/20180401084848.36725-1-jiangshanlai@gmail.com
Switched to a new branch 'test'
8886edc9cf migration: add capability to bypass the shared memory

=== OUTPUT BEGIN ===
Checking PATCH 1/1: migration: add capability to bypass the shared memory...
ERROR: braces {} are necessary for all arms of this statement
#118: FILE: migration/migration.c:1525:
+    if (migrate_postcopy_ram())
[...]

ERROR: braces {} are necessary for all arms of this statement
#150: FILE: migration/ram.c:784:
+    if (!bitmap)
[...]

total: 2 errors, 0 warnings, 108 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Re: [Qemu-devel] [PATCH V3] migration: add capability to bypass the shared memory
Posted by no-reply@patchew.org 6 years ago
Hi,

This series failed docker-mingw@fedora build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

Type: series
Message-id: 20180401084848.36725-1-jiangshanlai@gmail.com
Subject: [Qemu-devel] [PATCH V3] migration: add capability to bypass the shared memory

=== TEST SCRIPT BEGIN ===
#!/bin/bash
set -e
git submodule update --init dtc
# Let docker tests dump environment info
export SHOW_ENV=1
export J=8
time make docker-test-mingw@fedora
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
8886edc9cf migration: add capability to bypass the shared memory

=== OUTPUT BEGIN ===
Submodule 'dtc' (git://git.qemu-project.org/dtc.git) registered for path 'dtc'
Cloning into '/var/tmp/patchew-tester-tmp-_154viaz/src/dtc'...
Submodule path 'dtc': checked out 'e54388015af1fb4bf04d0bca99caba1074d9cc42'
  BUILD   fedora
make[1]: Entering directory '/var/tmp/patchew-tester-tmp-_154viaz/src'
  GEN     /var/tmp/patchew-tester-tmp-_154viaz/src/docker-src.2018-04-01-04.54.41.32446/qemu.tar
Cloning into '/var/tmp/patchew-tester-tmp-_154viaz/src/docker-src.2018-04-01-04.54.41.32446/qemu.tar.vroot'...
done.
Checking out files: 100% (6066/6066), done.
Your branch is up-to-date with 'origin/test'.
Submodule 'dtc' (git://git.qemu-project.org/dtc.git) registered for path 'dtc'
Cloning into '/var/tmp/patchew-tester-tmp-_154viaz/src/docker-src.2018-04-01-04.54.41.32446/qemu.tar.vroot/dtc'...
Submodule path 'dtc': checked out 'e54388015af1fb4bf04d0bca99caba1074d9cc42'
Submodule 'ui/keycodemapdb' (git://git.qemu.org/keycodemapdb.git) registered for path 'ui/keycodemapdb'
Cloning into '/var/tmp/patchew-tester-tmp-_154viaz/src/docker-src.2018-04-01-04.54.41.32446/qemu.tar.vroot/ui/keycodemapdb'...
Submodule path 'ui/keycodemapdb': checked out '6b3d716e2b6472eb7189d3220552280ef3d832ce'
  COPY    RUNNER
    RUN test-mingw in qemu:fedora 
Packages installed:
PyYAML-3.12-5.fc27.x86_64
SDL-devel-1.2.15-29.fc27.x86_64
bc-1.07.1-3.fc27.x86_64
bison-3.0.4-8.fc27.x86_64
bzip2-1.0.6-24.fc27.x86_64
ccache-3.3.6-1.fc27.x86_64
clang-5.0.1-3.fc27.x86_64
findutils-4.6.0-16.fc27.x86_64
flex-2.6.1-5.fc27.x86_64
gcc-7.3.1-5.fc27.x86_64
gcc-c++-7.3.1-5.fc27.x86_64
gettext-0.19.8.1-12.fc27.x86_64
git-2.14.3-3.fc27.x86_64
glib2-devel-2.54.3-2.fc27.x86_64
hostname-3.18-4.fc27.x86_64
libaio-devel-0.3.110-9.fc27.x86_64
libasan-7.3.1-5.fc27.x86_64
libfdt-devel-1.4.6-1.fc27.x86_64
libubsan-7.3.1-5.fc27.x86_64
llvm-5.0.1-3.fc27.x86_64
make-4.2.1-4.fc27.x86_64
mingw32-SDL-1.2.15-9.fc27.noarch
mingw32-bzip2-1.0.6-9.fc27.noarch
mingw32-curl-7.54.1-2.fc27.noarch
mingw32-glib2-2.54.1-1.fc27.noarch
mingw32-gmp-6.1.2-2.fc27.noarch
mingw32-gnutls-3.5.13-2.fc27.noarch
mingw32-gtk2-2.24.31-4.fc27.noarch
mingw32-gtk3-3.22.16-1.fc27.noarch
mingw32-libjpeg-turbo-1.5.1-3.fc27.noarch
mingw32-libpng-1.6.29-2.fc27.noarch
mingw32-libssh2-1.8.0-3.fc27.noarch
mingw32-libtasn1-4.13-1.fc27.noarch
mingw32-nettle-3.3-3.fc27.noarch
mingw32-pixman-0.34.0-3.fc27.noarch
mingw32-pkg-config-0.28-9.fc27.x86_64
mingw64-SDL-1.2.15-9.fc27.noarch
mingw64-bzip2-1.0.6-9.fc27.noarch
mingw64-curl-7.54.1-2.fc27.noarch
mingw64-glib2-2.54.1-1.fc27.noarch
mingw64-gmp-6.1.2-2.fc27.noarch
mingw64-gnutls-3.5.13-2.fc27.noarch
mingw64-gtk2-2.24.31-4.fc27.noarch
mingw64-gtk3-3.22.16-1.fc27.noarch
mingw64-libjpeg-turbo-1.5.1-3.fc27.noarch
mingw64-libpng-1.6.29-2.fc27.noarch
mingw64-libssh2-1.8.0-3.fc27.noarch
mingw64-libtasn1-4.13-1.fc27.noarch
mingw64-nettle-3.3-3.fc27.noarch
mingw64-pixman-0.34.0-3.fc27.noarch
mingw64-pkg-config-0.28-9.fc27.x86_64
nettle-devel-3.4-1.fc27.x86_64
perl-5.26.1-403.fc27.x86_64
pixman-devel-0.34.0-4.fc27.x86_64
python3-3.6.2-13.fc27.x86_64
sparse-0.5.1-2.fc27.x86_64
tar-1.29-7.fc27.x86_64
which-2.21-4.fc27.x86_64
zlib-devel-1.2.11-4.fc27.x86_64

Environment variables:
TARGET_LIST=
PACKAGES=ccache gettext git tar PyYAML sparse flex bison python3 bzip2 hostname     glib2-devel pixman-devel zlib-devel SDL-devel libfdt-devel     gcc gcc-c++ llvm clang make perl which bc findutils libaio-devel     nettle-devel libasan libubsan     mingw32-pixman mingw32-glib2 mingw32-gmp mingw32-SDL mingw32-pkg-config     mingw32-gtk2 mingw32-gtk3 mingw32-gnutls mingw32-nettle mingw32-libtasn1     mingw32-libjpeg-turbo mingw32-libpng mingw32-curl mingw32-libssh2     mingw32-bzip2     mingw64-pixman mingw64-glib2 mingw64-gmp mingw64-SDL mingw64-pkg-config     mingw64-gtk2 mingw64-gtk3 mingw64-gnutls mingw64-nettle mingw64-libtasn1     mingw64-libjpeg-turbo mingw64-libpng mingw64-curl mingw64-libssh2     mingw64-bzip2
J=8
V=
HOSTNAME=0a5568de7c6a
DEBUG=
SHOW_ENV=1
PWD=/
HOME=/root
CCACHE_DIR=/var/tmp/ccache
DISTTAG=f27container
QEMU_CONFIGURE_OPTS=--python=/usr/bin/python3
FGC=f27
TEST_DIR=/tmp/qemu-test
SHLVL=1
FEATURES=mingw clang pyyaml asan dtc
PATH=/usr/lib/ccache:/usr/lib64/ccache:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
MAKEFLAGS= -j8
EXTRA_CONFIGURE_OPTS=
_=/usr/bin/env

Configure options:
--enable-werror --target-list=x86_64-softmmu,aarch64-softmmu --prefix=/tmp/qemu-test/install --python=/usr/bin/python3 --cross-prefix=x86_64-w64-mingw32- --enable-trace-backends=simple --enable-gnutls --enable-nettle --enable-curl --enable-vnc --enable-bzip2 --enable-guest-agent --with-sdlabi=1.2 --with-gtkabi=2.0
Install prefix    /tmp/qemu-test/install
BIOS directory    /tmp/qemu-test/install
firmware path     /tmp/qemu-test/install/share/qemu-firmware
binary directory  /tmp/qemu-test/install
library directory /tmp/qemu-test/install/lib
module directory  /tmp/qemu-test/install/lib
libexec directory /tmp/qemu-test/install/libexec
include directory /tmp/qemu-test/install/include
config directory  /tmp/qemu-test/install
local state directory   queried at runtime
Windows SDK       no
Source path       /tmp/qemu-test/src
GIT binary        git
GIT submodules    
C compiler        x86_64-w64-mingw32-gcc
Host C compiler   cc
C++ compiler      x86_64-w64-mingw32-g++
Objective-C compiler clang
ARFLAGS           rv
CFLAGS            -O2 -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=2 -g 
QEMU_CFLAGS       -I/usr/x86_64-w64-mingw32/sys-root/mingw/include/pixman-1  -I$(SRC_PATH)/dtc/libfdt -Werror -DHAS_LIBSSH2_SFTP_FSYNC -mms-bitfields -I/usr/x86_64-w64-mingw32/sys-root/mingw/include/glib-2.0 -I/usr/x86_64-w64-mingw32/sys-root/mingw/lib/glib-2.0/include -I/usr/x86_64-w64-mingw32/sys-root/mingw/include  -m64 -mcx16 -mthreads -D__USE_MINGW_ANSI_STDIO=1 -DWIN32_LEAN_AND_MEAN -DWINVER=0x501 -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -Wstrict-prototypes -Wredundant-decls -Wall -Wundef -Wwrite-strings -Wmissing-prototypes -fno-strict-aliasing -fno-common -fwrapv  -Wexpansion-to-defined -Wendif-labels -Wno-shift-negative-value -Wno-missing-include-dirs -Wempty-body -Wnested-externs -Wformat-security -Wformat-y2k -Winit-self -Wignored-qualifiers -Wold-style-declaration -Wold-style-definition -Wtype-limits -fstack-protector-strong -I/usr/x86_64-w64-mingw32/sys-root/mingw/include -I/usr/x86_64-w64-mingw32/sys-root/mingw/include/p11-kit-1 -I/usr/x86_64-w64-mingw32/sys-root/mingw/include  -I/usr/x86_64-w64-mingw32/sys-root/mingw/include   -I/usr/x86_64-w64-mingw32/sys-root/mingw/include/libpng16 
LDFLAGS           -Wl,--nxcompat -Wl,--no-seh -Wl,--dynamicbase -Wl,--warn-common -m64 -g 
make              make
install           install
python            /usr/bin/python3 -B
smbd              /usr/sbin/smbd
module support    no
host CPU          x86_64
host big endian   no
target list       x86_64-softmmu aarch64-softmmu
gprof enabled     no
sparse enabled    no
strip binaries    yes
profiler          no
static build      no
SDL support       yes (1.2.15)
GTK support       yes (2.24.31)
GTK GL support    no
VTE support       no 
TLS priority      NORMAL
GNUTLS support    yes
GNUTLS rnd        yes
libgcrypt         no
libgcrypt kdf     no
nettle            yes (3.3)
nettle kdf        yes
libtasn1          yes
curses support    no
virgl support     no
curl support      yes
mingw32 support   yes
Audio drivers     dsound
Block whitelist (rw) 
Block whitelist (ro) 
VirtFS support    no
Multipath support no
VNC support       yes
VNC SASL support  no
VNC JPEG support  yes
VNC PNG support   yes
xen support       no
brlapi support    no
bluez  support    no
Documentation     no
PIE               no
vde support       no
netmap support    no
Linux AIO support no
ATTR/XATTR support no
Install blobs     yes
KVM support       no
HAX support       yes
HVF support       no
WHPX support      no
TCG support       yes
TCG debug enabled no
TCG interpreter   no
malloc trim support no
RDMA support      no
fdt support       yes
membarrier        no
preadv support    no
fdatasync         no
madvise           no
posix_madvise     no
posix_memalign    no
libcap-ng support no
vhost-net support no
vhost-crypto support no
vhost-scsi support no
vhost-vsock support no
vhost-user support no
Trace backends    simple
Trace output file trace-<pid>
spice support     no 
rbd support       no
xfsctl support    no
smartcard support no
libusb            no
usb net redir     no
OpenGL support    no
OpenGL dmabufs    no
libiscsi support  no
libnfs support    no
build guest agent yes
QGA VSS support   no
QGA w32 disk info yes
QGA MSI support   no
seccomp support   no
coroutine backend win32
coroutine pool    yes
debug stack usage no
crypto afalg      no
GlusterFS support no
gcov              gcov
gcov enabled      no
TPM support       yes
libssh2 support   yes
TPM passthrough   no
TPM emulator      no
QOM debugging     yes
Live block migration yes
lzo support       no
snappy support    no
bzip2 support     yes
NUMA host support no
libxml2           no
tcmalloc support  no
jemalloc support  no
avx2 optimization yes
replication support yes
VxHS block device no
capstone          no

WARNING: Use of GTK 2.0 is deprecated and will be removed in
WARNING: future releases. Please switch to using GTK 3.0

WARNING: Use of SDL 1.2 is deprecated and will be removed in
WARNING: future releases. Please switch to using SDL 2.0
mkdir -p dtc/libfdt
mkdir -p dtc/tests
  GEN     x86_64-softmmu/config-devices.mak.tmp
  GEN     config-host.h
  GEN     aarch64-softmmu/config-devices.mak.tmp
  GEN     qemu-options.def
  GEN     qapi-gen
  GEN     trace/generated-tcg-tracers.h
  GEN     trace/generated-helpers-wrappers.h
  GEN     trace/generated-helpers.h
  GEN     trace/generated-helpers.c
  GEN     aarch64-softmmu/config-devices.mak
  GEN     module_block.h
  GEN     x86_64-softmmu/config-devices.mak
  GEN     ui/input-keymap-atset1-to-qcode.c
  GEN     ui/input-keymap-linux-to-qcode.c
  GEN     ui/input-keymap-qcode-to-atset3.c
  GEN     ui/input-keymap-qcode-to-atset2.c
  GEN     ui/input-keymap-qcode-to-atset1.c
  GEN     ui/input-keymap-qcode-to-linux.c
  GEN     ui/input-keymap-qcode-to-qnum.c
  GEN     ui/input-keymap-qcode-to-sun.c
  GEN     ui/input-keymap-qnum-to-qcode.c
  GEN     ui/input-keymap-win32-to-qcode.c
  GEN     ui/input-keymap-usb-to-qcode.c
  GEN     ui/input-keymap-x11-to-qcode.c
  GEN     ui/input-keymap-xorgevdev-to-qcode.c
  GEN     ui/input-keymap-xorgkbd-to-qcode.c
  GEN     ui/input-keymap-xorgxquartz-to-qcode.c
  GEN     ui/input-keymap-xorgxwin-to-qcode.c
  GEN     trace-root.h
  GEN     tests/test-qapi-gen
  GEN     util/trace.h
  GEN     crypto/trace.h
  GEN     io/trace.h
  GEN     migration/trace.h
  GEN     block/trace.h
  GEN     chardev/trace.h
  GEN     hw/block/trace.h
  GEN     hw/block/dataplane/trace.h
  GEN     hw/char/trace.h
  GEN     hw/intc/trace.h
  GEN     hw/net/trace.h
  GEN     hw/rdma/trace.h
  GEN     hw/rdma/vmw/trace.h
  GEN     hw/virtio/trace.h
  GEN     hw/audio/trace.h
  GEN     hw/misc/trace.h
  GEN     hw/misc/macio/trace.h
  GEN     hw/usb/trace.h
  GEN     hw/scsi/trace.h
  GEN     hw/nvram/trace.h
  GEN     hw/display/trace.h
  GEN     hw/input/trace.h
  GEN     hw/timer/trace.h
  GEN     hw/dma/trace.h
  GEN     hw/sparc/trace.h
  GEN     hw/sparc64/trace.h
  GEN     hw/sd/trace.h
  GEN     hw/isa/trace.h
  GEN     hw/mem/trace.h
  GEN     hw/i386/trace.h
  GEN     hw/i386/xen/trace.h
  GEN     hw/9pfs/trace.h
  GEN     hw/ppc/trace.h
  GEN     hw/pci/trace.h
  GEN     hw/pci-host/trace.h
  GEN     hw/s390x/trace.h
  GEN     hw/vfio/trace.h
  GEN     hw/acpi/trace.h
  GEN     hw/arm/trace.h
  GEN     hw/alpha/trace.h
  GEN     hw/hppa/trace.h
  GEN     hw/xen/trace.h
  GEN     hw/ide/trace.h
  GEN     hw/tpm/trace.h
  GEN     ui/trace.h
  GEN     audio/trace.h
  GEN     net/trace.h
  GEN     target/arm/trace.h
  GEN     target/i386/trace.h
  GEN     target/mips/trace.h
  GEN     target/sparc/trace.h
  GEN     target/s390x/trace.h
  GEN     target/ppc/trace.h
  GEN     qom/trace.h
  GEN     linux-user/trace.h
  GEN     qapi/trace.h
  GEN     accel/tcg/trace.h
  GEN     accel/kvm/trace.h
  GEN     nbd/trace.h
  GEN     scsi/trace.h
  GEN     trace-root.c
  GEN     util/trace.c
  GEN     crypto/trace.c
  GEN     io/trace.c
  GEN     migration/trace.c
  GEN     block/trace.c
  GEN     chardev/trace.c
  GEN     hw/block/trace.c
  GEN     hw/block/dataplane/trace.c
  GEN     hw/char/trace.c
  GEN     hw/intc/trace.c
  GEN     hw/net/trace.c
  GEN     hw/rdma/trace.c
  GEN     hw/rdma/vmw/trace.c
  GEN     hw/virtio/trace.c
  GEN     hw/audio/trace.c
  GEN     hw/misc/trace.c
  GEN     hw/misc/macio/trace.c
  GEN     hw/usb/trace.c
  GEN     hw/scsi/trace.c
  GEN     hw/nvram/trace.c
  GEN     hw/display/trace.c
  GEN     hw/input/trace.c
  GEN     hw/timer/trace.c
  GEN     hw/dma/trace.c
  GEN     hw/sparc/trace.c
  GEN     hw/sparc64/trace.c
  GEN     hw/sd/trace.c
  GEN     hw/isa/trace.c
  GEN     hw/mem/trace.c
  GEN     hw/i386/trace.c
  GEN     hw/i386/xen/trace.c
  GEN     hw/9pfs/trace.c
  GEN     hw/ppc/trace.c
  GEN     hw/pci/trace.c
  GEN     hw/pci-host/trace.c
  GEN     hw/s390x/trace.c
  GEN     hw/vfio/trace.c
  GEN     hw/acpi/trace.c
  GEN     hw/arm/trace.c
  GEN     hw/alpha/trace.c
  GEN     hw/hppa/trace.c
  GEN     hw/xen/trace.c
  GEN     hw/ide/trace.c
  GEN     hw/tpm/trace.c
  GEN     ui/trace.c
  GEN     audio/trace.c
  GEN     net/trace.c
  GEN     target/arm/trace.c
  GEN     target/i386/trace.c
  GEN     target/mips/trace.c
  GEN     target/sparc/trace.c
  GEN     target/s390x/trace.c
  GEN     target/ppc/trace.c
  GEN     qom/trace.c
  GEN     linux-user/trace.c
  GEN     qapi/trace.c
  GEN     accel/tcg/trace.c
  GEN     accel/kvm/trace.c
  GEN     nbd/trace.c
  GEN     scsi/trace.c
  GEN     config-all-devices.mak
	 DEP /tmp/qemu-test/src/dtc/tests/dumptrees.c
	 DEP /tmp/qemu-test/src/dtc/tests/trees.S
	 DEP /tmp/qemu-test/src/dtc/tests/testutils.c
	 DEP /tmp/qemu-test/src/dtc/tests/value-labels.c
	 DEP /tmp/qemu-test/src/dtc/tests/asm_tree_dump.c
	 DEP /tmp/qemu-test/src/dtc/tests/truncated_property.c
	 DEP /tmp/qemu-test/src/dtc/tests/check_path.c
	 DEP /tmp/qemu-test/src/dtc/tests/overlay_bad_fixup.c
	 DEP /tmp/qemu-test/src/dtc/tests/overlay.c
	 DEP /tmp/qemu-test/src/dtc/tests/subnode_iterate.c
	 DEP /tmp/qemu-test/src/dtc/tests/property_iterate.c
	 DEP /tmp/qemu-test/src/dtc/tests/integer-expressions.c
	 DEP /tmp/qemu-test/src/dtc/tests/utilfdt_test.c
	 DEP /tmp/qemu-test/src/dtc/tests/path_offset_aliases.c
	 DEP /tmp/qemu-test/src/dtc/tests/add_subnode_with_nops.c
	 DEP /tmp/qemu-test/src/dtc/tests/dtbs_equal_unordered.c
	 DEP /tmp/qemu-test/src/dtc/tests/dtb_reverse.c
	 DEP /tmp/qemu-test/src/dtc/tests/dtbs_equal_ordered.c
	 DEP /tmp/qemu-test/src/dtc/tests/extra-terminating-null.c
	 DEP /tmp/qemu-test/src/dtc/tests/incbin.c
	 DEP /tmp/qemu-test/src/dtc/tests/boot-cpuid.c
	 DEP /tmp/qemu-test/src/dtc/tests/phandle_format.c
	 DEP /tmp/qemu-test/src/dtc/tests/path-references.c
	 DEP /tmp/qemu-test/src/dtc/tests/references.c
	 DEP /tmp/qemu-test/src/dtc/tests/string_escapes.c
	 DEP /tmp/qemu-test/src/dtc/tests/propname_escapes.c
	 DEP /tmp/qemu-test/src/dtc/tests/appendprop2.c
	 DEP /tmp/qemu-test/src/dtc/tests/appendprop1.c
	 DEP /tmp/qemu-test/src/dtc/tests/del_node.c
	 DEP /tmp/qemu-test/src/dtc/tests/del_property.c
	 DEP /tmp/qemu-test/src/dtc/tests/setprop.c
	 DEP /tmp/qemu-test/src/dtc/tests/rw_tree1.c
	 DEP /tmp/qemu-test/src/dtc/tests/set_name.c
	 DEP /tmp/qemu-test/src/dtc/tests/open_pack.c
	 DEP /tmp/qemu-test/src/dtc/tests/nopulate.c
	 DEP /tmp/qemu-test/src/dtc/tests/mangle-layout.c
	 DEP /tmp/qemu-test/src/dtc/tests/move_and_save.c
	 DEP /tmp/qemu-test/src/dtc/tests/sw_tree1.c
	 DEP /tmp/qemu-test/src/dtc/tests/nop_node.c
	 DEP /tmp/qemu-test/src/dtc/tests/nop_property.c
	 DEP /tmp/qemu-test/src/dtc/tests/setprop_inplace.c
	 DEP /tmp/qemu-test/src/dtc/tests/stringlist.c
	 DEP /tmp/qemu-test/src/dtc/tests/addr_size_cells.c
	 DEP /tmp/qemu-test/src/dtc/tests/notfound.c
	 DEP /tmp/qemu-test/src/dtc/tests/sized_cells.c
	 DEP /tmp/qemu-test/src/dtc/tests/char_literal.c
	 DEP /tmp/qemu-test/src/dtc/tests/get_alias.c
	 DEP /tmp/qemu-test/src/dtc/tests/node_offset_by_compatible.c
	 DEP /tmp/qemu-test/src/dtc/tests/node_check_compatible.c
	 DEP /tmp/qemu-test/src/dtc/tests/node_offset_by_phandle.c
	 DEP /tmp/qemu-test/src/dtc/tests/node_offset_by_prop_value.c
	 DEP /tmp/qemu-test/src/dtc/tests/parent_offset.c
	 DEP /tmp/qemu-test/src/dtc/tests/supernode_atdepth_offset.c
	 DEP /tmp/qemu-test/src/dtc/tests/get_path.c
	 DEP /tmp/qemu-test/src/dtc/tests/get_phandle.c
	 DEP /tmp/qemu-test/src/dtc/tests/getprop.c
	 DEP /tmp/qemu-test/src/dtc/tests/get_name.c
	 DEP /tmp/qemu-test/src/dtc/tests/path_offset.c
	 DEP /tmp/qemu-test/src/dtc/tests/subnode_offset.c
	 DEP /tmp/qemu-test/src/dtc/tests/find_property.c
	 DEP /tmp/qemu-test/src/dtc/tests/root_node.c
	 DEP /tmp/qemu-test/src/dtc/tests/get_mem_rsv.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_overlay.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_addresses.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_empty_tree.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_strerror.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_rw.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_sw.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_wip.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt_ro.c
	 DEP /tmp/qemu-test/src/dtc/libfdt/fdt.c
	 DEP /tmp/qemu-test/src/dtc/util.c
	 DEP /tmp/qemu-test/src/dtc/fdtoverlay.c
	 DEP /tmp/qemu-test/src/dtc/fdtput.c
	 DEP /tmp/qemu-test/src/dtc/fdtget.c
	 DEP /tmp/qemu-test/src/dtc/fdtdump.c
	 LEX convert-dtsv0-lexer.lex.c
	 DEP /tmp/qemu-test/src/dtc/srcpos.c
	 BISON dtc-parser.tab.c
	 LEX dtc-lexer.lex.c
	 DEP /tmp/qemu-test/src/dtc/treesource.c
	 DEP /tmp/qemu-test/src/dtc/livetree.c
	 DEP /tmp/qemu-test/src/dtc/fstree.c
	 DEP /tmp/qemu-test/src/dtc/flattree.c
	 DEP /tmp/qemu-test/src/dtc/dtc.c
	 DEP /tmp/qemu-test/src/dtc/data.c
	 DEP /tmp/qemu-test/src/dtc/checks.c
	 DEP convert-dtsv0-lexer.lex.c
	 DEP dtc-parser.tab.c
	 DEP dtc-lexer.lex.c
	CHK version_gen.h
	UPD version_gen.h
	 DEP /tmp/qemu-test/src/dtc/util.c
	 CC libfdt/fdt.o
	 CC libfdt/fdt_ro.o
	 CC libfdt/fdt_wip.o
	 CC libfdt/fdt_empty_tree.o
	 CC libfdt/fdt_sw.o
	 CC libfdt/fdt_strerror.o
	 CC libfdt/fdt_addresses.o
	 CC libfdt/fdt_rw.o
	 CC libfdt/fdt_overlay.o
	 AR libfdt/libfdt.a
x86_64-w64-mingw32-ar: creating libfdt/libfdt.a
a - libfdt/fdt.o
a - libfdt/fdt_ro.o
a - libfdt/fdt_wip.o
a - libfdt/fdt_sw.o
a - libfdt/fdt_rw.o
a - libfdt/fdt_strerror.o
a - libfdt/fdt_empty_tree.o
a - libfdt/fdt_addresses.o
a - libfdt/fdt_overlay.o
  RC      version.o
mkdir -p dtc/libfdt
mkdir -p dtc/tests
  GEN     qga/qapi-generated/qapi-gen
  CC      qapi/qapi-types.o
  CC      qapi/qapi-builtin-types.o
  CC      qapi/qapi-types-block-core.o
  CC      qapi/qapi-types-char.o
  CC      qapi/qapi-types-common.o
  CC      qapi/qapi-types-crypto.o
  CC      qapi/qapi-types-block.o
  CC      qapi/qapi-types-introspect.o
  CC      qapi/qapi-types-migration.o
  CC      qapi/qapi-types-misc.o
  CC      qapi/qapi-types-net.o
  CC      qapi/qapi-types-rocker.o
  CC      qapi/qapi-types-run-state.o
  CC      qapi/qapi-types-sockets.o
  CC      qapi/qapi-types-tpm.o
  CC      qapi/qapi-types-trace.o
  CC      qapi/qapi-types-transaction.o
  CC      qapi/qapi-types-ui.o
  CC      qapi/qapi-builtin-visit.o
  CC      qapi/qapi-visit.o
  CC      qapi/qapi-visit-block-core.o
  CC      qapi/qapi-visit-block.o
  CC      qapi/qapi-visit-char.o
  CC      qapi/qapi-visit-common.o
  CC      qapi/qapi-visit-crypto.o
  CC      qapi/qapi-visit-introspect.o
  CC      qapi/qapi-visit-migration.o
  CC      qapi/qapi-visit-misc.o
  CC      qapi/qapi-visit-net.o
  CC      qapi/qapi-visit-rocker.o
  CC      qapi/qapi-visit-run-state.o
  CC      qapi/qapi-visit-sockets.o
  CC      qapi/qapi-visit-tpm.o
  CC      qapi/qapi-visit-trace.o
  CC      qapi/qapi-visit-transaction.o
  CC      qapi/qapi-visit-ui.o
  CC      qapi/qapi-events.o
  CC      qapi/qapi-events-block-core.o
  CC      qapi/qapi-events-block.o
  CC      qapi/qapi-events-char.o
make: *** [/tmp/qemu-test/src/rules.mak:66: qapi/qapi-types.o] Error 1
make: *** Waiting for unfinished jobs....
Traceback (most recent call last):
  File "./tests/docker/docker.py", line 407, in <module>
    sys.exit(main())
  File "./tests/docker/docker.py", line 404, in main
    return args.cmdobj.run(args, argv)
  File "./tests/docker/docker.py", line 261, in run
    return Docker().run(argv, args.keep, quiet=args.quiet)
  File "./tests/docker/docker.py", line 229, in run
    quiet=quiet)
  File "./tests/docker/docker.py", line 147, in _do_check
    return subprocess.check_call(self._command + cmd, **kwargs)
  File "/usr/lib64/python2.7/subprocess.py", line 186, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['docker', 'run', '--label', 'com.qemu.instance.uuid=624ffe64358a11e8aeea52540069c830', '-u', '0', '--security-opt', 'seccomp=unconfined', '--rm', '--net=none', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=8', '-e', 'DEBUG=', '-e', 'SHOW_ENV=1', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/root/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-_154viaz/src/docker-src.2018-04-01-04.54.41.32446:/var/tmp/qemu:z,ro', 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit status 2
make[1]: *** [tests/docker/Makefile.include:129: docker-run] Error 1
make[1]: Leaving directory '/var/tmp/patchew-tester-tmp-_154viaz/src'
make: *** [tests/docker/Makefile.include:163: docker-run-test-mingw@fedora] Error 2

real	2m16.109s
user	0m9.531s
sys	0m8.757s
=== OUTPUT END ===

Test command exited with code: 2


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
[Qemu-devel] [PATCH V4] migration: add capability to bypass the shared memory
Posted by Lai Jiangshan 6 years ago
1) What's this

When the migration capability 'bypass-shared-memory'
is set, shared memory regions are skipped during migration.

This is the key building block for several useful qemu
features, such as qemu-local-migration, qemu-live-update,
extremely-fast-save-restore, vm-template, vm-fast-live-clone,
yet-another-post-copy-migration, etc.

The philosophy behind this capability, and the advanced
features built on it, is that part of the memory management
is separated out from qemu, letting other toolkits
such as libvirt, kata-containers (https://github.com/kata-containers),
runv (https://github.com/hyperhq/runv/) or several cooperating
qemu processes access the memory directly, manage it, and
provide features on top of it.

2) Status in the real world

hyperhq (http://hyper.sh, http://hypercontainer.io/)
introduced the vm-template (vm-fast-live-clone) feature
to hyper containers several years ago, and it works
perfectly (see https://github.com/hyperhq/runv/pull/297).

With vm-template, containers (VMs) can be started in
130ms, saving 80M of memory per container (VM), so hyper
containers are as fast and high-density as normal containers.

The kata-containers project (https://github.com/kata-containers),
which was launched by hyper, intel and friends and which descended
from runv (and clear-container), should have this feature enabled.
Unfortunately, due to code conflicts between runv and cc,
the feature was temporarily disabled; it is being brought
back by the hyper and intel teams.

3) How to use and bring up advanced features.

On the current qemu command line, shared memory has
to be configured via a memory backend object.

a) feature: qemu-local-migration, qemu-live-update
Set the mem-path to a tmpfs file and set share=on for it when
starting the vm. Example:
-object \
memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
-numa node,nodeid=0,cpus=0-7,memdev=mem

When you want to migrate the vm locally (after fixing a security
bug in the qemu binary, or for any other reason), you can start a
new qemu with the same command line plus -incoming, then migrate
the vm from the old qemu to the new qemu with the migration
capability 'bypass-shared-memory' set. The migration transfers the
device state *ONLY*; the memory stays in place, backed by the
tmpfs file.
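
For illustration, the source-side monitor sequence could look like
this (the destination address is made up; migrate_set_capability
and migrate are the standard HMP commands):

(qemu) migrate_set_capability bypass-shared-memory on
(qemu) migrate -d tcp:127.0.0.1:4444

or the QMP equivalent:

{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities": [
    { "capability": "bypass-shared-memory", "state": true } ] } }
{ "execute": "migrate", "arguments": { "uri": "tcp:127.0.0.1:4444" } }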

b) feature: extremely-fast-save-restore
The same as above, but with the mem-path on a persistent file
system.
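
For example (the path below is illustrative; any persistent
filesystem will do):

-object \
memory-backend-file,id=mem,size=128M,mem-path=/var/lib/qemu/vm0-mem,share=on \
-numa node,nodeid=0,cpus=0-7,memdev=mem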

c) feature: vm-template, vm-fast-live-clone
The template vm is started as in a), and paused when the guest
reaches the template point (for example, when the guest app is
ready); then the template vm state is saved. (The template's qemu
process can be killed now, because only the memory and the device
state files (in tmpfs) are needed.)

Then we can launch one or more VMs based on the template vm
state. The new VMs are started without "share=on"; they all share
the initial memory from the memory file, which saves a lot of
memory. All the new VMs start from the template point, so the
guest app can get to work quickly.
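
A rough sketch of the whole template/clone flow, assuming this
patch is applied (paths are illustrative; the clones map the
memory file without share=on, so their writes stay private
copy-on-write):

on the template vm's monitor, once the guest reaches the
template point:
(qemu) stop
(qemu) migrate_set_capability bypass-shared-memory on
(qemu) migrate "exec:cat > /dev/shm/template-state"

then start each clone from the shared memory file plus the saved
device state:
qemu-system-x86_64 ... \
-object memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory \
-numa node,nodeid=0,cpus=0-7,memdev=mem \
-incoming "exec:cat /dev/shm/template-state"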

A new VM booted from a template vm can't become a template again;
if you need this unusual chained-template feature, you can write
a cloneable-tmpfs kernel module for it.

The libvirt toolkit can't manage vm-template currently; in
hyperhq/runv, we use a qemu wrapper script to do it. I hope
someone adds a "libvirt managed template" feature to libvirt.

d) feature: yet-another-post-copy-migration
This is a possible feature, though no toolkit can do it well yet.
Using an nbd server/client on the memory file is just about OK
but inconvenient; a special feature for tmpfs might be needed to
complete this fully.
No one needs yet another post-copy migration method,
but it is possible should some crazy person need it.

Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
Cc: James O. D. Hunt <james.o.hunt@intel.com>
Cc: Xu Wang <gnawux@gmail.com>
Cc: Peng Tao <bergwolf@gmail.com>
Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
Cc: Xiao Guangrong <xiaoguangrong.eric@gmail.com>
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
---

Changes in V4:
 fixes checkpatch.pl errors

Changes in V3:
 rebased on upstream master
 update the available version of the capability to
 v2.13

Changes in V2:
 rebased on 2.11.1

 migration/migration.c | 14 ++++++++++++++
 migration/migration.h |  1 +
 migration/ram.c       | 27 ++++++++++++++++++---------
 qapi/migration.json   |  6 +++++-
 4 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 52a5092add..6a63102d7f 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1509,6 +1509,20 @@ bool migrate_release_ram(void)
     return s->enabled_capabilities[MIGRATION_CAPABILITY_RELEASE_RAM];
 }
 
+bool migrate_bypass_shared_memory(void)
+{
+    MigrationState *s;
+
+    /* it is not workable with postcopy yet. */
+    if (migrate_postcopy_ram()) {
+        return false;
+    }
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY];
+}
+
 bool migrate_postcopy_ram(void)
 {
     MigrationState *s;
diff --git a/migration/migration.h b/migration/migration.h
index 8d2f320c48..cfd2513ef0 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -206,6 +206,7 @@ MigrationState *migrate_get_current(void);
 
 bool migrate_postcopy(void);
 
+bool migrate_bypass_shared_memory(void);
 bool migrate_release_ram(void);
 bool migrate_postcopy_ram(void);
 bool migrate_zero_blocks(void);
diff --git a/migration/ram.c b/migration/ram.c
index 0e90efa092..bca170c386 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -780,6 +780,11 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
     unsigned long *bitmap = rb->bmap;
     unsigned long next;
 
+    /* when this ramblock is requested bypassing */
+    if (!bitmap) {
+        return size;
+    }
+
     if (rs->ram_bulk_stage && start > 0) {
         next = start + 1;
     } else {
@@ -850,7 +855,9 @@ static void migration_bitmap_sync(RAMState *rs)
     qemu_mutex_lock(&rs->bitmap_mutex);
     rcu_read_lock();
     RAMBLOCK_FOREACH(block) {
-        migration_bitmap_sync_range(rs, block, 0, block->used_length);
+        if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) {
+            migration_bitmap_sync_range(rs, block, 0, block->used_length);
+        }
     }
     rcu_read_unlock();
     qemu_mutex_unlock(&rs->bitmap_mutex);
@@ -2132,18 +2139,12 @@ static int ram_state_init(RAMState **rsp)
     qemu_mutex_init(&(*rsp)->src_page_req_mutex);
     QSIMPLEQ_INIT(&(*rsp)->src_page_requests);
 
-    /*
-     * Count the total number of pages used by ram blocks not including any
-     * gaps due to alignment or unplugs.
-     */
-    (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
-
     ram_state_reset(*rsp);
 
     return 0;
 }
 
-static void ram_list_init_bitmaps(void)
+static void ram_list_init_bitmaps(RAMState *rs)
 {
     RAMBlock *block;
     unsigned long pages;
@@ -2151,9 +2152,17 @@ static void ram_list_init_bitmaps(void)
     /* Skip setting bitmap if there is no RAM */
     if (ram_bytes_total()) {
         QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+            if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
+                continue;
+            }
             pages = block->max_length >> TARGET_PAGE_BITS;
             block->bmap = bitmap_new(pages);
             bitmap_set(block->bmap, 0, pages);
+            /*
+             * Count the total number of pages used by ram blocks not
+             * including any gaps due to alignment or unplugs.
+             */
+            rs->migration_dirty_pages += pages;
             if (migrate_postcopy_ram()) {
                 block->unsentmap = bitmap_new(pages);
                 bitmap_set(block->unsentmap, 0, pages);
@@ -2169,7 +2178,7 @@ static void ram_init_bitmaps(RAMState *rs)
     qemu_mutex_lock_ramlist();
     rcu_read_lock();
 
-    ram_list_init_bitmaps();
+    ram_list_init_bitmaps(rs);
     memory_global_dirty_log_start();
     migration_bitmap_sync(rs);
 
diff --git a/qapi/migration.json b/qapi/migration.json
index 9d0bf82cf4..45326480bd 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -357,13 +357,17 @@
 # @dirty-bitmaps: If enabled, QEMU will migrate named dirty bitmaps.
 #                 (since 2.12)
 #
+# @bypass-shared-memory: the shared memory region will be bypassed on migration.
+#          This feature allows the memory region to be reused by new qemu(s)
+#          or be migrated separately. (since 2.13)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
   'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
            'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
            'block', 'return-path', 'pause-before-switchover', 'x-multifd',
-           'dirty-bitmaps' ] }
+           'dirty-bitmaps', 'bypass-shared-memory' ] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.14.3 (Apple Git-98)


Re: [Qemu-devel] [PATCH V4] migration: add capability to bypass the shared memory
Posted by Xiao Guangrong 6 years ago

On 04/04/2018 07:47 PM, Lai Jiangshan wrote:
> 1) What's this
> 
> When the migration capability 'bypass-shared-memory'
> is set, the shared memory will be bypassed when migration.
> 
> It is the key feature to enable several excellent features for
> the qemu, such as qemu-local-migration, qemu-live-update,
> extremely-fast-save-restore, vm-template, vm-fast-live-clone,
> yet-another-post-copy-migration, etc..
> 
> The philosophy behind this key feature, including the resulting
> advanced key features, is that a part of the memory management
> is separated out from the qemu, and let the other toolkits
> such as libvirt, kata-containers (https://github.com/kata-containers)
> runv(https://github.com/hyperhq/runv/) or some multiple cooperative
> qemu commands directly access to it, manage it, provide features on it.
> 
> 2) Status in real world
> 
> The hyperhq(http://hyper.sh  http://hypercontainer.io/)
> introduced the feature vm-template(vm-fast-live-clone)
> to the hyper container for several years, it works perfect.
> (see https://github.com/hyperhq/runv/pull/297).
> 
> The feature vm-template makes the containers(VMs) can
> be started in 130ms and save 80M memory for every
> container(VM). So that the hyper containers are fast
> and high-density as normal containers.
> 
> kata-containers project (https://github.com/kata-containers)
> which was launched by hyper, intel and friends and which descended
> from runv (and clear-container) should have this feature enabled.
> Unfortunately, due to the code confliction between runv&cc,
> this feature was temporary disabled and it is being brought
> back by hyper and intel team.
> 
> 3) How to use and bring up advanced features.
> 
> In current qemu command line, shared memory has
> to be configured via memory-object.
> 
> a) feature: qemu-local-migration, qemu-live-update
> Set the mem-path on the tmpfs and set share=on for it when
> start the vm. example:
> -object \
> memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
> -numa node,nodeid=0,cpus=0-7,memdev=mem
> 
> when you want to migrate the vm locally (after fixed a security bug
> of the qemu-binary, or other reason), you can start a new qemu with
> the same command line and -incoming, then you can migrate the
> vm from the old qemu to the new qemu with the migration capability
> 'bypass-shared-memory' set. The migration will migrate the device-state
> *ONLY*, the memory is the origin memory backed by tmpfs file.
> 
> b) feature: extremely-fast-save-restore
> the same above, but the mem-path is on the persistent file system.
> 
> c)  feature: vm-template, vm-fast-live-clone
> the template vm is started as 1), and paused when the guest reaches
> the template point(example: the guest app is ready), then the template
> vm is saved. (the qemu process of the template can be killed now, because
> we need only the memory and the device state files (in tmpfs)).
> 
> Then we can launch one or multiple VMs base on the template vm states,
> the new VMs are started without the “share=on”, all the new VMs share
> the initial memory from the memory file, they save a lot of memory.
> all the new VMs start from the template point, the guest app can go to
> work quickly.
> 
> The new VM booted from template vm can’t become template again,
> if you need this unusual chained-template feature, you can write
> a cloneable-tmpfs kernel module for it.
> 
> The libvirt toolkit can’t manage vm-template currently, in the
> hyperhq/runv, we use qemu wrapper script to do it. I hope someone add
> “libvrit managed template” feature to libvirt.
> 
> d) feature: yet-another-post-copy-migration
> It is a possible feature, no toolkit can do it well now.
> Using nbd server/client on the memory file is reluctantly Ok but
> inconvenient. A special feature for tmpfs might be needed to
> fully complete this feature.
> No one need yet another post copy migration method,
> but it is possible when some crazy man need it.

Excellent work. :)

It's a brilliant feature that can improve our production a lot.

Reviewed-by: Xiao Guangrong <xiaoguangrong@tencent.com>

Re: [Qemu-devel] [PATCH V4] migration: add capability to bypass the shared memory
Posted by Dr. David Alan Gilbert 6 years ago
Hi,

* Lai Jiangshan (jiangshanlai@gmail.com) wrote:
> 1) What's this
> 
> When the migration capability 'bypass-shared-memory'
> is set, the shared memory will be bypassed when migration.
> 
> It is the key feature to enable several excellent features for
> the qemu, such as qemu-local-migration, qemu-live-update,
> extremely-fast-save-restore, vm-template, vm-fast-live-clone,
> yet-another-post-copy-migration, etc..
> 
> The philosophy behind this key feature, including the resulting
> advanced key features, is that a part of the memory management
> is separated out from the qemu, and let the other toolkits
> such as libvirt, kata-containers (https://github.com/kata-containers)
> runv(https://github.com/hyperhq/runv/) or some multiple cooperative
> qemu commands directly access to it, manage it, provide features on it.
> 
> 2) Status in real world
> 
> The hyperhq(http://hyper.sh  http://hypercontainer.io/)
> introduced the feature vm-template(vm-fast-live-clone)
> to the hyper container for several years, it works perfect.
> (see https://github.com/hyperhq/runv/pull/297).
> 
> The feature vm-template makes the containers(VMs) can
> be started in 130ms and save 80M memory for every
> container(VM). So that the hyper containers are fast
> and high-density as normal containers.
> 
> kata-containers project (https://github.com/kata-containers)
> which was launched by hyper, intel and friends and which descended
> from runv (and clear-container) should have this feature enabled.
> Unfortunately, due to the code confliction between runv&cc,
> this feature was temporary disabled and it is being brought
> back by hyper and intel team.
> 
> 3) How to use and bring up advanced features.
> 
> In current qemu command line, shared memory has
> to be configured via memory-object.
> 
> a) feature: qemu-local-migration, qemu-live-update
> Set the mem-path on the tmpfs and set share=on for it when
> start the vm. example:
> -object \
> memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
> -numa node,nodeid=0,cpus=0-7,memdev=mem
> 
> when you want to migrate the vm locally (after fixed a security bug
> of the qemu-binary, or other reason), you can start a new qemu with
> the same command line and -incoming, then you can migrate the
> vm from the old qemu to the new qemu with the migration capability
> 'bypass-shared-memory' set. The migration will migrate the device-state
> *ONLY*, the memory is the origin memory backed by tmpfs file.
> 
> b) feature: extremely-fast-save-restore
> the same above, but the mem-path is on the persistent file system.
> 
> c)  feature: vm-template, vm-fast-live-clone
> the template vm is started as 1), and paused when the guest reaches
> the template point(example: the guest app is ready), then the template
> vm is saved. (the qemu process of the template can be killed now, because
> we need only the memory and the device state files (in tmpfs)).
> 
> Then we can launch one or multiple VMs base on the template vm states,
> the new VMs are started without the “share=on”, all the new VMs share
> the initial memory from the memory file, they save a lot of memory.
> all the new VMs start from the template point, the guest app can go to
> work quickly.

How do you handle the storage in this case, or give each VM its
own MAC address?

> The new VM booted from template vm can’t become template again,
> if you need this unusual chained-template feature, you can write
> a cloneable-tmpfs kernel module for it.
> 
> The libvirt toolkit can’t manage vm-template currently, in the
> hyperhq/runv, we use qemu wrapper script to do it. I hope someone add
> “libvrit managed template” feature to libvirt.

> d) feature: yet-another-post-copy-migration
> It is a possible feature, no toolkit can do it well now.
> Using nbd server/client on the memory file is reluctantly Ok but
> inconvenient. A special feature for tmpfs might be needed to
> fully complete this feature.
> No one need yet another post copy migration method,
> but it is possible when some crazy man need it.

As the crazy person who did the existing postcopy: one is enough!

Some minor fix requests below, but this looks nice and simple.

Shared memory is interesting because there are lots of different
uses; e.g. your uses, but also vhost-user, which shares memory for
a completely different reason.

> Cc: Samuel Ortiz <sameo@linux.intel.com>
> Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Cc: James O. D. Hunt <james.o.hunt@intel.com>
> Cc: Xu Wang <gnawux@gmail.com>
> Cc: Peng Tao <bergwolf@gmail.com>
> Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
> Cc: Xiao Guangrong <xiaoguangrong.eric@gmail.com>
> Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
> ---
> 
> Changes in V4:
>  fixes checkpatch.pl errors
> 
> Changes in V3:
>  rebased on upstream master
>  update the available version of the capability to
>  v2.13
> 
> Changes in V2:
>  rebased on 2.11.1
> 
>  migration/migration.c | 14 ++++++++++++++
>  migration/migration.h |  1 +
>  migration/ram.c       | 27 ++++++++++++++++++---------
>  qapi/migration.json   |  6 +++++-
>  4 files changed, 38 insertions(+), 10 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 52a5092add..6a63102d7f 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1509,6 +1509,20 @@ bool migrate_release_ram(void)
>      return s->enabled_capabilities[MIGRATION_CAPABILITY_RELEASE_RAM];
>  }
>  
> +bool migrate_bypass_shared_memory(void)
> +{
> +    MigrationState *s;
> +
> +    /* it is not workable with postcopy yet. */
> +    if (migrate_postcopy_ram()) {
> +        return false;
> +    }

Please change this to work in the same way as the check for
postcopy+compress in migration.c migrate_caps_check.
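
A sketch of the kind of check being requested, modeled on the
existing postcopy+compress validation in migrate_caps_check()
(the error wording here is assumed, not taken from any patch):

    if (cap_list[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY] &&
        cap_list[MIGRATION_CAPABILITY_POSTCOPY_RAM]) {
        /* refuse the combination when the capabilities are set,
         * rather than silently ignoring it at migration time */
        error_setg(errp, "Bypass shared memory is not compatible "
                         "with postcopy ram");
        return false;
    }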

> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY];
> +}
> +
>  bool migrate_postcopy_ram(void)
>  {
>      MigrationState *s;
> diff --git a/migration/migration.h b/migration/migration.h
> index 8d2f320c48..cfd2513ef0 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -206,6 +206,7 @@ MigrationState *migrate_get_current(void);
>  
>  bool migrate_postcopy(void);
>  
> +bool migrate_bypass_shared_memory(void);
>  bool migrate_release_ram(void);
>  bool migrate_postcopy_ram(void);
>  bool migrate_zero_blocks(void);
> diff --git a/migration/ram.c b/migration/ram.c
> index 0e90efa092..bca170c386 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -780,6 +780,11 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
>      unsigned long *bitmap = rb->bmap;
>      unsigned long next;
>  
> +    /* when this ramblock is requested bypassing */
> +    if (!bitmap) {
> +        return size;
> +    }
> +
>      if (rs->ram_bulk_stage && start > 0) {
>          next = start + 1;
>      } else {
> @@ -850,7 +855,9 @@ static void migration_bitmap_sync(RAMState *rs)
>      qemu_mutex_lock(&rs->bitmap_mutex);
>      rcu_read_lock();
>      RAMBLOCK_FOREACH(block) {
> -        migration_bitmap_sync_range(rs, block, 0, block->used_length);
> +        if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) {
> +            migration_bitmap_sync_range(rs, block, 0, block->used_length);
> +        }
>      }
>      rcu_read_unlock();
>      qemu_mutex_unlock(&rs->bitmap_mutex);
> @@ -2132,18 +2139,12 @@ static int ram_state_init(RAMState **rsp)
>      qemu_mutex_init(&(*rsp)->src_page_req_mutex);
>      QSIMPLEQ_INIT(&(*rsp)->src_page_requests);
>  
> -    /*
> -     * Count the total number of pages used by ram blocks not including any
> -     * gaps due to alignment or unplugs.
> -     */
> -    (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
> -
>      ram_state_reset(*rsp);
>  
>      return 0;
>  }
>  
> -static void ram_list_init_bitmaps(void)
> +static void ram_list_init_bitmaps(RAMState *rs)
>  {
>      RAMBlock *block;
>      unsigned long pages;
> @@ -2151,9 +2152,17 @@ static void ram_list_init_bitmaps(void)
>      /* Skip setting bitmap if there is no RAM */
>      if (ram_bytes_total()) {

I think you need to add here a :
   rs->migration_dirty_pages = 0;

I don't see anywhere else that initialises it, and there is the case of
a migration that fails, followed by a 2nd attempt.

>          QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
> +            if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
> +                continue;
> +            }
>              pages = block->max_length >> TARGET_PAGE_BITS;
>              block->bmap = bitmap_new(pages);
>              bitmap_set(block->bmap, 0, pages);
> +            /*
> +             * Count the total number of pages used by ram blocks not
> +             * including any gaps due to alignment or unplugs.
> +             */
> +            rs->migration_dirty_pages += pages;
>              if (migrate_postcopy_ram()) {
>                  block->unsentmap = bitmap_new(pages);
>                  bitmap_set(block->unsentmap, 0, pages);
> @@ -2169,7 +2178,7 @@ static void ram_init_bitmaps(RAMState *rs)
>      qemu_mutex_lock_ramlist();
>      rcu_read_lock();
>  
> -    ram_list_init_bitmaps();
> +    ram_list_init_bitmaps(rs);
>      memory_global_dirty_log_start();
>      migration_bitmap_sync(rs);
>  
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 9d0bf82cf4..45326480bd 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -357,13 +357,17 @@
>  # @dirty-bitmaps: If enabled, QEMU will migrate named dirty bitmaps.
>  #                 (since 2.12)
>  #
> +# @bypass-shared-memory: the shared memory region will be bypassed on migration.
> +#          This feature allows the memory region to be reused by new qemu(s)
> +#          or be migrated separately. (since 2.13)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
>             'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
>             'block', 'return-path', 'pause-before-switchover', 'x-multifd',
> -           'dirty-bitmaps' ] }
> +           'dirty-bitmaps', 'bypass-shared-memory' ] }
>  
>  ##
>  # @MigrationCapabilityStatus:
> -- 
> 2.14.3 (Apple Git-98)
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH V4] migration: add capability to bypass the shared memory
Posted by Lai Jiangshan 6 years ago
On Tue, Apr 10, 2018 at 1:30 AM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:
> Hi,
>
> * Lai Jiangshan (jiangshanlai@gmail.com) wrote:
>> 1) What's this
>>
>> When the migration capability 'bypass-shared-memory'
>> is set, the shared memory will be bypassed when migration.
>>
>> It is the key feature to enable several excellent features for
>> the qemu, such as qemu-local-migration, qemu-live-update,
>> extremely-fast-save-restore, vm-template, vm-fast-live-clone,
>> yet-another-post-copy-migration, etc..
>>
>> The philosophy behind this key feature, including the resulting
>> advanced key features, is that a part of the memory management
>> is separated out from the qemu, and let the other toolkits
>> such as libvirt, kata-containers (https://github.com/kata-containers)
>> runv(https://github.com/hyperhq/runv/) or some multiple cooperative
>> qemu commands directly access to it, manage it, provide features on it.
>>
>> 2) Status in real world
>>
>> The hyperhq(http://hyper.sh  http://hypercontainer.io/)
>> introduced the feature vm-template(vm-fast-live-clone)
>> to the hyper container for several years, it works perfect.
>> (see https://github.com/hyperhq/runv/pull/297).
>>
>> The feature vm-template makes the containers(VMs) can
>> be started in 130ms and save 80M memory for every
>> container(VM). So that the hyper containers are fast
>> and high-density as normal containers.
>>
>> kata-containers project (https://github.com/kata-containers)
>> which was launched by hyper, intel and friends and which descended
>> from runv (and clear-container) should have this feature enabled.
>> Unfortunately, due to the code confliction between runv&cc,
>> this feature was temporary disabled and it is being brought
>> back by hyper and intel team.
>>
>> 3) How to use and bring up advanced features.
>>
>> In current qemu command line, shared memory has
>> to be configured via memory-object.
>>
>> a) feature: qemu-local-migration, qemu-live-update
>> Set the mem-path on the tmpfs and set share=on for it when
>> start the vm. example:
>> -object \
>> memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
>> -numa node,nodeid=0,cpus=0-7,memdev=mem
>>
>> when you want to migrate the vm locally (after fixed a security bug
>> of the qemu-binary, or other reason), you can start a new qemu with
>> the same command line and -incoming, then you can migrate the
>> vm from the old qemu to the new qemu with the migration capability
>> 'bypass-shared-memory' set. The migration will migrate the device-state
>> *ONLY*, the memory is the origin memory backed by tmpfs file.
>>
>> b) feature: extremely-fast-save-restore
>> the same above, but the mem-path is on the persistent file system.
>>
>> c)  feature: vm-template, vm-fast-live-clone
>> the template vm is started as 1), and paused when the guest reaches
>> the template point(example: the guest app is ready), then the template
>> vm is saved. (the qemu process of the template can be killed now, because
>> we need only the memory and the device state files (in tmpfs)).
>>
>> Then we can launch one or multiple VMs base on the template vm states,
>> the new VMs are started without the “share=on”, all the new VMs share
>> the initial memory from the memory file, they save a lot of memory.
>> all the new VMs start from the template point, the guest app can go to
>> work quickly.
>
> How do you handle the storage in this case, or giving each VM it's own
> MAC address?

The user or the upper-layer tools can copy/clone the storage
(on xfs, btrfs, ceph...), and can likewise handle the interface
MAC themselves; this patch focuses only on memory.

hyper/runv clones the vm before the interfaces are inserted;
vm-template is often used along with hotplugging.

>
>> The new VM booted from template vm can’t become template again,
>> if you need this unusual chained-template feature, you can write
>> a cloneable-tmpfs kernel module for it.
>>
>> The libvirt toolkit can’t manage vm-template currently, in the
>> hyperhq/runv, we use qemu wrapper script to do it. I hope someone add
>> “libvrit managed template” feature to libvirt.
>
>> d) feature: yet-another-post-copy-migration
>> It is a possible feature, no toolkit can do it well now.
>> Using nbd server/client on the memory file is reluctantly Ok but
>> inconvenient. A special feature for tmpfs might be needed to
>> fully complete this feature.
>> No one need yet another post copy migration method,
>> but it is possible when some crazy man need it.
>
> As the crazy person who did the existing postcopy; one is enough!
>

Very true. This part of the comments just shows how much
potential there is for such a simple migration capability.


> Some minor fix requests below, but this looks nice and simple.
>

Will do soon. Thanks for your review.

> Shared memory is interesting because tehre are lots of different uses;
> e.g. your uses, but also vhost-user which is sharing for a completely
> different reason.
>
>> Cc: Samuel Ortiz <sameo@linux.intel.com>
>> Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
>> Cc: James O. D. Hunt <james.o.hunt@intel.com>
>> Cc: Xu Wang <gnawux@gmail.com>
>> Cc: Peng Tao <bergwolf@gmail.com>
>> Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
>> Cc: Xiao Guangrong <xiaoguangrong.eric@gmail.com>
>> Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
>> ---
>>
>> Changes in V4:
>>  fixes checkpatch.pl errors
>>
>> Changes in V3:
>>  rebased on upstream master
>>  update the available version of the capability to
>>  v2.13
>>
>> Changes in V2:
>>  rebased on 2.11.1
>>
>>  migration/migration.c | 14 ++++++++++++++
>>  migration/migration.h |  1 +
>>  migration/ram.c       | 27 ++++++++++++++++++---------
>>  qapi/migration.json   |  6 +++++-
>>  4 files changed, 38 insertions(+), 10 deletions(-)
>>
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 52a5092add..6a63102d7f 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -1509,6 +1509,20 @@ bool migrate_release_ram(void)
>>      return s->enabled_capabilities[MIGRATION_CAPABILITY_RELEASE_RAM];
>>  }
>>
>> +bool migrate_bypass_shared_memory(void)
>> +{
>> +    MigrationState *s;
>> +
>> +    /* it is not workable with postcopy yet. */
>> +    if (migrate_postcopy_ram()) {
>> +        return false;
>> +    }
>
> Please change this to work in the same way as the check for
> postcopy+compress in migration.c migrate_caps_check.
>
>> +    s = migrate_get_current();
>> +
>> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY];
>> +}
>> +
>>  bool migrate_postcopy_ram(void)
>>  {
>>      MigrationState *s;
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 8d2f320c48..cfd2513ef0 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -206,6 +206,7 @@ MigrationState *migrate_get_current(void);
>>
>>  bool migrate_postcopy(void);
>>
>> +bool migrate_bypass_shared_memory(void);
>>  bool migrate_release_ram(void);
>>  bool migrate_postcopy_ram(void);
>>  bool migrate_zero_blocks(void);
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 0e90efa092..bca170c386 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -780,6 +780,11 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
>>      unsigned long *bitmap = rb->bmap;
>>      unsigned long next;
>>
>> +    /* when this ramblock is requested bypassing */
>> +    if (!bitmap) {
>> +        return size;
>> +    }
>> +
>>      if (rs->ram_bulk_stage && start > 0) {
>>          next = start + 1;
>>      } else {
>> @@ -850,7 +855,9 @@ static void migration_bitmap_sync(RAMState *rs)
>>      qemu_mutex_lock(&rs->bitmap_mutex);
>>      rcu_read_lock();
>>      RAMBLOCK_FOREACH(block) {
>> -        migration_bitmap_sync_range(rs, block, 0, block->used_length);
>> +        if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) {
>> +            migration_bitmap_sync_range(rs, block, 0, block->used_length);
>> +        }
>>      }
>>      rcu_read_unlock();
>>      qemu_mutex_unlock(&rs->bitmap_mutex);
>> @@ -2132,18 +2139,12 @@ static int ram_state_init(RAMState **rsp)
>>      qemu_mutex_init(&(*rsp)->src_page_req_mutex);
>>      QSIMPLEQ_INIT(&(*rsp)->src_page_requests);
>>
>> -    /*
>> -     * Count the total number of pages used by ram blocks not including any
>> -     * gaps due to alignment or unplugs.
>> -     */
>> -    (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
>> -
>>      ram_state_reset(*rsp);
>>
>>      return 0;
>>  }
>>
>> -static void ram_list_init_bitmaps(void)
>> +static void ram_list_init_bitmaps(RAMState *rs)
>>  {
>>      RAMBlock *block;
>>      unsigned long pages;
>> @@ -2151,9 +2152,17 @@ static void ram_list_init_bitmaps(void)
>>      /* Skip setting bitmap if there is no RAM */
>>      if (ram_bytes_total()) {
>
> I think you need to add here a :
>    rs->migration_dirty_pages = 0;
>
> I don't see anywhere else that initialises it, and there is the case of
> a migration that fails, followed by a 2nd attempt.
>
>>          QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
>> +            if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
>> +                continue;
>> +            }
>>              pages = block->max_length >> TARGET_PAGE_BITS;
>>              block->bmap = bitmap_new(pages);
>>              bitmap_set(block->bmap, 0, pages);
>> +            /*
>> +             * Count the total number of pages used by ram blocks not
>> +             * including any gaps due to alignment or unplugs.
>> +             */
>> +            rs->migration_dirty_pages += pages;
>>              if (migrate_postcopy_ram()) {
>>                  block->unsentmap = bitmap_new(pages);
>>                  bitmap_set(block->unsentmap, 0, pages);
>> @@ -2169,7 +2178,7 @@ static void ram_init_bitmaps(RAMState *rs)
>>      qemu_mutex_lock_ramlist();
>>      rcu_read_lock();
>>
>> -    ram_list_init_bitmaps();
>> +    ram_list_init_bitmaps(rs);
>>      memory_global_dirty_log_start();
>>      migration_bitmap_sync(rs);
>>
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 9d0bf82cf4..45326480bd 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -357,13 +357,17 @@
>>  # @dirty-bitmaps: If enabled, QEMU will migrate named dirty bitmaps.
>>  #                 (since 2.12)
>>  #
>> +# @bypass-shared-memory: the shared memory region will be bypassed on migration.
>> +#          This feature allows the memory region to be reused by new qemu(s)
>> +#          or be migrated separately. (since 2.13)
>> +#
>>  # Since: 1.2
>>  ##
>>  { 'enum': 'MigrationCapability',
>>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
>>             'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
>>             'block', 'return-path', 'pause-before-switchover', 'x-multifd',
>> -           'dirty-bitmaps' ] }
>> +           'dirty-bitmaps', 'bypass-shared-memory' ] }
>>
>>  ##
>>  # @MigrationCapabilityStatus:
>> --
>> 2.14.3 (Apple Git-98)
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

[Qemu-devel] [PATCH V5] migration: add capability to bypass the shared memory
Posted by Lai Jiangshan 6 years ago
1) What's this

When the migration capability 'bypass-shared-memory'
is set, shared memory regions are bypassed during migration.

It is the key building block that enables several excellent
features for qemu, such as qemu-local-migration, qemu-live-update,
extremely-fast-save-restore, vm-template, vm-fast-live-clone,
yet-another-post-copy-migration, and so on.

The philosophy behind this key feature, and the advanced
features that result from it, is that a part of the memory
management is separated out from qemu, letting other toolkits
such as libvirt, kata-containers (https://github.com/kata-containers),
runv (https://github.com/hyperhq/runv/) or several cooperating
qemu processes directly access it, manage it, and provide features on it.

2) Status in real world

hyperhq (http://hyper.sh  http://hypercontainer.io/)
introduced the vm-template (vm-fast-live-clone) feature
to hyper containers several years ago, and it works well
(see https://github.com/hyperhq/runv/pull/297).

With vm-template, containers (VMs) can be started in 130ms
and save 80M of memory per container (VM), so the hyper
containers are as fast and as high-density as normal
containers.

The kata-containers project (https://github.com/kata-containers),
which was launched by hyper, intel and friends and which descended
from runv (and clear-container), should have this feature enabled.
Unfortunately, due to code conflicts between runv and clear-container,
the feature was temporarily disabled; it is being brought
back by the hyper and intel teams.

3) How to use and bring up advanced features.

On the current qemu command line, shared memory has
to be configured via a memory backend object.

a) feature: qemu-local-migration, qemu-live-update
Set the mem-path to a file on tmpfs and set share=on for it when
starting the vm, for example:
-object \
memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
-numa node,nodeid=0,cpus=0-7,memdev=mem

When you want to migrate the vm locally (after fixing a security bug
in the qemu binary, or for any other reason), you can start a new qemu
with the same command line plus -incoming, then migrate the vm from
the old qemu to the new qemu with the migration capability
'bypass-shared-memory' set. The migration transfers the device state
*ONLY*; the memory is the original memory backed by the tmpfs file.
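
A minimal sketch of this flow (the unix socket path is hypothetical;
any migration transport works):

  # new (fixed) qemu: same command line as the old one, plus
  #   -incoming "unix:/tmp/qemu-local-mig.sock"

  # old qemu, in the monitor:
  (qemu) migrate_set_capability bypass-shared-memory on
  (qemu) migrate "unix:/tmp/qemu-local-mig.sock"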

b) feature: extremely-fast-save-restore
The same as above, but with the mem-path on a persistent file system.

c) feature: vm-template, vm-fast-live-clone
The template vm is started as in a), and paused when the guest reaches
the template point (for example, when the guest app is ready); then
the template vm is saved. (The template's qemu process can be killed
now, because we only need the memory and the device state files, both
in tmpfs.)

Then we can launch one or more VMs based on the template vm state.
The new VMs are started without "share=on"; they all share the
initial memory from the memory file, which saves a lot of memory.
All the new VMs start from the template point, so the guest app can
get to work quickly.
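
A minimal sketch of the template/clone flow (the file names are
hypothetical):

  # template vm (started with share=on), in the monitor:
  (qemu) stop
  (qemu) migrate_set_capability bypass-shared-memory on
  (qemu) migrate "exec:cat > /dev/shm/template-devstate"

  # each clone: same command line but *without* share=on (so its
  # writes stay private to the clone), plus
  #   -incoming "exec:cat /dev/shm/template-devstate"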

A new VM booted from the template vm can't become a template again;
if you need this unusual chained-template feature, you can write
a cloneable-tmpfs kernel module for it.

The libvirt toolkit can't manage vm-template currently; in
hyperhq/runv we use a qemu wrapper script to do it. I hope someone
adds a "libvirt managed template" feature to libvirt.

d) feature: yet-another-post-copy-migration
It is a possible feature, but no toolkit can do it well yet.
Using an nbd server/client on the memory file is workable but
inconvenient; a special feature for tmpfs might be needed to
fully complete this feature.
Nobody needs yet another post-copy migration method right now,
but it is possible should some crazy person ever need it.

Cc: Juan Quintela <quintela@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
Cc: James O. D. Hunt <james.o.hunt@intel.com>
Cc: Xu Wang <gnawux@gmail.com>
Cc: Peng Tao <bergwolf@gmail.com>
Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
Cc: Xiao Guangrong <xiaoguangrong.eric@gmail.com>
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
---

Changes in V5:
 check capability conflict in migrate_caps_check()

Changes in V4:
 fixes checkpatch.pl errors

Changes in V3:
 rebased on upstream master
 update the available version of the capability to
 v2.13

Changes in V2:
 rebased on 2.11.1

 migration/migration.c | 22 ++++++++++++++++++++++
 migration/migration.h |  1 +
 migration/ram.c       | 27 ++++++++++++++++++---------
 qapi/migration.json   |  6 +++++-
 4 files changed, 46 insertions(+), 10 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 52a5092add..110b40f6d4 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -736,6 +736,19 @@ static bool migrate_caps_check(bool *cap_list,
             return false;
         }
 
+        if (cap_list[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY]) {
+            /* Bypass and postcopy are quite conflicting ways
+             * to get memory in the destination, and there is
+             * currently no code to discriminate between them
+             * and handle the conflicts.  It should be possible
+             * to fix, but using both ways together is generally
+             * useless anyway.
+             */
+            error_setg(errp, "Bypass is not currently compatible "
+                       "with postcopy");
+            return false;
+        }
+
         /* This check is reasonably expensive, so only when it's being
          * set the first time, also it's only the destination that needs
          * special support.
@@ -1509,6 +1522,15 @@ bool migrate_release_ram(void)
     return s->enabled_capabilities[MIGRATION_CAPABILITY_RELEASE_RAM];
 }
 
+bool migrate_bypass_shared_memory(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY];
+}
+
 bool migrate_postcopy_ram(void)
 {
     MigrationState *s;
diff --git a/migration/migration.h b/migration/migration.h
index 8d2f320c48..cfd2513ef0 100644
--- a/migration/migration.h
+++ b/migration/migration.h
@@ -206,6 +206,7 @@ MigrationState *migrate_get_current(void);
 
 bool migrate_postcopy(void);
 
+bool migrate_bypass_shared_memory(void);
 bool migrate_release_ram(void);
 bool migrate_postcopy_ram(void);
 bool migrate_zero_blocks(void);
diff --git a/migration/ram.c b/migration/ram.c
index 0e90efa092..bca170c386 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -780,6 +780,11 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
     unsigned long *bitmap = rb->bmap;
     unsigned long next;
 
+    /* when bypassing was requested for this ramblock */
+    if (!bitmap) {
+        return size;
+    }
+
     if (rs->ram_bulk_stage && start > 0) {
         next = start + 1;
     } else {
@@ -850,7 +855,9 @@ static void migration_bitmap_sync(RAMState *rs)
     qemu_mutex_lock(&rs->bitmap_mutex);
     rcu_read_lock();
     RAMBLOCK_FOREACH(block) {
-        migration_bitmap_sync_range(rs, block, 0, block->used_length);
+        if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) {
+            migration_bitmap_sync_range(rs, block, 0, block->used_length);
+        }
     }
     rcu_read_unlock();
     qemu_mutex_unlock(&rs->bitmap_mutex);
@@ -2132,18 +2139,12 @@ static int ram_state_init(RAMState **rsp)
     qemu_mutex_init(&(*rsp)->src_page_req_mutex);
     QSIMPLEQ_INIT(&(*rsp)->src_page_requests);
 
-    /*
-     * Count the total number of pages used by ram blocks not including any
-     * gaps due to alignment or unplugs.
-     */
-    (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
-
     ram_state_reset(*rsp);
 
     return 0;
 }
 
-static void ram_list_init_bitmaps(void)
+static void ram_list_init_bitmaps(RAMState *rs)
 {
     RAMBlock *block;
     unsigned long pages;
@@ -2151,9 +2152,17 @@ static void ram_list_init_bitmaps(void)
     /* Skip setting bitmap if there is no RAM */
     if (ram_bytes_total()) {
         QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+            if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
+                continue;
+            }
             pages = block->max_length >> TARGET_PAGE_BITS;
             block->bmap = bitmap_new(pages);
             bitmap_set(block->bmap, 0, pages);
+            /*
+             * Count the total number of pages used by ram blocks not
+             * including any gaps due to alignment or unplugs.
+             */
+            rs->migration_dirty_pages += pages;
             if (migrate_postcopy_ram()) {
                 block->unsentmap = bitmap_new(pages);
                 bitmap_set(block->unsentmap, 0, pages);
@@ -2169,7 +2178,7 @@ static void ram_init_bitmaps(RAMState *rs)
     qemu_mutex_lock_ramlist();
     rcu_read_lock();
 
-    ram_list_init_bitmaps();
+    ram_list_init_bitmaps(rs);
     memory_global_dirty_log_start();
     migration_bitmap_sync(rs);
 
diff --git a/qapi/migration.json b/qapi/migration.json
index 9d0bf82cf4..45326480bd 100644
--- a/qapi/migration.json
+++ b/qapi/migration.json
@@ -357,13 +357,17 @@
 # @dirty-bitmaps: If enabled, QEMU will migrate named dirty bitmaps.
 #                 (since 2.12)
 #
+# @bypass-shared-memory: If enabled, shared memory regions will be bypassed on
+#          migration. This allows the memory regions to be reused by new
+#          qemu(s) or migrated separately. (since 2.13)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
   'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
            'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
            'block', 'return-path', 'pause-before-switchover', 'x-multifd',
-           'dirty-bitmaps' ] }
+           'dirty-bitmaps', 'bypass-shared-memory' ] }
 
 ##
 # @MigrationCapabilityStatus:
-- 
2.15.1 (Apple Git-101)


Re: [Qemu-devel] [PATCH V5] migration: add capability to bypass the shared memory
Posted by Dr. David Alan Gilbert 6 years ago
* Lai Jiangshan (jiangshanlai@gmail.com) wrote:
> 1) What's this
> 
> When the migration capability 'bypass-shared-memory'
> is set, the shared memory will be bypassed when migration.
> 
> It is the key feature to enable several excellent features for
> the qemu, such as qemu-local-migration, qemu-live-update,
> extremely-fast-save-restore, vm-template, vm-fast-live-clone,
> yet-another-post-copy-migration, etc..
> 
> The philosophy behind this key feature, including the resulting
> advanced key features, is that a part of the memory management
> is separated out from the qemu, and let the other toolkits
> such as libvirt, kata-containers (https://github.com/kata-containers)
> runv(https://github.com/hyperhq/runv/) or some multiple cooperative
> qemu commands directly access to it, manage it, provide features on it.
> 
> 2) Status in real world
> 
> The hyperhq(http://hyper.sh  http://hypercontainer.io/)
> introduced the feature vm-template(vm-fast-live-clone)
> to the hyper container for several years, it works perfect.
> (see https://github.com/hyperhq/runv/pull/297).
> 
> The feature vm-template makes the containers(VMs) can
> be started in 130ms and save 80M memory for every
> container(VM). So that the hyper containers are fast
> and high-density as normal containers.
> 
> kata-containers project (https://github.com/kata-containers)
> which was launched by hyper, intel and friends and which descended
> from runv (and clear-container) should have this feature enabled.
> Unfortunately, due to the code confliction between runv&cc,
> this feature was temporary disabled and it is being brought
> back by hyper and intel team.
> 3) How to use and bring up advanced features.
> 
> In current qemu command line, shared memory has
> to be configured via memory-object.
> 
> a) feature: qemu-local-migration, qemu-live-update
> Set the mem-path on the tmpfs and set share=on for it when
> start the vm. example:
> -object \
> memory-backend-file,id=mem,size=128M,mem-path=/dev/shm/memory,share=on \
> -numa node,nodeid=0,cpus=0-7,memdev=mem
> 
> when you want to migrate the vm locally (after fixed a security bug
> of the qemu-binary, or other reason), you can start a new qemu with
> the same command line and -incoming, then you can migrate the
> vm from the old qemu to the new qemu with the migration capability
> 'bypass-shared-memory' set. The migration will migrate the device-state
> *ONLY*, the memory is the origin memory backed by tmpfs file.
> 
> b) feature: extremely-fast-save-restore
> the same above, but the mem-path is on the persistent file system.
> 
> c)  feature: vm-template, vm-fast-live-clone
> the template vm is started as 1), and paused when the guest reaches
> the template point(example: the guest app is ready), then the template
> vm is saved. (the qemu process of the template can be killed now, because
> we need only the memory and the device state files (in tmpfs)).
> 
> Then we can launch one or multiple VMs base on the template vm states,
> the new VMs are started without the “share=on”, all the new VMs share
> the initial memory from the memory file, they save a lot of memory.
> all the new VMs start from the template point, the guest app can go to
> work quickly.
> 
> The new VM booted from template vm can’t become template again,
> if you need this unusual chained-template feature, you can write
> a cloneable-tmpfs kernel module for it.
> 

I've just tried doing something similar with this patch; it's really
interesting. I used LVM snapshotting for the RAM:

cd /dev/shm
fallocate -l 20G backingfile
losetup -f ./backingfile
pvcreate /dev/loop0
vgcreate ram /dev/loop0
lvcreate -L4G -nram1 ram /dev/loop0

qemu -M pc,accel=kvm -m 4G -object memory-backend-file,id=mem,size=4G,mem-path=/dev/ram/ram1,share=on -numa node,memdev=mem -vnc :0 -drive file=my.qcow2,id=d,cache=none -monitor stdio

boot the VM, and do:
migrate_set_capability bypass-shared-memory on
migrate_set_speed 10G
migrate "exec:cat > migstream1"
q

then:
lvcreate -n ramsnap1 -s ram/ram1 -L4G
qemu -M pc,accel=kvm -m 4G -object memory-backend-file,id=mem,size=4G,mem-path=/dev/ram/ramsnap1,share=on -numa node,memdev=mem -vnc :0 -drive file=my.qcow2,id=d,cache=none -monitor stdio -snapshot -incoming "exec:cat migstream1"


lvcreate -n ramsnap2 -s ram/ram1 -L4G
qemu -M pc,accel=kvm -m 4G -object memory-backend-file,id=mem,size=4G,mem-path=/dev/ram/ramsnap2,share=on -numa node,memdev=mem -vnc :1 -drive file=my.qcow2,id=d,cache=none -monitor stdio -snapshot -incoming "exec:cat migstream1"

and I've got two separate instances of qemu restored from that stream.

It seems to work; I wonder whether we ever need things like msync()
or similar?
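
(For what it's worth, a minimal sketch of what an external tool could
do to flush a file-backed mapping before snapshotting it - whether
that is ever needed on tmpfs is exactly the open question, and the
helper below is purely hypothetical:)

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write the dirty pages of a shared file-backed mapping to its file. */
static int flush_ram_file(const char *path, size_t size)
{
    int fd = open(path, O_RDWR);
    void *p;
    int ret;

    if (fd < 0) {
        return -1;
    }
    p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    ret = msync(p, size, MS_SYNC);   /* flush to the backing file */
    munmap(p, size);
    close(fd);
    return ret;
}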

I've not tried creating a 2nd template with this.

> The libvirt toolkit can’t manage vm-template currently, in the
> hyperhq/runv, we use qemu wrapper script to do it. I hope someone add
> “libvrit managed template” feature to libvirt.
> 
> d) feature: yet-another-post-copy-migration
> It is a possible feature, no toolkit can do it well now.
> Using nbd server/client on the memory file is reluctantly Ok but
> inconvenient. A special feature for tmpfs might be needed to
> fully complete this feature.
> No one need yet another post copy migration method,
> but it is possible when some crazy man need it.
> 
> Cc: Juan Quintela <quintela@redhat.com>
> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> Cc: Eric Blake <eblake@redhat.com>
> Cc: Markus Armbruster <armbru@redhat.com>
> Cc: Samuel Ortiz <sameo@linux.intel.com>
> Cc: Sebastien Boeuf <sebastien.boeuf@intel.com>
> Cc: James O. D. Hunt <james.o.hunt@intel.com>
> Cc: Xu Wang <gnawux@gmail.com>
> Cc: Peng Tao <bergwolf@gmail.com>
> Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
> Cc: Xiao Guangrong <xiaoguangrong.eric@gmail.com>
> Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
> ---
> 
> Changes in V5:
>  check cappability conflict in migrate_caps_check()
> 
> Changes in V4:
>  fixes checkpatch.pl errors
> 
> Changes in V3:
>  rebased on upstream master
>  update the available version of the capability to
>  v2.13
> 
> Changes in V2:
>  rebased on 2.11.1
> 
>  migration/migration.c | 22 ++++++++++++++++++++++
>  migration/migration.h |  1 +
>  migration/ram.c       | 27 ++++++++++++++++++---------
>  qapi/migration.json   |  6 +++++-
>  4 files changed, 46 insertions(+), 10 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 52a5092add..110b40f6d4 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -736,6 +736,19 @@ static bool migrate_caps_check(bool *cap_list,
>              return false;
>          }
>  
> +        if (cap_list[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY]) {
> +            /* Bypass and postcopy are quite conflicting ways
> +             * to get memory in the destination.  And there
> +             * is not code to discriminate the differences and
> +             * handle the conflicts currently.  It should be possible
> +             * to fix, but it is generally useless when both ways
> +             * are used together.
> +             */
> +            error_setg(errp, "Bypass is not currently compatible "
> +                       "with postcopy");
> +            return false;
> +        }
> +

Good.

>          /* This check is reasonably expensive, so only when it's being
>           * set the first time, also it's only the destination that needs
>           * special support.
> @@ -1509,6 +1522,15 @@ bool migrate_release_ram(void)
>      return s->enabled_capabilities[MIGRATION_CAPABILITY_RELEASE_RAM];
>  }
>  
> +bool migrate_bypass_shared_memory(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY];
> +}
> +
>  bool migrate_postcopy_ram(void)
>  {
>      MigrationState *s;
> diff --git a/migration/migration.h b/migration/migration.h
> index 8d2f320c48..cfd2513ef0 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -206,6 +206,7 @@ MigrationState *migrate_get_current(void);
>  
>  bool migrate_postcopy(void);
>  
> +bool migrate_bypass_shared_memory(void);
>  bool migrate_release_ram(void);
>  bool migrate_postcopy_ram(void);
>  bool migrate_zero_blocks(void);
> diff --git a/migration/ram.c b/migration/ram.c
> index 0e90efa092..bca170c386 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -780,6 +780,11 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
>      unsigned long *bitmap = rb->bmap;
>      unsigned long next;
>  
> +    /* when this ramblock is requested bypassing */
> +    if (!bitmap) {
> +        return size;
> +    }
> +
>      if (rs->ram_bulk_stage && start > 0) {
>          next = start + 1;
>      } else {
> @@ -850,7 +855,9 @@ static void migration_bitmap_sync(RAMState *rs)
>      qemu_mutex_lock(&rs->bitmap_mutex);
>      rcu_read_lock();
>      RAMBLOCK_FOREACH(block) {
> -        migration_bitmap_sync_range(rs, block, 0, block->used_length);
> +        if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) {
> +            migration_bitmap_sync_range(rs, block, 0, block->used_length);
> +        }
>      }
>      rcu_read_unlock();
>      qemu_mutex_unlock(&rs->bitmap_mutex);
> @@ -2132,18 +2139,12 @@ static int ram_state_init(RAMState **rsp)
>      qemu_mutex_init(&(*rsp)->src_page_req_mutex);
>      QSIMPLEQ_INIT(&(*rsp)->src_page_requests);
>  
> -    /*
> -     * Count the total number of pages used by ram blocks not including any
> -     * gaps due to alignment or unplugs.
> -     */
> -    (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
> -
>      ram_state_reset(*rsp);
>  
>      return 0;
>  }
>  
> -static void ram_list_init_bitmaps(void)
> +static void ram_list_init_bitmaps(RAMState *rs)
>  {
>      RAMBlock *block;
>      unsigned long pages;
> @@ -2151,9 +2152,17 @@ static void ram_list_init_bitmaps(void)
>      /* Skip setting bitmap if there is no RAM */
>      if (ram_bytes_total()) {
>          QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
> +            if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
> +                continue;
> +            }
>              pages = block->max_length >> TARGET_PAGE_BITS;
>              block->bmap = bitmap_new(pages);
>              bitmap_set(block->bmap, 0, pages);
> +            /*
> +             * Count the total number of pages used by ram blocks not
> +             * including any gaps due to alignment or unplugs.
> +             */
> +            rs->migration_dirty_pages += pages;
>              if (migrate_postcopy_ram()) {
>                  block->unsentmap = bitmap_new(pages);
>                  bitmap_set(block->unsentmap, 0, pages);

Can you please rework this to combine with Cédric Le Goater's 
'discard non-migratable RAMBlocks' - it's quite similar to what you're
trying to do but for a different reason;  If you look at the v2 from
April 13, I think you can  just find somewhere to clear the
RAM_MIGRATABLE flag.

One thing I noticed: in my world I've got some code that checks
whether we ever do a RAM iteration and find no dirty blocks while
migration_dirty_pages is still non-zero; and with this patch I'm
seeing that check trigger:

   ram_find_and_save_block: no page found, yet dirty_pages=480

It doesn't seem to trigger without the patch.

Dave

> @@ -2169,7 +2178,7 @@ static void ram_init_bitmaps(RAMState *rs)
>      qemu_mutex_lock_ramlist();
>      rcu_read_lock();
>  
> -    ram_list_init_bitmaps();
> +    ram_list_init_bitmaps(rs);
>      memory_global_dirty_log_start();
>      migration_bitmap_sync(rs);
>  
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 9d0bf82cf4..45326480bd 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -357,13 +357,17 @@
>  # @dirty-bitmaps: If enabled, QEMU will migrate named dirty bitmaps.
>  #                 (since 2.12)
>  #
> +# @bypass-shared-memory: the shared memory region will be bypassed on migration.
> +#          This feature allows the memory region to be reused by new qemu(s)
> +#          or be migrated separately. (since 2.13)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
>             'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
>             'block', 'return-path', 'pause-before-switchover', 'x-multifd',
> -           'dirty-bitmaps' ] }
> +           'dirty-bitmaps', 'bypass-shared-memory' ] }
>  
>  ##
>  # @MigrationCapabilityStatus:
> -- 
> 2.15.1 (Apple Git-101)
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH V5] migration: add capability to bypass the shared memory
Posted by Lai Jiangshan 5 years, 12 months ago
On Fri, Apr 20, 2018 at 12:38 AM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:

>> -static void ram_list_init_bitmaps(void)
>> +static void ram_list_init_bitmaps(RAMState *rs)
>>  {
>>      RAMBlock *block;
>>      unsigned long pages;
>> @@ -2151,9 +2152,17 @@ static void ram_list_init_bitmaps(void)
>>      /* Skip setting bitmap if there is no RAM */
>>      if (ram_bytes_total()) {
>>          QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
>> +            if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
>> +                continue;
>> +            }
>>              pages = block->max_length >> TARGET_PAGE_BITS;
>>              block->bmap = bitmap_new(pages);
>>              bitmap_set(block->bmap, 0, pages);
>> +            /*
>> +             * Count the total number of pages used by ram blocks not
>> +             * including any gaps due to alignment or unplugs.
>> +             */
>> +            rs->migration_dirty_pages += pages;
>>              if (migrate_postcopy_ram()) {
>>                  block->unsentmap = bitmap_new(pages);
>>                  bitmap_set(block->unsentmap, 0, pages);
>
> Can you please rework this to combine with Cédric Le Goater's
> 'discard non-migratable RAMBlocks' - it's quite similar to what you're
> trying to do but for a different reason;  If you look at the v2 from
> April 13, I think you can  just find somewhere to clear the
> RAM_MIGRATABLE flag.

Hello Dave:

It seems we would need to add a new qmp/hmp command to clear/set the
RAM_MIGRATABLE flag, which is overkill for such a simple feature.
Please point it out if there is a simpler way to do so.

Also, this kind of memory is not "un-MIGRATABLE"; the user has just
decided not to migrate it for one particular migration. The blocks
are always MIGRATABLE regardless of whether the user migrates them
or not, so clearing/setting the flag may cause confusion in this
case. What do you think?

Bypassing is an option for every migration. For the same vm
instance, the user might migrate it out multiple times, bypassing
shared memory in some migrations and doing normal migrations at
other times. So it is better for bypassing to be an option or
capability of the migration rather than of the ramblock.
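
For instance, with the existing migrate-set-capabilities QMP command
the capability can be flipped before each individual migration; a
sketch:

{ "execute": "migrate-set-capabilities",
  "arguments": { "capabilities": [
      { "capability": "bypass-shared-memory", "state": true } ] } }

and "state": false again before the next normal migration.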

I don't insist on avoiding using RAM_MIGRATABLE.

Thanks,
Lai

>
> One thing I noticed; in my world I've got some code that checks if we
> ever do a RAM iteration, don't find any dirty blocks but then still have
> migration_dirty_pages being none-0; and with this patch I'm seeing that
> check trigger:
>
>    ram_find_and_save_block: no page found, yet dirty_pages=480
>
> it doesn't seem to trigger without the patch.

Does initializing migration_dirty_pages as you suggested help?

>
> Dave
>
>> @@ -2169,7 +2178,7 @@ static void ram_init_bitmaps(RAMState *rs)
>>      qemu_mutex_lock_ramlist();
>>      rcu_read_lock();
>>
>> -    ram_list_init_bitmaps();
>> +    ram_list_init_bitmaps(rs);
>>      memory_global_dirty_log_start();
>>      migration_bitmap_sync(rs);
>>
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 9d0bf82cf4..45326480bd 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -357,13 +357,17 @@
>>  # @dirty-bitmaps: If enabled, QEMU will migrate named dirty bitmaps.
>>  #                 (since 2.12)
>>  #
>> +# @bypass-shared-memory: the shared memory region will be bypassed on migration.
>> +#          This feature allows the memory region to be reused by new qemu(s)
>> +#          or be migrated separately. (since 2.13)
>> +#
>>  # Since: 1.2
>>  ##
>>  { 'enum': 'MigrationCapability',
>>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
>>             'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
>>             'block', 'return-path', 'pause-before-switchover', 'x-multifd',
>> -           'dirty-bitmaps' ] }
>> +           'dirty-bitmaps', 'bypass-shared-memory' ] }
>>
>>  ##
>>  # @MigrationCapabilityStatus:
>> --
>> 2.15.1 (Apple Git-101)
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH V5] migration: add capability to bypass the shared memory
Posted by Dr. David Alan Gilbert 5 years, 12 months ago
* Lai Jiangshan (jiangshanlai@gmail.com) wrote:
> On Fri, Apr 20, 2018 at 12:38 AM, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> 
> >> -static void ram_list_init_bitmaps(void)
> >> +static void ram_list_init_bitmaps(RAMState *rs)
> >>  {
> >>      RAMBlock *block;
> >>      unsigned long pages;
> >> @@ -2151,9 +2152,17 @@ static void ram_list_init_bitmaps(void)
> >>      /* Skip setting bitmap if there is no RAM */
> >>      if (ram_bytes_total()) {
> >>          QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
> >> +            if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
> >> +                continue;
> >> +            }
> >>              pages = block->max_length >> TARGET_PAGE_BITS;
> >>              block->bmap = bitmap_new(pages);
> >>              bitmap_set(block->bmap, 0, pages);
> >> +            /*
> >> +             * Count the total number of pages used by ram blocks not
> >> +             * including any gaps due to alignment or unplugs.
> >> +             */
> >> +            rs->migration_dirty_pages += pages;
> >>              if (migrate_postcopy_ram()) {
> >>                  block->unsentmap = bitmap_new(pages);
> >>                  bitmap_set(block->unsentmap, 0, pages);
> >
> > Can you please rework this to combine with Cédric Le Goater's
> > 'discard non-migratable RAMBlocks' - it's quite similar to what you're
> > trying to do but for a different reason;  If you look at the v2 from
> > April 13, I think you can  just find somewhere to clear the
> > RAM_MIGRATABLE flag.
> 
> Hello Dave:
> 
> It seems we need to add new qmp/hmp command to clear/add
> RAM_MIGRATABLE flag which is overkill for such a simple feature.
> Please point out if there is any simple way to do so.

I'm fine with you still using a capability to enable/disable it - I
think that part of your patch is fine; but then I think you just
need to check that capability somewhere in Cédric's code, perhaps in
his qemu_ram_is_migratable?

> And this kind of memory is not "un-MIGRATABLE", the user
> just decided not to migrate it/them for one of the migrations.
> But they are always MIGRATABLE regardless the user migrate
> them or not. So clearing/setting the flag may
> cause confusion in this case. What do you think?

The 'RAM_MIGRATABLE' is just an internal name for the flag; it's
not seen by the user;  it's as good a name as any.

> Bypassing is an option for every migration. For the
> same vm instance, the user might migrate it out
> multiple times. He wants to bypass shared memory
> in some migrations and do the normal migrations in
> other times. So it is better that Bypassing is an option
> or capability of migration instead of ramblock.
> 
> I don't insist on avoiding using RAM_MIGRATABLE.

and so it might be best for you not to change the flag, but just
to add a check to qemu_ram_is_migratable.
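
A rough sketch of what that could look like, assuming Cédric's v2
helper tests a RAM_MIGRATABLE flag (illustrative only, not the final
code):

bool qemu_ram_is_migratable(RAMBlock *rb)
{
    /* skip shared blocks for this migration only; the
     * RAM_MIGRATABLE flag itself stays untouched */
    if (migrate_bypass_shared_memory() && qemu_ram_is_shared(rb)) {
        return false;
    }
    return rb->flags & RAM_MIGRATABLE;
}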

> Thanks,
> Lai
> 
> >
> > One thing I noticed; in my world I've got some code that checks if we
> > ever do a RAM iteration, don't find any dirty blocks but then still have
> > migration_dirty_pages being none-0; and with this patch I'm seeing that
> > check trigger:
> >
> >    ram_find_and_save_block: no page found, yet dirty_pages=480
> >
> > it doesn't seem to trigger without the patch.
> 
> Does initializing the migration_dirty_pages as you suggested help?

I've not had a chance to try yet; here is the debug patch I've got:

@@ -1594,6 +1594,13 @@ static int ram_find_and_save_block(RAMState *rs, bool last_stage)
         }
     } while (!pages && again);
 
+    if (!pages && !again && pss.complete_round && rs->migration_dirty_pages)
+    {
+        /* Should make this fail migration ? */
+        fprintf(stderr, "%s: no page found, yet dirty_pages=%"PRIu64"\n",
+                __func__, rs->migration_dirty_pages);
+    }
+
     rs->last_seen_block = pss.block;
     rs->last_page = pss.page;

Dave

> >
> > Dave
> >
> >> @@ -2169,7 +2178,7 @@ static void ram_init_bitmaps(RAMState *rs)
> >>      qemu_mutex_lock_ramlist();
> >>      rcu_read_lock();
> >>
> >> -    ram_list_init_bitmaps();
> >> +    ram_list_init_bitmaps(rs);
> >>      memory_global_dirty_log_start();
> >>      migration_bitmap_sync(rs);
> >>
> >> diff --git a/qapi/migration.json b/qapi/migration.json
> >> index 9d0bf82cf4..45326480bd 100644
> >> --- a/qapi/migration.json
> >> +++ b/qapi/migration.json
> >> @@ -357,13 +357,17 @@
> >>  # @dirty-bitmaps: If enabled, QEMU will migrate named dirty bitmaps.
> >>  #                 (since 2.12)
> >>  #
> >> +# @bypass-shared-memory: the shared memory region will be bypassed on migration.
> >> +#          This feature allows the memory region to be reused by new qemu(s)
> >> +#          or be migrated separately. (since 2.13)
> >> +#
> >>  # Since: 1.2
> >>  ##
> >>  { 'enum': 'MigrationCapability',
> >>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> >>             'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
> >>             'block', 'return-path', 'pause-before-switchover', 'x-multifd',
> >> -           'dirty-bitmaps' ] }
> >> +           'dirty-bitmaps', 'bypass-shared-memory' ] }
> >>
> >>  ##
> >>  # @MigrationCapabilityStatus:
> >> --
> >> 2.15.1 (Apple Git-101)
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH V5] migration: add capability to bypass the shared memory
Posted by Cédric Le Goater 5 years, 12 months ago
On 04/26/2018 09:05 PM, Dr. David Alan Gilbert wrote:
>>> Can you please rework this to combine with Cédric Le Goater's
>>> 'discard non-migratable RAMBlocks' - it's quite similar to what you're
>>> trying to do but for a different reason;  If you look at the v2 from
>>> April 13, I think you can  just find somewhere to clear the
>>> RAM_MIGRATABLE flag.
>> Hello Dave:
>>
>> It seems we need to add new qmp/hmp command to clear/add
>> RAM_MIGRATABLE flag which is overkill for such a simple feature.
>> Please point out if there is any simple way to do so.
> I'm fine with you still using a capability to enable/disable it - I
> think that part of your patch is fine;  but then I think you just
> need to check that capability somewhere in Cedric's code; perhaps in his
> qemu_ram_is_migratable?
> 

I have a v3 for this patch but it only adds an error_report(). Working
on the v2 should be fine.

Thanks,

C.

Re: [Qemu-devel] [PATCH V5] migration: add capability to bypass the shared memory
Posted by Liang Li 5 years, 10 months ago
On Mon, Apr 16, 2018 at 11:00:11PM +0800, Lai Jiangshan wrote:
> 
>  migration/migration.c | 22 ++++++++++++++++++++++
>  migration/migration.h |  1 +
>  migration/ram.c       | 27 ++++++++++++++++++---------
>  qapi/migration.json   |  6 +++++-
>  4 files changed, 46 insertions(+), 10 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 52a5092add..110b40f6d4 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -736,6 +736,19 @@ static bool migrate_caps_check(bool *cap_list,
>              return false;
>          }
>  
> +        if (cap_list[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY]) {
> +            /* Bypass and postcopy are quite conflicting ways
> +             * to get memory in the destination.  And there
> +             * is not code to discriminate the differences and
> +             * handle the conflicts currently.  It should be possible
> +             * to fix, but it is generally useless when both ways
> +             * are used together.
> +             */
> +            error_setg(errp, "Bypass is not currently compatible "
> +                       "with postcopy");
> +            return false;
> +        }
> +
>          /* This check is reasonably expensive, so only when it's being
>           * set the first time, also it's only the destination that needs
>           * special support.
> @@ -1509,6 +1522,15 @@ bool migrate_release_ram(void)
>      return s->enabled_capabilities[MIGRATION_CAPABILITY_RELEASE_RAM];
>  }
>  
> +bool migrate_bypass_shared_memory(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY];
> +}
> +
>  bool migrate_postcopy_ram(void)
>  {
>      MigrationState *s;
> diff --git a/migration/migration.h b/migration/migration.h
> index 8d2f320c48..cfd2513ef0 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -206,6 +206,7 @@ MigrationState *migrate_get_current(void);
>  
>  bool migrate_postcopy(void);
>  
> +bool migrate_bypass_shared_memory(void);
>  bool migrate_release_ram(void);
>  bool migrate_postcopy_ram(void);
>  bool migrate_zero_blocks(void);
> diff --git a/migration/ram.c b/migration/ram.c
> index 0e90efa092..bca170c386 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -780,6 +780,11 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
>      unsigned long *bitmap = rb->bmap;
>      unsigned long next;
>  
> +    /* when this ramblock is requested bypassing */
> +    if (!bitmap) {
> +        return size;
> +    }
> +
>      if (rs->ram_bulk_stage && start > 0) {
>          next = start + 1;
>      } else {
> @@ -850,7 +855,9 @@ static void migration_bitmap_sync(RAMState *rs)
>      qemu_mutex_lock(&rs->bitmap_mutex);
>      rcu_read_lock();
>      RAMBLOCK_FOREACH(block) {
> -        migration_bitmap_sync_range(rs, block, 0, block->used_length);
> +        if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) {
> +            migration_bitmap_sync_range(rs, block, 0, block->used_length);
> +        }
>      }
>      rcu_read_unlock();
>      qemu_mutex_unlock(&rs->bitmap_mutex);
> @@ -2132,18 +2139,12 @@ static int ram_state_init(RAMState **rsp)
>      qemu_mutex_init(&(*rsp)->src_page_req_mutex);
>      QSIMPLEQ_INIT(&(*rsp)->src_page_requests);
>  
> -    /*
> -     * Count the total number of pages used by ram blocks not including any
> -     * gaps due to alignment or unplugs.
> -     */
> -    (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
> -
>      ram_state_reset(*rsp);
>  
>      return 0;
>  }
>  
> -static void ram_list_init_bitmaps(void)
> +static void ram_list_init_bitmaps(RAMState *rs)
>  {
>      RAMBlock *block;
>      unsigned long pages;
> @@ -2151,9 +2152,17 @@ static void ram_list_init_bitmaps(void)
>      /* Skip setting bitmap if there is no RAM */
>      if (ram_bytes_total()) {
>          QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
> +            if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
> +                continue;
> +            }
>              pages = block->max_length >> TARGET_PAGE_BITS;
>              block->bmap = bitmap_new(pages);
>              bitmap_set(block->bmap, 0, pages);
> +            /*
> +             * Count the total number of pages used by ram blocks not
> +             * including any gaps due to alignment or unplugs.
> +             */
> +            rs->migration_dirty_pages += pages;
Hi Jiangshan,

I think you should use 'block->used_length >> TARGET_PAGE_BITS' instead of pages
here.
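
That is, a sketch of that hunk with max_length kept for the bitmap
size but only the used pages counted:

            pages = block->max_length >> TARGET_PAGE_BITS;
            block->bmap = bitmap_new(pages);
            bitmap_set(block->bmap, 0, pages);
            /* count only the pages actually in use, not alignment gaps */
            rs->migration_dirty_pages += block->used_length >> TARGET_PAGE_BITS;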

As I have said before, we should skip dirty logging and the related
operations for the shared memory to speed up the live migration
process; more importantly, skipping dirty logging avoids splitting the
EPT entries from 2M/1G down to 4K when transparent hugepages are used,
and thus avoids performance degradation after migration.

Another thing we should pay attention to is that some virtio devices
may change the vring state when the source qemu process exits; we
found issues with this in previous versions of QEMU, e.g. 2.6.


thanks!
Liang

>              if (migrate_postcopy_ram()) {
>                  block->unsentmap = bitmap_new(pages);
>                  bitmap_set(block->unsentmap, 0, pages);
> @@ -2169,7 +2178,7 @@ static void ram_init_bitmaps(RAMState *rs)
>      qemu_mutex_lock_ramlist();
>      rcu_read_lock();
>  
> -    ram_list_init_bitmaps();
> +    ram_list_init_bitmaps(rs);
>      memory_global_dirty_log_start();
>      migration_bitmap_sync(rs);
>  
> diff --git a/qapi/migration.json b/qapi/migration.json
> index 9d0bf82cf4..45326480bd 100644
> --- a/qapi/migration.json
> +++ b/qapi/migration.json
> @@ -357,13 +357,17 @@
>  # @dirty-bitmaps: If enabled, QEMU will migrate named dirty bitmaps.
>  #                 (since 2.12)
>  #
> +# @bypass-shared-memory: the shared memory region will be bypassed on migration.
> +#          This feature allows the memory region to be reused by new qemu(s)
> +#          or be migrated separately. (since 2.13)
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
>             'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
>             'block', 'return-path', 'pause-before-switchover', 'x-multifd',
> -           'dirty-bitmaps' ] }
> +           'dirty-bitmaps', 'bypass-shared-memory' ] }
>  
>  ##
>  # @MigrationCapabilityStatus:
> -- 
> 2.15.1 (Apple Git-101)
> 
> 

Re: [Qemu-devel] [PATCH V4] migration: add capability to bypass the shared memory
Posted by Lai Jiangshan 6 years ago
On Tue, Apr 10, 2018 at 1:30 AM, Dr. David Alan Gilbert
<dgilbert@redhat.com> wrote:

>>
>> +bool migrate_bypass_shared_memory(void)
>> +{
>> +    MigrationState *s;
>> +
>> +    /* it is not workable with postcopy yet. */
>> +    if (migrate_postcopy_ram()) {
>> +        return false;
>> +    }
>
> Please change this to work in the same way as the check for
> postcopy+compress in migration.c migrate_caps_check.


done in V5.

>
>> +    s = migrate_get_current();
>> +
>> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY];
>> +}
>> +
>>  bool migrate_postcopy_ram(void)
>>  {
>>      MigrationState *s;
>> diff --git a/migration/migration.h b/migration/migration.h
>> index 8d2f320c48..cfd2513ef0 100644
>> --- a/migration/migration.h
>> +++ b/migration/migration.h
>> @@ -206,6 +206,7 @@ MigrationState *migrate_get_current(void);
>>
>>  bool migrate_postcopy(void);
>>
>> +bool migrate_bypass_shared_memory(void);
>>  bool migrate_release_ram(void);
>>  bool migrate_postcopy_ram(void);
>>  bool migrate_zero_blocks(void);
>> diff --git a/migration/ram.c b/migration/ram.c
>> index 0e90efa092..bca170c386 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -780,6 +780,11 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
>>      unsigned long *bitmap = rb->bmap;
>>      unsigned long next;
>>
>> +    /* when this ramblock is requested bypassing */
>> +    if (!bitmap) {
>> +        return size;
>> +    }
>> +
>>      if (rs->ram_bulk_stage && start > 0) {
>>          next = start + 1;
>>      } else {
>> @@ -850,7 +855,9 @@ static void migration_bitmap_sync(RAMState *rs)
>>      qemu_mutex_lock(&rs->bitmap_mutex);
>>      rcu_read_lock();
>>      RAMBLOCK_FOREACH(block) {
>> -        migration_bitmap_sync_range(rs, block, 0, block->used_length);
>> +        if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) {
>> +            migration_bitmap_sync_range(rs, block, 0, block->used_length);
>> +        }
>>      }
>>      rcu_read_unlock();
>>      qemu_mutex_unlock(&rs->bitmap_mutex);
>> @@ -2132,18 +2139,12 @@ static int ram_state_init(RAMState **rsp)
>>      qemu_mutex_init(&(*rsp)->src_page_req_mutex);
>>      QSIMPLEQ_INIT(&(*rsp)->src_page_requests);
>>
>> -    /*
>> -     * Count the total number of pages used by ram blocks not including any
>> -     * gaps due to alignment or unplugs.
>> -     */
>> -    (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
>> -
>>      ram_state_reset(*rsp);
>>
>>      return 0;
>>  }
>>
>> -static void ram_list_init_bitmaps(void)
>> +static void ram_list_init_bitmaps(RAMState *rs)
>>  {
>>      RAMBlock *block;
>>      unsigned long pages;
>> @@ -2151,9 +2152,17 @@ static void ram_list_init_bitmaps(void)
>>      /* Skip setting bitmap if there is no RAM */
>>      if (ram_bytes_total()) {
>
> I think you need to add here a :
>    rs->migration_dirty_pages = 0;

In ram_state_init() we have
    *rsp = g_try_new0(RAMState, 1);
so the state is always zero-initialized.

>
> I don't see anywhere else that initialises it, and there is the case of
> a migration that fails, followed by a 2nd attempt.
>
>>          QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
>> +            if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
>> +                continue;
>> +            }
>>              pages = block->max_length >> TARGET_PAGE_BITS;
>>              block->bmap = bitmap_new(pages);
>>              bitmap_set(block->bmap, 0, pages);
>> +            /*
>> +             * Count the total number of pages used by ram blocks not
>> +             * including any gaps due to alignment or unplugs.
>> +             */
>> +            rs->migration_dirty_pages += pages;
>>              if (migrate_postcopy_ram()) {
>>                  block->unsentmap = bitmap_new(pages);
>>                  bitmap_set(block->unsentmap, 0, pages);
>> @@ -2169,7 +2178,7 @@ static void ram_init_bitmaps(RAMState *rs)
>>      qemu_mutex_lock_ramlist();
>>      rcu_read_lock();
>>
>> -    ram_list_init_bitmaps();
>> +    ram_list_init_bitmaps(rs);
>>      memory_global_dirty_log_start();
>>      migration_bitmap_sync(rs);
>>
>> diff --git a/qapi/migration.json b/qapi/migration.json
>> index 9d0bf82cf4..45326480bd 100644
>> --- a/qapi/migration.json
>> +++ b/qapi/migration.json
>> @@ -357,13 +357,17 @@
>>  # @dirty-bitmaps: If enabled, QEMU will migrate named dirty bitmaps.
>>  #                 (since 2.12)
>>  #
>> +# @bypass-shared-memory: the shared memory region will be bypassed on migration.
>> +#          This feature allows the memory region to be reused by new qemu(s)
>> +#          or be migrated separately. (since 2.13)
>> +#
>>  # Since: 1.2
>>  ##
>>  { 'enum': 'MigrationCapability',
>>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
>>             'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
>>             'block', 'return-path', 'pause-before-switchover', 'x-multifd',
>> -           'dirty-bitmaps' ] }
>> +           'dirty-bitmaps', 'bypass-shared-memory' ] }
>>
>>  ##
>>  # @MigrationCapabilityStatus:
>> --
>> 2.14.3 (Apple Git-98)
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [Qemu-devel] [PATCH V4] migration: add capability to bypass the shared memory
Posted by Dr. David Alan Gilbert 6 years ago
* Lai Jiangshan (jiangshanlai@gmail.com) wrote:
> On Tue, Apr 10, 2018 at 1:30 AM, Dr. David Alan Gilbert
> <dgilbert@redhat.com> wrote:
> 
> >>
> >> +bool migrate_bypass_shared_memory(void)
> >> +{
> >> +    MigrationState *s;
> >> +
> >> +    /* it is not workable with postcopy yet. */
> >> +    if (migrate_postcopy_ram()) {
> >> +        return false;
> >> +    }
> >
> > Please change this to work in the same way as the check for
> > postcopy+compress in migration.c migrate_caps_check.
> 
> 
> done in V5.
> 
> >
> >> +    s = migrate_get_current();
> >> +
> >> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_BYPASS_SHARED_MEMORY];
> >> +}
> >> +
> >>  bool migrate_postcopy_ram(void)
> >>  {
> >>      MigrationState *s;
> >> diff --git a/migration/migration.h b/migration/migration.h
> >> index 8d2f320c48..cfd2513ef0 100644
> >> --- a/migration/migration.h
> >> +++ b/migration/migration.h
> >> @@ -206,6 +206,7 @@ MigrationState *migrate_get_current(void);
> >>
> >>  bool migrate_postcopy(void);
> >>
> >> +bool migrate_bypass_shared_memory(void);
> >>  bool migrate_release_ram(void);
> >>  bool migrate_postcopy_ram(void);
> >>  bool migrate_zero_blocks(void);
> >> diff --git a/migration/ram.c b/migration/ram.c
> >> index 0e90efa092..bca170c386 100644
> >> --- a/migration/ram.c
> >> +++ b/migration/ram.c
> >> @@ -780,6 +780,11 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
> >>      unsigned long *bitmap = rb->bmap;
> >>      unsigned long next;
> >>
> >> +    /* when this ramblock is requested bypassing */
> >> +    if (!bitmap) {
> >> +        return size;
> >> +    }
> >> +
> >>      if (rs->ram_bulk_stage && start > 0) {
> >>          next = start + 1;
> >>      } else {
> >> @@ -850,7 +855,9 @@ static void migration_bitmap_sync(RAMState *rs)
> >>      qemu_mutex_lock(&rs->bitmap_mutex);
> >>      rcu_read_lock();
> >>      RAMBLOCK_FOREACH(block) {
> >> -        migration_bitmap_sync_range(rs, block, 0, block->used_length);
> >> +        if (!migrate_bypass_shared_memory() || !qemu_ram_is_shared(block)) {
> >> +            migration_bitmap_sync_range(rs, block, 0, block->used_length);
> >> +        }
> >>      }
> >>      rcu_read_unlock();
> >>      qemu_mutex_unlock(&rs->bitmap_mutex);
> >> @@ -2132,18 +2139,12 @@ static int ram_state_init(RAMState **rsp)
> >>      qemu_mutex_init(&(*rsp)->src_page_req_mutex);
> >>      QSIMPLEQ_INIT(&(*rsp)->src_page_requests);
> >>
> >> -    /*
> >> -     * Count the total number of pages used by ram blocks not including any
> >> -     * gaps due to alignment or unplugs.
> >> -     */
> >> -    (*rsp)->migration_dirty_pages = ram_bytes_total() >> TARGET_PAGE_BITS;
> >> -
> >>      ram_state_reset(*rsp);
> >>
> >>      return 0;
> >>  }
> >>
> >> -static void ram_list_init_bitmaps(void)
> >> +static void ram_list_init_bitmaps(RAMState *rs)
> >>  {
> >>      RAMBlock *block;
> >>      unsigned long pages;
> >> @@ -2151,9 +2152,17 @@ static void ram_list_init_bitmaps(void)
> >>      /* Skip setting bitmap if there is no RAM */
> >>      if (ram_bytes_total()) {
> >
> > I think you need to add here a :
> >    rs->migration_dirty_pages = 0;
> 
> In ram_state_init(),
> *rsp = g_try_new0(RAMState, 1);
> so the state is always reset.

Ah, you're right.

Dave

> >
> > I don't see anywhere else that initialises it, and there is the case of
> > a migration that fails, followed by a 2nd attempt.
> >
> >>          QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
> >> +            if (migrate_bypass_shared_memory() && qemu_ram_is_shared(block)) {
> >> +                continue;
> >> +            }
> >>              pages = block->max_length >> TARGET_PAGE_BITS;
> >>              block->bmap = bitmap_new(pages);
> >>              bitmap_set(block->bmap, 0, pages);
> >> +            /*
> >> +             * Count the total number of pages used by ram blocks not
> >> +             * including any gaps due to alignment or unplugs.
> >> +             */
> >> +            rs->migration_dirty_pages += pages;
> >>              if (migrate_postcopy_ram()) {
> >>                  block->unsentmap = bitmap_new(pages);
> >>                  bitmap_set(block->unsentmap, 0, pages);
> >> @@ -2169,7 +2178,7 @@ static void ram_init_bitmaps(RAMState *rs)
> >>      qemu_mutex_lock_ramlist();
> >>      rcu_read_lock();
> >>
> >> -    ram_list_init_bitmaps();
> >> +    ram_list_init_bitmaps(rs);
> >>      memory_global_dirty_log_start();
> >>      migration_bitmap_sync(rs);
> >>
> >> diff --git a/qapi/migration.json b/qapi/migration.json
> >> index 9d0bf82cf4..45326480bd 100644
> >> --- a/qapi/migration.json
> >> +++ b/qapi/migration.json
> >> @@ -357,13 +357,17 @@
> >>  # @dirty-bitmaps: If enabled, QEMU will migrate named dirty bitmaps.
> >>  #                 (since 2.12)
> >>  #
> >> +# @bypass-shared-memory: the shared memory region will be bypassed on migration.
> >> +#          This feature allows the memory region to be reused by new qemu(s)
> >> +#          or be migrated separately. (since 2.13)
> >> +#
> >>  # Since: 1.2
> >>  ##
> >>  { 'enum': 'MigrationCapability',
> >>    'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
> >>             'compress', 'events', 'postcopy-ram', 'x-colo', 'release-ram',
> >>             'block', 'return-path', 'pause-before-switchover', 'x-multifd',
> >> -           'dirty-bitmaps' ] }
> >> +           'dirty-bitmaps', 'bypass-shared-memory' ] }
> >>
> >>  ##
> >>  # @MigrationCapabilityStatus:
> >> --
> >> 2.14.3 (Apple Git-98)
> >>
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK