[PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Daniel P. Berrangé 3 years, 9 months ago
A followup to:

 v1: https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg00866.html

When QMP was first introduced some 10+ years ago now, the snapshot
related commands (savevm/loadvm/delvm) were not converted. This was
primarily because their implementation causes blocking of the thread
running the monitor commands. This was (and still is) considered
undesirable behaviour both in HMP and QMP.

In theory someone was supposed to fix this flaw at some point in the
past 10 years and bring them into the QMP world. Sadly, thus far it
hasn't happened as people always had more important things to work
on. Enterprise apps were much more interested in external snapshots
than internal snapshots as they have many more features.

Meanwhile users still want to use internal snapshots as there is
a certain simplicity in having everything self-contained in one
image, even though it has limitations. Thus apps end up
executing savevm/loadvm/delvm via the "human-monitor-command"
QMP command.

IOW, the problematic blocking behaviour that was one of the reasons
for not having savevm/loadvm/delvm in QMP is experienced by applications
regardless. By not porting the commands to QMP due to one design flaw,
we've forced apps and users to suffer from other design flaws of HMP
(bad error reporting, no strict type checking of args, no introspection) for
an additional 10 years. This feels rather sub-optimal :-(

In practice users don't appear to care strongly about the fact that these
commands block the VM while they run. I might have seen one bug report
about it, but it certainly isn't something that comes up as a frequent
topic except among us QEMU maintainers. Users do care about having
access to the snapshot feature.

Where I am seeing frequent complaints is wrt the use of OVMF combined
with snapshots, which has some serious pain points. This is getting worse
as the push to ditch legacy BIOS in favour of UEFI gains momentum both
across OS vendors and mgmt apps. Solving it requires new parameters to
the commands, but doing this in HMP is super unappealing.

After 10 years, I think it is time for us to be a little pragmatic about
our handling of snapshots commands. My desire is that libvirt should never
use "human-monitor-command" under any circumstances, because of the
inherent flaws in HMP as a protocol for machine consumption.

Thus in this series I'm proposing a fairly direct mapping of the existing
HMP commands for savevm/loadvm/delvm into QMP as a first step. This does
not solve the blocking thread problem, but it does put in place a
design using the jobs framework which can facilitate solving it later.
It does also solve the error reporting, type checking and introspection
problems inherent to HMP. So we're winning on 3 out of the 4 problems,
and pushing apps towards a QMP design that will let us solve the last
remaining problem.
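
To illustrate roughly what this looks like on the wire (the command and
event names are those used in this series; the full argument set is shown
further below and may still change), a client starts the operation as a
job and then tracks it like any other background job:

  -> { "execute": "snapshot-save",
       "arguments": { "job-id": "snapsave0", "tag": "my-snap" } }
  <- { "return": {} }
  <- { "event": "JOB_STATUS_CHANGE",
       "data": { "id": "snapsave0", "status": "created" } }
     ...
  <- { "event": "JOB_STATUS_CHANGE",
       "data": { "id": "snapsave0", "status": "concluded" } }
  -> { "execute": "query-jobs" }
  <- { "return": [ { "id": "snapsave0", "type": "snapshot-save",
                     "status": "concluded" } ] }
  -> { "execute": "job-dismiss", "arguments": { "id": "snapsave0" } }
  <- { "return": {} }

(Timestamps and some reply fields are omitted for brevity.)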

With a QMP variant, we can reasonably deal with the problems related to OVMF:

 - The logic to pick which disk to store the vmstate in is not
   satisfactory.

   The first block driver state cannot be assumed to be the root disk
   image, it might be OVMF varstore and we don't want to store vmstate
   in there.

 - The logic to decide which disks must be snapshotted is hardwired
   to all disks which are writable

   Again with OVMF there might be a writable varstore, but this can be
   raw rather than qcow2 format, and thus unable to be snapshotted.
   While users might wish to snapshot their varstore, in some/many/most
   cases it is entirely unnecessary. Users are nonetheless blocked from
   snapshotting their VM due to this varstore.

These are solved by adding two parameters to the commands. The first is
a block device node name that identifies the image to store vmstate in,
and the second is a list of node names to include for the snapshots.
If the list of nodes isn't given, it falls back to the historical
behaviour of using all disks matching some undocumented criteria.
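
For example, a save and a matching delete might look something like this
(node names are illustrative; the argument names follow this series and
may still change):

  -> { "execute": "snapshot-save",
       "arguments": { "job-id": "save0",
                      "tag": "snap1",
                      "vmstate": "disk0",
                      "devices": [ "disk0", "disk1" ] } }
  <- { "return": {} }

  -> { "execute": "snapshot-delete",
       "arguments": { "job-id": "delete0",
                      "tag": "snap1",
                      "devices": [ "disk0", "disk1" ] } }
  <- { "return": {} }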

In the block code I've only dealt with node names for block devices, as
IIUC, this is all that libvirt should need in the -blockdev world it now
lives in. IOW, I've made no attempt to cope with people wanting to use
these QMP commands in combination with -drive args.

I've done some minimal work in libvirt to start to make use of the new
commands to validate their functionality, but this isn't finished yet.

My ultimate goal is to make the GNOME Boxes maintainer happy again by
having internal snapshots work with OVMF:

  https://gitlab.gnome.org/GNOME/gnome-boxes/-/commit/c486da262f6566326fbcb5ef45c5f64048f16a6e

HELP NEEDED:  this series starts to implement the approach that Kevin
suggested wrt the use of generic jobs.

When I try to actually run the code though it crashes:

ERROR:/home/berrange/src/virt/qemu/softmmu/cpus.c:1788:qemu_mutex_unlock_iothread: assertion failed: (qemu_mutex_iothread_locked())
Bail out! ERROR:/home/berrange/src/virt/qemu/softmmu/cpus.c:1788:qemu_mutex_unlock_iothread: assertion failed: (qemu_mutex_iothread_locked())

Obviously I've missed something related to locking, but I've no idea
what, so I'm sending this v2 simply as a way to solicit suggestions
for what I've messed up.

You can reproduce this with the I/O tests using "check -qcow2 310", which
gave this stack:

  Thread 5 (Thread 0x7fffe6e4c700 (LWP 3399011)):
  #0  futex_wait_cancelable (private=0, expected=0, futex_word=0x5555566a9fd8) at ../sysdeps/nptl/futex-internal.h:183
  #1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x555556227160 <qemu_global_mutex>, cond=0x5555566a9fb0) at pthread_cond_wait.c:508
  #2  __pthread_cond_wait (cond=cond@entry=0x5555566a9fb0, mutex=mutex@entry=0x555556227160 <qemu_global_mutex>) at pthread_cond_wait.c:638
  #3  0x0000555555ceb6cb in qemu_cond_wait_impl (cond=0x5555566a9fb0, mutex=0x555556227160 <qemu_global_mutex>, file=0x555555d44198 "/home/berrange/src/virt/qemu/softmmu/cpus.c", line=1145) at /home/berrange/src/virt/qemu/util/qemu-thread-posix.c:174
  #4  0x0000555555931974 in qemu_wait_io_event (cpu=cpu@entry=0x555556685050) at /home/berrange/src/virt/qemu/softmmu/cpus.c:1145
  #5  0x0000555555933a89 in qemu_dummy_cpu_thread_fn (arg=arg@entry=0x555556685050) at /home/berrange/src/virt/qemu/softmmu/cpus.c:1241
  #6  0x0000555555ceb049 in qemu_thread_start (args=0x7fffe6e476f0) at /home/berrange/src/virt/qemu/util/qemu-thread-posix.c:521
  #7  0x00007ffff4fdc432 in start_thread (arg=<optimized out>) at pthread_create.c:477
  #8  0x00007ffff4f0a9d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
  
  Thread 4 (Thread 0x7fffe764d700 (LWP 3399010)):
  #0  0x00007ffff4effb6f in __GI___poll (fds=0x7fffdc006ec0, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
  #1  0x00007ffff7c1aace in g_main_context_iterate.constprop () at /lib64/libglib-2.0.so.0
  #2  0x00007ffff7c1ae53 in g_main_loop_run () at /lib64/libglib-2.0.so.0
  #3  0x00005555559a9d81 in iothread_run (opaque=opaque@entry=0x55555632f200) at /home/berrange/src/virt/qemu/iothread.c:82
  #4  0x0000555555ceb049 in qemu_thread_start (args=0x7fffe76486f0) at /home/berrange/src/virt/qemu/util/qemu-thread-posix.c:521
  #5  0x00007ffff4fdc432 in start_thread (arg=<optimized out>) at pthread_create.c:477
  #6  0x00007ffff4f0a9d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
  
  Thread 3 (Thread 0x7fffe7e4e700 (LWP 3399009)):
  #0  0x00007ffff4fe5c58 in futex_abstimed_wait_cancelable (private=0, abstime=0x7fffe7e49650, clockid=0, expected=0, futex_word=0x5555562bf888) at ../sysdeps/nptl/futex-internal.h:320
  #1  do_futex_wait (sem=sem@entry=0x5555562bf888, abstime=abstime@entry=0x7fffe7e49650, clockid=0) at sem_waitcommon.c:112
  #2  0x00007ffff4fe5d83 in __new_sem_wait_slow (sem=sem@entry=0x5555562bf888, abstime=abstime@entry=0x7fffe7e49650, clockid=0) at sem_waitcommon.c:184
  #3  0x00007ffff4fe5e12 in sem_timedwait (sem=sem@entry=0x5555562bf888, abstime=abstime@entry=0x7fffe7e49650) at sem_timedwait.c:40
  #4  0x0000555555cebbdf in qemu_sem_timedwait (sem=sem@entry=0x5555562bf888, ms=ms@entry=10000) at /home/berrange/src/virt/qemu/util/qemu-thread-posix.c:307
  #5  0x0000555555d03fa4 in worker_thread (opaque=opaque@entry=0x5555562bf810) at /home/berrange/src/virt/qemu/util/thread-pool.c:91
  #6  0x0000555555ceb049 in qemu_thread_start (args=0x7fffe7e496f0) at /home/berrange/src/virt/qemu/util/qemu-thread-posix.c:521
  #7  0x00007ffff4fdc432 in start_thread (arg=<optimized out>) at pthread_create.c:477
  #8  0x00007ffff4f0a9d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
  
  Thread 2 (Thread 0x7fffe8750700 (LWP 3399008)):
  #0  0x00007ffff4ed1871 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7fffe874b670, rem=0x7fffe874b680) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
  #1  0x00007ffff4ed71c7 in __GI___nanosleep (requested_time=<optimized out>, remaining=<optimized out>) at nanosleep.c:27
  #2  0x00007ffff7c462f7 in g_usleep () at /lib64/libglib-2.0.so.0
  #3  0x0000555555cf3fd0 in call_rcu_thread (opaque=opaque@entry=0x0) at /home/berrange/src/virt/qemu/util/rcu.c:250
  #4  0x0000555555ceb049 in qemu_thread_start (args=0x7fffe874b6f0) at /home/berrange/src/virt/qemu/util/qemu-thread-posix.c:521
  #5  0x00007ffff4fdc432 in start_thread (arg=<optimized out>) at pthread_create.c:477
  #6  0x00007ffff4f0a9d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
  
  Thread 1 (Thread 0x7fffe88abec0 (LWP 3398996)):
  #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
  #1  0x00007ffff4e2e895 in __GI_abort () at abort.c:79
  #2  0x00007ffff7be5b8c in g_assertion_message_expr.cold () at /lib64/libglib-2.0.so.0
  #3  0x00007ffff7c43a1f in g_assertion_message_expr () at /lib64/libglib-2.0.so.0
  #4  0x0000555555932da0 in qemu_mutex_unlock_iothread () at /home/berrange/src/virt/qemu/softmmu/cpus.c:1788
  #5  qemu_mutex_unlock_iothread () at /home/berrange/src/virt/qemu/softmmu/cpus.c:1786
  #6  0x0000555555cfeceb in os_host_main_loop_wait (timeout=26359275747000) at /home/berrange/src/virt/qemu/util/main-loop.c:232
  #7  main_loop_wait (nonblocking=nonblocking@entry=0) at /home/berrange/src/virt/qemu/util/main-loop.c:516
  #8  0x0000555555941f27 in qemu_main_loop () at /home/berrange/src/virt/qemu/softmmu/vl.c:1676
  #9  0x000055555581a18e in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /home/berrange/src/virt/qemu/softmmu/main.c:49
  (gdb) 


Changed in v2:

 - Use new command names "snapshot-{load,save,delete}" to make it
   clear that these are different from the "savevm|loadvm|delvm"
   as they use the Job framework

 - Use an include list for block devs, not an exclude list

Daniel P. Berrangé (6):
  migration: improve error reporting of block driver state name
  block: push error reporting into bdrv_all_*_snapshot functions
  migration: stop returning errno from load_snapshot()
  block: add ability to specify list of blockdevs during snapshot
  block: allow specifying name of block device for vmstate storage
  migration: introduce snapshot-{save,load,delete} QMP commands

 block/monitor/block-hmp-cmds.c |   7 +-
 block/snapshot.c               | 167 +++++++++++++++++---------
 include/block/snapshot.h       |  19 +--
 include/migration/snapshot.h   |  10 +-
 migration/savevm.c             | 210 +++++++++++++++++++++++++++------
 monitor/hmp-cmds.c             |  11 +-
 qapi/job.json                  |   9 +-
 qapi/migration.json            | 112 ++++++++++++++++++
 replay/replay-snapshot.c       |   4 +-
 softmmu/vl.c                   |   2 +-
 tests/qemu-iotests/267.out     |  14 +--
 tests/qemu-iotests/310         | 125 ++++++++++++++++++++
 tests/qemu-iotests/310.out     |   0
 tests/qemu-iotests/group       |   1 +
 14 files changed, 562 insertions(+), 129 deletions(-)
 create mode 100755 tests/qemu-iotests/310
 create mode 100644 tests/qemu-iotests/310.out

-- 
2.26.2



Re: [PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Daniel P. Berrangé 3 years, 8 months ago
On Mon, Jul 27, 2020 at 04:08:37PM +0100, Daniel P. Berrangé wrote:
> A followup to:
> 
>  v1: https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg00866.html

snip

> HELP NEEDED:  this series starts to implement the approach that Kevin
> suggested wrt the use of generic jobs.
> 
> When I try to actually run the code though it crashes:
> 
> ERROR:/home/berrange/src/virt/qemu/softmmu/cpus.c:1788:qemu_mutex_unlock_iothread: assertion failed: (qemu_mutex_iothread_locked())
> Bail out! ERROR:/home/berrange/src/virt/qemu/softmmu/cpus.c:1788:qemu_mutex_unlock_iothread: assertion failed: (qemu_mutex_iothread_locked())
> 
> Obviously I've missed something related to locking, but I've no idea
> what, so I'm sending this v2 simply as a way to solicit suggestions
> for what I've messed up.

What I've found is

qmp_snapshot_save() is the QMP handler and runs in the main thread, so the
iothread lock is held.

This calls job_create(), which ends up invoking snapshot_save_job_run()
in a background coroutine, but IIUC the iothread lock is still held when
the coroutine starts.

This then invokes save_snapshot(), which invokes qemu_savevm_state().

This calls qemu_mutex_unlock_iothread() and then
qemu_savevm_state_setup().

Eventually something in the qcow2 code triggers qemu_coroutine_yield()
so control goes back to the main event loop thread.


The problem is that the iothread lock has been released, but the main
event loop thread is still expecting it to be held.

I've no idea how to go about solving this problem.


The save_snapshot() code, as written today, needs to run serialized with
everything else, but because the job framework has used a coroutine to
run it, we can switch back to the main event thread at any time.

I don't know how to force save_snapshot() to be serialized when using
the generic job framework.


> 
> You can reproduce with I/O tests using "check -qcow2 310"  and it
> gave a stack:
> 
>   Thread 5 (Thread 0x7fffe6e4c700 (LWP 3399011)):
>   #0  futex_wait_cancelable (private=0, expected=0, futex_word=0x5555566a9fd8) at ../sysdeps/nptl/futex-internal.h:183
>   #1  __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x555556227160 <qemu_global_mutex>, cond=0x5555566a9fb0) at pthread_cond_wait.c:508
>   #2  __pthread_cond_wait (cond=cond@entry=0x5555566a9fb0, mutex=mutex@entry=0x555556227160 <qemu_global_mutex>) at pthread_cond_wait.c:638
>   #3  0x0000555555ceb6cb in qemu_cond_wait_impl (cond=0x5555566a9fb0, mutex=0x555556227160 <qemu_global_mutex>, file=0x555555d44198 "/home/berrange/src/virt/qemu/softmmu/cpus.c", line=1145) at /home/berrange/src/virt/qemu/util/qemu-thread-posix.c:174
>   #4  0x0000555555931974 in qemu_wait_io_event (cpu=cpu@entry=0x555556685050) at /home/berrange/src/virt/qemu/softmmu/cpus.c:1145
>   #5  0x0000555555933a89 in qemu_dummy_cpu_thread_fn (arg=arg@entry=0x555556685050) at /home/berrange/src/virt/qemu/softmmu/cpus.c:1241
>   #6  0x0000555555ceb049 in qemu_thread_start (args=0x7fffe6e476f0) at /home/berrange/src/virt/qemu/util/qemu-thread-posix.c:521
>   #7  0x00007ffff4fdc432 in start_thread (arg=<optimized out>) at pthread_create.c:477
>   #8  0x00007ffff4f0a9d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>   
>   Thread 4 (Thread 0x7fffe764d700 (LWP 3399010)):
>   #0  0x00007ffff4effb6f in __GI___poll (fds=0x7fffdc006ec0, nfds=3, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
>   #1  0x00007ffff7c1aace in g_main_context_iterate.constprop () at /lib64/libglib-2.0.so.0
>   #2  0x00007ffff7c1ae53 in g_main_loop_run () at /lib64/libglib-2.0.so.0
>   #3  0x00005555559a9d81 in iothread_run (opaque=opaque@entry=0x55555632f200) at /home/berrange/src/virt/qemu/iothread.c:82
>   #4  0x0000555555ceb049 in qemu_thread_start (args=0x7fffe76486f0) at /home/berrange/src/virt/qemu/util/qemu-thread-posix.c:521
>   #5  0x00007ffff4fdc432 in start_thread (arg=<optimized out>) at pthread_create.c:477
>   #6  0x00007ffff4f0a9d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>   
>   Thread 3 (Thread 0x7fffe7e4e700 (LWP 3399009)):
>   #0  0x00007ffff4fe5c58 in futex_abstimed_wait_cancelable (private=0, abstime=0x7fffe7e49650, clockid=0, expected=0, futex_word=0x5555562bf888) at ../sysdeps/nptl/futex-internal.h:320
>   #1  do_futex_wait (sem=sem@entry=0x5555562bf888, abstime=abstime@entry=0x7fffe7e49650, clockid=0) at sem_waitcommon.c:112
>   #2  0x00007ffff4fe5d83 in __new_sem_wait_slow (sem=sem@entry=0x5555562bf888, abstime=abstime@entry=0x7fffe7e49650, clockid=0) at sem_waitcommon.c:184
>   #3  0x00007ffff4fe5e12 in sem_timedwait (sem=sem@entry=0x5555562bf888, abstime=abstime@entry=0x7fffe7e49650) at sem_timedwait.c:40
>   #4  0x0000555555cebbdf in qemu_sem_timedwait (sem=sem@entry=0x5555562bf888, ms=ms@entry=10000) at /home/berrange/src/virt/qemu/util/qemu-thread-posix.c:307
>   #5  0x0000555555d03fa4 in worker_thread (opaque=opaque@entry=0x5555562bf810) at /home/berrange/src/virt/qemu/util/thread-pool.c:91
>   #6  0x0000555555ceb049 in qemu_thread_start (args=0x7fffe7e496f0) at /home/berrange/src/virt/qemu/util/qemu-thread-posix.c:521
>   #7  0x00007ffff4fdc432 in start_thread (arg=<optimized out>) at pthread_create.c:477
>   #8  0x00007ffff4f0a9d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>   
>   Thread 2 (Thread 0x7fffe8750700 (LWP 3399008)):
>   #0  0x00007ffff4ed1871 in __GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0, req=0x7fffe874b670, rem=0x7fffe874b680) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:48
>   #1  0x00007ffff4ed71c7 in __GI___nanosleep (requested_time=<optimized out>, remaining=<optimized out>) at nanosleep.c:27
>   #2  0x00007ffff7c462f7 in g_usleep () at /lib64/libglib-2.0.so.0
>   #3  0x0000555555cf3fd0 in call_rcu_thread (opaque=opaque@entry=0x0) at /home/berrange/src/virt/qemu/util/rcu.c:250
>   #4  0x0000555555ceb049 in qemu_thread_start (args=0x7fffe874b6f0) at /home/berrange/src/virt/qemu/util/qemu-thread-posix.c:521
>   #5  0x00007ffff4fdc432 in start_thread (arg=<optimized out>) at pthread_create.c:477
>   #6  0x00007ffff4f0a9d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
>   
>   Thread 1 (Thread 0x7fffe88abec0 (LWP 3398996)):
>   #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
>   #1  0x00007ffff4e2e895 in __GI_abort () at abort.c:79
>   #2  0x00007ffff7be5b8c in g_assertion_message_expr.cold () at /lib64/libglib-2.0.so.0
>   #3  0x00007ffff7c43a1f in g_assertion_message_expr () at /lib64/libglib-2.0.so.0
>   #4  0x0000555555932da0 in qemu_mutex_unlock_iothread () at /home/berrange/src/virt/qemu/softmmu/cpus.c:1788
>   #5  qemu_mutex_unlock_iothread () at /home/berrange/src/virt/qemu/softmmu/cpus.c:1786
>   #6  0x0000555555cfeceb in os_host_main_loop_wait (timeout=26359275747000) at /home/berrange/src/virt/qemu/util/main-loop.c:232
>   #7  main_loop_wait (nonblocking=nonblocking@entry=0) at /home/berrange/src/virt/qemu/util/main-loop.c:516
>   #8  0x0000555555941f27 in qemu_main_loop () at /home/berrange/src/virt/qemu/softmmu/vl.c:1676
>   #9  0x000055555581a18e in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at /home/berrange/src/virt/qemu/softmmu/main.c:49
>   (gdb)

The coroutine trace to go with this is

(gdb) qemu coroutine 0x5555575b86b0
#0  qemu_coroutine_switch (from_=<optimized out>, to_=<optimized out>, action=action@entry=COROUTINE_YIELD)
    at /home/berrange/src/virt/qemu/util/coroutine-ucontext.c:303
#1  0x0000555555cfbccd in qemu_coroutine_yield () at /home/berrange/src/virt/qemu/util/qemu-coroutine.c:193
#2  0x0000555555d00d49 in thread_pool_submit_co (pool=0x5555562b8810, func=func@entry=0x555555c3a2d0 <handle_aiocb_rw>, arg=arg@entry=0x7fffe61ff200)
    at /home/berrange/src/virt/qemu/util/thread-pool.c:288
#3  0x0000555555c3c4db in raw_thread_pool_submit (arg=0x7fffe61ff200, func=0x555555c3a2d0 <handle_aiocb_rw>, bs=0x55555653c2d0)
    at /home/berrange/src/virt/qemu/block/file-posix.c:1975
#4  raw_co_prw (bs=0x55555653c2d0, offset=131072, bytes=65536, qiov=0x7fffe61ff560, type=1) at /home/berrange/src/virt/qemu/block/file-posix.c:2022
#5  0x0000555555c43131 in bdrv_driver_preadv (bs=0x55555653c2d0, offset=131072, bytes=65536, qiov=0x7fffe61ff560, qiov_offset=<optimized out>, flags=0)
    at /home/berrange/src/virt/qemu/block/io.c:1139
#6  0x0000555555c47484 in bdrv_aligned_preadv
    (req=req@entry=0x7fffe61ff440, offset=131072, bytes=65536, align=<optimized out>, qiov=0x7fffe61ff560, qiov_offset=0, flags=0, child=0x5555562b2a80, child=<optimized out>) at /home/berrange/src/virt/qemu/block/io.c:1515
#7  0x0000555555c48955 in bdrv_co_preadv_part
    (child=0x5555562b2a80, offset=<optimized out>, bytes=<optimized out>, qiov=<optimized out>, qiov_offset=<optimized out>, flags=0)
    at /home/berrange/src/virt/qemu/block/io.c:1757
#8  0x0000555555c48ef4 in bdrv_run_co (opaque=0x7fffe61ff540, entry=0x555555c48a60 <bdrv_rw_co_entry>, bs=0x55555653c2d0)
    at /home/berrange/src/virt/qemu/block/io.c:915
#9  bdrv_prwv_co (flags=0, is_write=false, qiov=0x7fffe61ff560, offset=131072, child=<optimized out>) at /home/berrange/src/virt/qemu/block/io.c:966
#10 bdrv_preadv (qiov=0x7fffe61ff560, offset=131072, child=<optimized out>) at /home/berrange/src/virt/qemu/block/io.c:1024
#11 bdrv_pread (child=<optimized out>, offset=offset@entry=131072, buf=<optimized out>, bytes=<optimized out>) at /home/berrange/src/virt/qemu/block/io.c:1041
#12 0x0000555555c1fa2a in qcow2_cache_do_get (bs=0x555556542e00, c=0x55555654b130, offset=131072, table=0x7fffe61ff630, read_from_disk=<optimized out>)
    at /home/berrange/src/virt/qemu/block/qcow2-cache.c:49
#13 0x0000555555c14206 in qcow2_get_refcount (bs=bs@entry=0x555556542e00, cluster_index=0, refcount=refcount@entry=0x7fffe61ff670)
    at /home/berrange/src/virt/qemu/block/qcow2-refcount.c:271
#14 0x0000555555c1450d in alloc_clusters_noref (bs=bs@entry=0x555556542e00, size=140737054242416, size@entry=24, max=max@entry=72057594037927935)
    at /home/berrange/src/virt/qemu/block/qcow2-refcount.c:981
#15 0x0000555555c15534 in qcow2_alloc_clusters (bs=bs@entry=0x555556542e00, size=size@entry=24) at /home/berrange/src/virt/qemu/block/qcow2-refcount.c:1013
#16 0x0000555555c1a1f3 in qcow2_grow_l1_table (bs=0x555556542e00, min_size=min_size@entry=3, exact_size=exact_size@entry=false)
    at /home/berrange/src/virt/qemu/block/qcow2-cluster.c:139
#17 0x0000555555c1a983 in get_cluster_table
    (bs=bs@entry=0x555556542e00, offset=offset@entry=1073741824, new_l2_slice=new_l2_slice@entry=0x7fffe61ff8b0, new_l2_index=new_l2_index@entry=0x7fffe61ff8a8)
    at /home/berrange/src/virt/qemu/block/qcow2-cluster.c:688
#18 0x0000555555c1bee9 in handle_copied
    (m=0x7fffe61ff968, bytes=<synthetic pointer>, host_offset=<synthetic pointer>, guest_offset=1073741824, bs=0x555556542e00)
    at /home/berrange/src/virt/qemu/block/qcow2-cluster.c:1185
#19 qcow2_alloc_cluster_offset
    (bs=bs@entry=0x555556542e00, offset=offset@entry=1073741824, bytes=bytes@entry=0x7fffe61ff95c, host_offset=host_offset@entry=0x7fffe61ff960, m=m@entry=0x7fffe61ff968) at /home/berrange/src/virt/qemu/block/qcow2-cluster.c:1570
#20 0x0000555555c0adbc in qcow2_co_pwritev_part (bs=0x555556542e00, offset=1073741824, bytes=281, qiov=0x7fffe61ffa30, qiov_offset=0, flags=<optimized out>)
    at /home/berrange/src/virt/qemu/block/qcow2.c:2577
#21 0x0000555555c439f1 in bdrv_co_rw_vmstate (bs=0x555556542e00, qiov=<optimized out>, pos=<optimized out>, is_read=<optimized out>)
    at /home/berrange/src/virt/qemu/block/io.c:2660
#22 0x0000555555c45ecc in bdrv_co_rw_vmstate_entry (opaque=0x7fffe61ff9f0) at /home/berrange/src/virt/qemu/block/io.c:2674
#23 bdrv_run_co (opaque=0x7fffe61ff9f0, entry=0x555555c43a50 <bdrv_co_rw_vmstate_entry>, bs=0x555556542e00) at /home/berrange/src/virt/qemu/block/io.c:915
#24 bdrv_rw_vmstate (is_read=false, pos=0, qiov=0x7fffe61ff9e0, bs=0x555556542e00) at /home/berrange/src/virt/qemu/block/io.c:2688
#25 bdrv_writev_vmstate (bs=bs@entry=0x555556542e00, qiov=qiov@entry=0x7fffe61ffa30, pos=pos@entry=0) at /home/berrange/src/virt/qemu/block/io.c:2707
#26 0x0000555555b43ac8 in block_writev_buffer (opaque=0x555556542e00, iov=<optimized out>, iovcnt=<optimized out>, pos=0, errp=<optimized out>)
    at /home/berrange/src/virt/qemu/migration/savevm.c:139
#27 0x0000555555b4e698 in qemu_fflush (f=f@entry=0x5555566ae980) at /home/berrange/src/virt/qemu/migration/qemu-file.c:232
#28 0x000055555586bdeb in ram_save_setup (f=0x5555566ae980, opaque=<optimized out>) at /home/berrange/src/virt/qemu/migration/ram.c:2513
#29 0x0000555555b45bd1 in qemu_savevm_state_setup (f=f@entry=0x5555566ae980) at /home/berrange/src/virt/qemu/migration/savevm.c:1178
#30 0x0000555555b48713 in qemu_savevm_state (errp=0x5555575ba280, f=0x5555566ae980) at /home/berrange/src/virt/qemu/migration/savevm.c:1535
#31 save_snapshot (name=<optimized out>, vmstate=<optimized out>, devices=0x555556e50ba0, errp=errp@entry=0x5555575ba280)
    at /home/berrange/src/virt/qemu/migration/savevm.c:2744
#32 0x0000555555b48a8b in snapshot_save_job_run (job=0x5555575ba200, errp=0x5555575ba280) at /home/berrange/src/virt/qemu/migration/savevm.c:3016
#33 0x0000555555bef632 in job_co_entry (opaque=0x5555575ba200) at /home/berrange/src/virt/qemu/job.c:908
#34 0x0000555555cf74a3 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at /home/berrange/src/virt/qemu/util/coroutine-ucontext.c:173
#35 0x00007ffff4e5a250 in __start_context () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91
#36 0x00007fffffffc6a0 in  ()
#37 0x0000000000000000 in  ()


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Markus Armbruster 3 years, 8 months ago
Sorry for taking so long to reply.

Daniel P. Berrangé <berrange@redhat.com> writes:

> A followup to:
>
>  v1: https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg00866.html
>
> When QMP was first introduced some 10+ years ago now, the snapshot
> related commands (savevm/loadvm/delvm) were not converted. This was
> primarily because their implementation causes blocking of the thread
> running the monitor commands. This was (and still is) considered
> undesirable behaviour both in HMP and QMP.

One of several reasons.

> In theory someone was supposed to fix this flaw at some point in the
> past 10 years and bring them into the QMP world. Sadly, thus far it
> hasn't happened as people always had more important things to work
> on. Enterprise apps were much more interested in external snapshots
> than internal snapshots as they have many more features.

Several attempts have been made to bring the functionality to QMP.
Sadly, they went nowhere.

I posted an analysis of the issues in reply to one of the more serious
attempts:

    Message-ID: <87lh7l783q.fsf@blackfin.pond.sub.org>
    https://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg03593.html

I'd like to hear your take on it.  I append the relevant part for your
convenience.  Perhaps your code is already close to what I describe
there.  I'm interested in where it falls short.

> Meanwhile users still want to use internal snapshots as there is
> a certain simplicity in having everything self-contained in one
> image, even though it has limitations. Thus apps end up
> executing savevm/loadvm/delvm via the "human-monitor-command"
> QMP command.
>
> IOW, the problematic blocking behaviour that was one of the reasons
> for not having savevm/loadvm/delvm in QMP is experienced by applications
> regardless. By not porting the commands to QMP due to one design flaw,
> we've forced apps and users to suffer from other design flaws of HMP
> (bad error reporting, no strict type checking of args, no introspection) for
> an additional 10 years. This feels rather sub-optimal :-(
>
> In practice users don't appear to care strongly about the fact that these
> commands block the VM while they run. I might have seen one bug report
> about it, but it certainly isn't something that comes up as a frequent
> topic except among us QEMU maintainers. Users do care about having
> access to the snapshot feature.
>
> Where I am seeing frequent complaints is wrt the use of OVMF combined
> with snapshots which has some serious pain points. This is getting worse
> as the push to ditch legacy BIOS in favour of UEFI gains momentum both
> across OS vendors and mgmt apps. Solving it requires new parameters to
> the commands, but doing this in HMP is super unappealing.
>
> After 10 years, I think it is time for us to be a little pragmatic about
> our handling of snapshots commands. My desire is that libvirt should never
> use "human-monitor-command" under any circumstances, because of the
> inherent flaws in HMP as a protocol for machine consumption.
>
> Thus in this series I'm proposing a fairly direct mapping of the existing
> HMP commands for savevm/loadvm/delvm into QMP as a first step. This does
> not solve the blocking thread problem, but it does put in place a
> design using the jobs framework which can facilitate solving it later.
> It does also solve the error reporting, type checking and introspection
> problems inherent to HMP. So we're winning on 3 out of the 4 problems,
> and pushing apps towards a QMP design that will let us solve the last
> remaining problem.
>
> With a QMP variant, we can reasonably deal with the problems related to OVMF:
>
>  - The logic to pick which disk to store the vmstate in is not
>    satisfactory.
>
>    The first block driver state cannot be assumed to be the root disk
>    image, it might be OVMF varstore and we don't want to store vmstate
>    in there.

Yes, this is one of the issues.  Glad you're addressing it.

>  - The logic to decide which disks must be snapshotted is hardwired
>    to all disks which are writable
>
>    Again with OVMF there might be a writable varstore, but this can be
>    raw rather than qcow2 format, and thus unable to be snapshotted.
>    While users might wish to snapshot their varstore, in some/many/most
>    cases it is entirely unnecessary. Users are nonetheless blocked from
>    snapshotting their VM due to this varstore.

Another one.  Glad again.

> These are solved by adding two parameters to the commands. The first is
> a block device node name that identifies the image to store vmstate in,
> and the second is a list of node names to include for the snapshots.
> If the list of nodes isn't given, it falls back to the historical
> behaviour of using all disks matching some undocumented criteria.
>
> In the block code I've only dealt with node names for block devices, as
> IIUC, this is all that libvirt should need in the -blockdev world it now
> lives in. IOW, I've made no attempt to cope with people wanting to use
> these QMP commands in combination with -drive args.
>
> I've done some minimal work in libvirt to start to make use of the new
> commands to validate their functionality, but this isn't finished yet.
>
> My ultimate goal is to make the GNOME Boxes maintainer happy again by
> having internal snapshots work with OVMF:
>
>   https://gitlab.gnome.org/GNOME/gnome-boxes/-/commit/c486da262f6566326fbcb5ef45c5f64048f16a6e
>
> HELP NEEDED:  this series starts to implement the approach that Kevin
> suggested wrt the use of generic jobs.

Does this mean you're trying to use the jobs infrastructure?

> When I try to actually run the code though it crashes:
>
> ERROR:/home/berrange/src/virt/qemu/softmmu/cpus.c:1788:qemu_mutex_unlock_iothread: assertion failed: (qemu_mutex_iothread_locked())
> Bail out! ERROR:/home/berrange/src/virt/qemu/softmmu/cpus.c:1788:qemu_mutex_unlock_iothread: assertion failed: (qemu_mutex_iothread_locked())
>
> Obviously I've missed something related to locking, but I've no idea
> what, so I'm sending this v2 simply as a way to solicit suggestions
> for what I've messed up.
>
> You can reproduce with I/O tests using "check -qcow2 310"  and it
> gave a stack:
[...]

Relevant part of Message-ID: <87lh7l783q.fsf@blackfin.pond.sub.org>

If we can't make a sane QMP interface, I'd rather have no QMP interface.
However, I believe we *can* make a sane QMP interface if we put in the
design work.

The design work must start with a review of what we're trying to
accomplish, and how to fit it into the rest of the system.  Here's my
attempt.  Since my knowledge on snapshots is rather superficial, I'm
cc'ing Kevin for additional snapshot expertise.  Kevin, please correct
me when I talk nonsense.  I'm further cc'ing Eric and Peter for the
management layer perspective.

A point-in-time snapshot of a system consists of a snapshot of its
machine state and snapshots of its storage.  All the snapshots need to
be made at the same point in time for the result to be consistent.

Snapshots of read-only storage carry no information and are commonly
omitted.

Isolated storage snapshots can make sense, but snapshotting the machine
state without also snapshotting the machine's storage doesn't sound
useful to me.

Both storage and machine state snapshots come in two flavours: internal
and external.

External ones can be made with any block backend, but internal storage
snapshots work only with certain formats, notably qcow2.  QMP supports
both kinds of storage snapshots.

Both kinds of storage snapshots need exclusive access while they work.
They're relatively quick, but the delay could be noticeable for large
internal snapshots, and perhaps for external snapshots on really slow
storage.

Internal machine state snapshots are currently only available via HMP's
savevm, which integrates internal machine state and storage snapshots.
This is non-live, i.e. the guest is stopped while the snapshot gets
saved.  I figure we could make it live if we really wanted to.  Another
instance of the emerging background job concept.

On the implementation level, QCOW2 can't currently store a machine state
snapshot without also storing a storage snapshot.  I guess we could
change this if we really wanted to.

External machine state snapshots are basically migrate to file.
Supported by QMP.

Live migration to file is possible, but currently wastes space, because
memory dirtied during migration gets saved multiple times.  Could be
fixed either by making migration update previously saved memory instead
of appending (beware, random I/O), or by compacting the file afterwards.

Non-live migration to file doesn't waste space that way.

To take multiple *consistent* snapshots, you have to bundle them up in a
transaction.  Transactions currently support only *storage* snapshots,
though.

Work-around for external machine state snapshot: migrate to file
(possibly live), leaving the guest stopped on completion, take storage
snapshots, resume guest.
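
Roughly, with today's commands (file and device names made up):

  -> { "execute": "migrate",
       "arguments": { "uri": "exec:cat > /tmp/vmstate.bin" } }
  <- { "return": {} }
     (wait for the migration to reach "completed"; the guest stays
      paused)
  -> { "execute": "blockdev-snapshot-internal-sync",
       "arguments": { "device": "disk0", "name": "snap1" } }
  <- { "return": {} }
  -> { "execute": "cont" }
  <- { "return": {} }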

You can combine internal and external storage snapshots with an external
machine state snapshot to get a mixed system snapshot.

You currently can't do that with an internal machine state snapshot: the
only way to take one is HMP savevm, which insists on internally
snapshotting all writable storage, and doesn't transact together with
external storage snapshots.

Except for the case "purely internal snapshot with just one writable
storage device", a system snapshot consists of multiple parts stored in
separate files.  Tying the parts together is a management problem.  QEMU
provides rudimentary management for purely internal snapshots, but it's
flawed: missing storage isn't detected, and additional storage can creep
in if snapshot tags or IDs happen to match.  I guess managing the parts
is better left to the management layer.

I figure a fully general QMP interface would let you perform a system
snapshot by combining storage snapshots of either kind with either kind
of machine state snapshot.

We already have most of the building blocks: we can take external and
internal storage snapshots, and combine them in transactions.
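
For example, a single transaction can already pair an internal snapshot
of one device with an external snapshot of another (names made up):

  -> { "execute": "transaction",
       "arguments": { "actions": [
         { "type": "blockdev-snapshot-internal-sync",
           "data": { "device": "disk0", "name": "snap1" } },
         { "type": "blockdev-snapshot-sync",
           "data": { "device": "disk1",
                     "snapshot-file": "/tmp/disk1-snap1.qcow2",
                     "format": "qcow2" } } ] } }
  <- { "return": {} }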

What's missing is transactionable machine state snapshots.

We know how to work around it for external machine state snapshots (see
above), but a transaction-based solution might be nicer.

Any solution for internal machine state snapshots in QMP should at least
try to fit into this.  Some restrictions are okay.  For instance, we
probably need to restrict internal machine state snapshots to piggyback
on an internal storage snapshot for now, so we don't have to dig up
QCOW2 just to get QMP support.

We can talk about more convenient interfaces for common special cases,
but I feel we need to design for the general case.  We don't have to
implement the completely general case right away, though.  As long as we
know where we want to go, incremental steps towards that goal are fine.

Can we create a transactionable QMP command to take an internal machine
state snapshot?

This would be like HMP savevm with the following differences:

* Separate parameters for tag and ID.  I'll have none of this
  overloading nonsense in QMP.

* Specify the destination block device.  I'll have none of this "pick a
  device in some magic, undocumented way" in QMP.

* Leave alone the other block devices.  Adding transaction actions to
  snapshot all the writable block devices to get a full system snapshot
  is the user's responsibility.

Limitations:

* No live internal machine snapshot, yet.

* The storage device taking the internal snapshot must also be
  internally snapshot for now.  In fact, the command does both
  (tolerable wart).

Open questions:

* Do we want the QMP command to delete existing snapshots with
  conflicting tag / ID, like HMP savevm does?  Or do we want it to fail
  the transaction?

* Do we need transactions for switching to a system snapshot, too?

Opinions?


Re: [PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Daniel P. Berrangé 3 years, 8 months ago
On Wed, Aug 26, 2020 at 05:52:06PM +0200, Markus Armbruster wrote:
> Sorry for taking so long to reply.
> 
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > A followup to:
> >
> >  v1: https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg00866.html
> >
> > When QMP was first introduced some 10+ years ago now, the snapshot
> > related commands (savevm/loadvm/delvm) were not converted. This was
> > primarily because their implementation causes blocking of the thread
> > running the monitor commands. This was (and still is) considered
> > undesirable behaviour both in HMP and QMP.
> 
> One of several reasons.
> 
> > In theory someone was supposed to fix this flaw at some point in the
> > past 10 years and bring them into the QMP world. Sadly, thus far it
> > hasn't happened as people always had more important things to work
> > on. Enterprise apps were much more interested in external snapshots
> > than internal snapshots as they have many more features.
> 
> Several attempts have been made to bring the functionality to QMP.
> Sadly, they went nowhere.
> 
> I posted an analysis of the issues in reply to one of the more serious
> attempts:
> 
>     Message-ID: <87lh7l783q.fsf@blackfin.pond.sub.org>
>     https://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg03593.html
> 
> I'd like to hear your take on it.  I append the relevant part for your
> convenience.  Perhaps your code is already close to what I describe
> there.  I'm interested in where it falls short.
> 
> > Meanwhile users still want to use internal snapshots as there is
> > a certain simplicity in having everything self-contained in one
> > image, even though it has limitations. Thus apps end up
> > executing savevm/loadvm/delvm via the "human-monitor-command"
> > QMP command.
> >
> > IOW, the problematic blocking behaviour that was one of the reasons
> > for not having savevm/loadvm/delvm in QMP is experienced by applications
> > regardless. By not porting the commands to QMP due to one design flaw,
> > we've forced apps and users to suffer from other design flaws of HMP
> > (bad error reporting, no strict type checking of args, no introspection) for
> > an additional 10 years. This feels rather sub-optimal :-(
> >
> > In practice users don't appear to care strongly about the fact that these
> > commands block the VM while they run. I might have seen one bug report
> > about it, but it certainly isn't something that comes up as a frequent
> > topic except among us QEMU maintainers. Users do care about having
> > access to the snapshot feature.
> >
> > Where I am seeing frequent complaints is wrt the use of OVMF combined
> > with snapshots which has some serious pain points. This is getting worse
> > as the push to ditch legacy BIOS in favour of UEFI gains momentum both
> > across OS vendors and mgmt apps. Solving it requires new parameters to
> > the commands, but doing this in HMP is super unappealing.
> >
> > After 10 years, I think it is time for us to be a little pragmatic about
> > our handling of snapshots commands. My desire is that libvirt should never
> > use "human-monitor-command" under any circumstances, because of the
> > inherent flaws in HMP as a protocol for machine consumption.
> >
> > Thus in this series I'm proposing a fairly direct mapping of the existing
> > HMP commands for savevm/loadvm/delvm into QMP as a first step. This does
> > not solve the blocking thread problem, but it does put in place a
> > design using the jobs framework which can facilitate solving it later.
> > It does also solve the error reporting, type checking and introspection
> > problems inherent to HMP. So we're winning on 3 out of the 4 problems,
> > and pushing apps towards a QMP design that will let us solve the last
> > remaining problem.
> >
> > With a QMP variant, we can reasonably deal with the problems related to OVMF:
> >
> >  - The logic to pick which disk to store the vmstate in is not
> >    satisfactory.
> >
> >    The first block driver state cannot be assumed to be the root disk
> >    image, it might be OVMF varstore and we don't want to store vmstate
> >    in there.
> 
> Yes, this is one of the issues.  Glad you're addressing it.
> 
> >  - The logic to decide which disks must be snapshotted is hardwired
> >    to all disks which are writable
> >
> >    Again with OVMF there might be a writable varstore, but this can be
> >    raw rather than qcow2 format, and thus unable to be snapshotted.
> >    While users might wish to snapshot their varstore, in some/many/most
> >    cases it is entirely unnecessary. Users are nonetheless blocked from
> >    snapshotting their VM due to this varstore.
> 
> Another one.  Glad again.
> 
> > These are solved by adding two parameters to the commands. The first is
> > a block device node name that identifies the image to store vmstate in,
> > and the second is a list of node names to include for the snapshots.
> > If the list of nodes isn't given, it falls back to the historical
> > behaviour of using all disks matching some undocumented criteria.
> >
> > In the block code I've only dealt with node names for block devices, as
> > IIUC, this is all that libvirt should need in the -blockdev world it now
> > lives in. IOW, I've made no attempt to cope with people wanting to use
> > these QMP commands in combination with -drive args.
> >
> > I've done some minimal work in libvirt to start to make use of the new
> > commands to validate their functionality, but this isn't finished yet.
> >
> > My ultimate goal is to make the GNOME Boxes maintainer happy again by
> > having internal snapshots work with OVMF:
> >
> >   https://gitlab.gnome.org/GNOME/gnome-boxes/-/commit/c486da262f6566326fbcb5ef45c5f64048f16a6e
> >
> > HELP NEEDED:  this series starts to implement the approach that Kevin
> > suggested wrt the use of generic jobs.
> 
> Does this mean you're trying to use the jobs infrastructure?

Yes, this is working now.


> Relevant part of Message-ID: <87lh7l783q.fsf@blackfin.pond.sub.org>
> 
> If we can't make a sane QMP interface, I'd rather have no QMP interface.

I strongly disagree with this. This absolutist position is why we
have made zero progress in 10+ years, leaving mgmt apps suffering
with HMP passthrough, as described above. 

This is a prime example of perfect being the enemy of good.

> However, I believe we *can* make a sane QMP interface if we put in the
> design work.

We should make a credible attempt at QMP design, but if perfection
isn't practical, we should none the less do *something* in QMP, even
if we find we need to deprecate & replace it later.

> The design work must start with a review of what we're trying to
> accomplish, and how to fit it into the rest of the system.  Here's my
> attempt.  Since my knowledge on snapshots is rather superficial, I'm
> cc'ing Kevin for additional snapshot expertise.  Kevin, please correct
> me when I talk nonsense.  I'm further cc'ing Eric and Peter for the
> management layer perspective.

The things I'm trying to accomplish are listed in the text above.
Primarily this is about fixing the ability to snapshot guests where
QEMU's heuristics for picking block devices are broken, i.e. explicitly
listing the disks to snapshot and which one to store vmstate in.

Converting from HMP to QMP is essentially an enabler, because adding
new args to existing savevm/loadvm HMP commands is just too horrible
to contemplate, as mgmt apps have no sane way to probe HMP. That's
what QMP is for.

Essentially the goal is to fix the inability to use internal snapshots
with UEFI based VMs, which is causing immediate pain for mgmt apps due
to the increased need to support UEFI.
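
Concretely, for a UEFI guest with a qcow2 root disk and a raw varstore,
the mgmt app should be able to simply leave the varstore out of the
snapshot and point the vmstate at the root disk, along these lines (node
names made up, argument names as proposed in this series):

  -> { "execute": "snapshot-save",
       "arguments": { "job-id": "save0",
                      "tag": "snap1",
                      "vmstate": "rootdisk",
                      "devices": [ "rootdisk" ] } }
  <- { "return": {} }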

> A point-in-time snapshot of a system consists of a snapshot of its
> machine state and snapshots of its storage.  All the snapshots need to
> be made at the same point in time for the result to be consistent.
> 
> Snapshots of read-only storage carry no information and are commonly
> omitted.
> 
> Isolated storage snapshots can make sense, but snapshotting the machine
> state without also snapshotting the machine's storage doesn't sound
> useful to me.
> 
> Both storage and machine state snapshots come in two flavours: internal
> and external.
> 
> External ones can be made with any block backend, but internal storage
> snapshots work only with certain formats, notably qcow2.  QMP supports
> both kinds of storage snapshots.
> 
> Both kinds of storage snapshots need exclusive access while they work.
> They're relatively quick, but the delay could be noticeable for large
> internal snapshots, and perhaps for external snapshots on really slow
> storage.
> 
> Internal machine state snapshots are currently only available via HMP's
> savevm, which integrates internal machine state and storage snapshots.
> This is non-live, i.e. the guest is stopped while the snapshot gets
> saved.  I figure we could make it live if we really wanted to.  Another
> instance of the emerging background job concept.
> 
> On the implementation level, QCOW2 can't currently store a machine state
> snapshot without also storing a storage snapshot.  I guess we could
> change this if we really wanted to.
> 
> External machine state snapshots are basically migrate to file.
> Supported by QMP.
> 
> Live migration to file is possible, but currently wastes space, because
> memory dirtied during migration gets saved multiple times.  Could be
> fixed either by making migration update previously saved memory instead
> of appending (beware, random I/O), or by compacting the file afterwards.
> 
> Non-live migration to file doesn't waste space that way.
> 
> To take multiple *consistent* snapshots, you have to bundle them up in a
> transaction.  Transactions currently support only *storage* snapshots,
> though.
> 
> Work-around for external machine state snapshot: migrate to file
> (possibly live), leaving the guest stopped on completion, take storage
> snapshots, resume guest.
> 
> You can combine internal and external storage snapshots with an external
> machine state snapshot to get a mixed system snapshot.
> 
> You currently can't do that with an internal machine state snapshot: the
> only way to take one is HMP savevm, which insists on internally
> snapshotting all writable storage, and doesn't transact together with
> external storage snapshots.
> 
> Except for the case "purely internal snapshot with just one writable
> storage device", a system snapshot consists of multiple parts stored in
> separate files.  Tying the parts together is a management problem.  QEMU
> provides rudimentary management for purely internal snapshots, but it's
> flawed: missing storage isn't detected, and additional storage can creep
> in if snapshot tags or IDs happen to match.  I guess managing the parts
> is better left to the management layer.
> 
> I figure a fully general QMP interface would let you perform a system
> snapshot by combining storage snapshots of either kind with either kind
> of machine state snapshot.
> 
> We already have most of the building blocks: we can take external and
> internal storage snapshots, and combine them in transactions.
> 
> What's missing is transactionable machine state snapshots.
> 
> We know how to work around it for external machine state snapshots (see
> above), but a transaction-based solution might be nicer.
> 
> Any solution for internal machine state snapshots in QMP should at least
> try to fit into this.  Some restrictions are okay.  For instance, we
> probably need to restrict internal machine state snapshots to piggyback
> on an internal storage snapshot for now, so we don't have to dig up
> QCOW2 just to get QMP support.

From the POV of practicality, making a design that unifies internal
and external snapshots is something I'm considering out of scope.
It increases the design time burden, as well as implementation burden.
On my side, improving internal snapshots is a "spare time" project,
not something I can justify spending weeks or months on.

My goal is to implement something achievable in a short amount of
time that gets us out of the hole we've been in for 10 years, with
minimal refactoring of the internal snapshot code aside from fixing
the critical limitations we have today around the choice of disks
to snapshot.

If someone later wants to come up with a grand unified design
for everything, that's fine; we can deprecate the new QMP commands
I'm proposing now.

> We can talk about more convenient interfaces for common special cases,
> but I feel we need to design for the general case.  We don't have to
> implement the completely general case right away, though.  As long as we
> know where we want to go, incremental steps towards that goal are fine.
> 
> Can we create a transactionable QMP command to take an internal machine
> state snapshot?
> 
> This would be like HMP savevm with the following differences:
> 
> * Separate parameters for tag and ID.  I'll have none of this
>   overloading nonsense in QMP.
> 
> * Specify the destination block device.  I'll have none of this "pick a
>   device in some magic, undocumented way" in QMP.
> 
> * Leave alone the other block devices.  Adding transaction actions to
>   snapshot all the writable block devices to get a full system snapshot
>   is the user's responsibility.
> 
> Limitations:
> 
> * No live internal machine snapshot, yet.
> 
> * The storage device taking the internal snapshot must also be
>   internally snapshot for now.  In fact, the command does both
>   (tolerable wart).
> 
> Open questions:
> 
> * Do we want the QMP command to delete existing snapshots with
>   conflicting tag / ID, like HMP savevm does?  Or do we want it to fail
>   the transaction?

The intent is for the QMP commands to operate exclusively on
'tags', and never consider "ID".

> * Do we need transactions for switching to a system snapshot, too?
> 
> Opinions?



Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Daniel P. Berrangé 3 years, 8 months ago
On Wed, Aug 26, 2020 at 07:28:24PM +0100, Daniel P. Berrangé wrote:
> On Wed, Aug 26, 2020 at 05:52:06PM +0200, Markus Armbruster wrote:
> > Open questions:
> > 
> > * Do we want the QMP command to delete existing snapshots with
> >   conflicting tag / ID, like HMP savevm does?  Or do we want it to fail
> >   the transaction?
> 
> The intent is for the QMP commands to operate exclusively on
> 'tags', and never consider "ID".

I forgot that even HMP ignores "ID" now and works exclusively in terms
of tags since:


  commit 6ca080453ea403959ccde661030ca16264acc181
  Author: Daniel Henrique Barboza <danielhb413@gmail.com>
  Date:   Wed Nov 7 11:09:58 2018 -0200

    block/snapshot.c: eliminate use of ID input in snapshot operations
    

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Markus Armbruster 3 years, 8 months ago
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Wed, Aug 26, 2020 at 07:28:24PM +0100, Daniel P. Berrangé wrote:
>> On Wed, Aug 26, 2020 at 05:52:06PM +0200, Markus Armbruster wrote:
>> > Open questions:
>> > 
>> > * Do we want the QMP command to delete existing snapshots with
>> >   conflicting tag / ID, like HMP savevm does?  Or do we want it to fail
>> >   the transaction?
>> 
>> The intent is for the QMP commands to operate exclusively on
>> 'tags', and never consider "ID".
>
> I forgot that even HMP ignores "ID" now and works exclusively in terms
> of tags since:
>
>
>   commit 6ca080453ea403959ccde661030ca16264acc181
>   Author: Daniel Henrique Barboza <danielhb413@gmail.com>
>   Date:   Wed Nov 7 11:09:58 2018 -0200
>
>     block/snapshot.c: eliminate use of ID input in snapshot operations

Almost a year after I sent the memo I quoted.  It's an incompatible
change, but nobody complained, and I'm glad we got this issue out of the
way.


Re: [PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Kevin Wolf 3 years, 8 months ago
On 27.08.2020 at 13:06, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Wed, Aug 26, 2020 at 07:28:24PM +0100, Daniel P. Berrangé wrote:
> >> On Wed, Aug 26, 2020 at 05:52:06PM +0200, Markus Armbruster wrote:
> >> > Open questions:
> >> > 
> >> > * Do we want the QMP command to delete existing snapshots with
> >> >   conflicting tag / ID, like HMP savevm does?  Or do we want it to fail
> >> >   the transaction?
> >> 
> >> The intent is for the QMP commands to operate exclusively on
> >> 'tags', and never consider "ID".
> >
> > I forgot that even HMP ignores "ID" now and works exclusively in terms
> > of tags since:
> >
> >
> >   commit 6ca080453ea403959ccde661030ca16264acc181
> >   Author: Daniel Henrique Barboza <danielhb413@gmail.com>
> >   Date:   Wed Nov 7 11:09:58 2018 -0200
> >
> >     block/snapshot.c: eliminate use of ID input in snapshot operations
> 
> Almost a year after I sent the memo I quoted.  It's an incompatible
> change, but nobody complained, and I'm glad we got this issue out of the
> way.

FWIW, I would have ignored any complaint about incompatible changes in
HMP. It's not supposed to be a stable API, but UI.

Kevin


Re: [PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Markus Armbruster 3 years, 8 months ago
Kevin Wolf <kwolf@redhat.com> writes:

> Am 27.08.2020 um 13:06 hat Markus Armbruster geschrieben:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > On Wed, Aug 26, 2020 at 07:28:24PM +0100, Daniel P. Berrangé wrote:
>> >> On Wed, Aug 26, 2020 at 05:52:06PM +0200, Markus Armbruster wrote:
>> >> > Open questions:
>> >> > 
>> >> > * Do we want the QMP command to delete existing snapshots with
>> >> >   conflicting tag / ID, like HMP savevm does?  Or do we want it to fail
>> >> >   the transaction?
>> >> 
>> >> The intent is for the QMP commands to operate exclusively on
>> >> 'tags', and never consider "ID".
>> >
>> > I forgot that even HMP ignores "ID" now and works exclusively in terms
>> > of tags since:
>> >
>> >
>> >   commit 6ca080453ea403959ccde661030ca16264acc181
>> >   Author: Daniel Henrique Barboza <danielhb413@gmail.com>
>> >   Date:   Wed Nov 7 11:09:58 2018 -0200
>> >
>> >     block/snapshot.c: eliminate use of ID input in snapshot operations
>> 
>> Almost a year after I sent the memo I quoted.  It's an incompatible
>> change, but nobody complained, and I'm glad we got this issue out of the
>> way.
>
> FWIW, I would have ignored any complaint about incompatible changes in
> HMP. It's not supposed to be a stable API, but UI.

The iffy part is actually the loss of ability to access snapshots that
lack a name.  Complaints about that would have been valid, I think.
Fortunately, there have been none.


Re: [PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Kevin Wolf 3 years, 8 months ago
Am 28.08.2020 um 08:20 hat Markus Armbruster geschrieben:
> Kevin Wolf <kwolf@redhat.com> writes:
> 
> > Am 27.08.2020 um 13:06 hat Markus Armbruster geschrieben:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> >> 
> >> > On Wed, Aug 26, 2020 at 07:28:24PM +0100, Daniel P. Berrangé wrote:
> >> >> On Wed, Aug 26, 2020 at 05:52:06PM +0200, Markus Armbruster wrote:
> >> >> > Open questions:
> >> >> > 
> >> >> > * Do we want the QMP command to delete existing snapshots with
> >> >> >   conflicting tag / ID, like HMP savevm does?  Or do we want it to fail
> >> >> >   the transaction?
> >> >> 
> >> >> The intent is for the QMP commands to operate exclusively on
> >> >> 'tags', and never consider "ID".
> >> >
> >> > I forgot that even HMP ignores "ID" now and works exclusively in terms
> >> > of tags since:
> >> >
> >> >
> >> >   commit 6ca080453ea403959ccde661030ca16264acc181
> >> >   Author: Daniel Henrique Barboza <danielhb413@gmail.com>
> >> >   Date:   Wed Nov 7 11:09:58 2018 -0200
> >> >
> >> >     block/snapshot.c: eliminate use of ID input in snapshot operations
> >> 
> >> Almost a year after I sent the memo I quoted.  It's an incompatible
> >> change, but nobody complained, and I'm glad we got this issue out of the
> >> way.
> >
> > FWIW, I would have ignored any complaint about incompatible changes in
> > HMP. It's not supposed to be a stable API, but UI.
> 
> The iffy part is actually the loss of ability to access snapshots that
> lack a name.  Complaints about that would have been valid, I think.
> Fortunately, there have been none.

'loadvm ""' should do the trick for these. The same way as you have to
use it with 'savevm' to create them in non-prehistoric versions of QEMU.
We stopped creating snapshots with empty names by default in 0.14, so
they are probably not very relevant any more. (Versioned machine types
go back "only" to 1.0, so good luck loading a snapshot from an older
version. And I wouldn't bet money either on a 1.0 snapshot still working
with that machine type...)

Kevin


Re: [PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Markus Armbruster 3 years, 8 months ago
Kevin Wolf <kwolf@redhat.com> writes:

> Am 28.08.2020 um 08:20 hat Markus Armbruster geschrieben:
>> Kevin Wolf <kwolf@redhat.com> writes:
>> 
>> > Am 27.08.2020 um 13:06 hat Markus Armbruster geschrieben:
>> >> Daniel P. Berrangé <berrange@redhat.com> writes:
>> >> 
>> >> > On Wed, Aug 26, 2020 at 07:28:24PM +0100, Daniel P. Berrangé wrote:
>> >> >> On Wed, Aug 26, 2020 at 05:52:06PM +0200, Markus Armbruster wrote:
>> >> >> > Open questions:
>> >> >> > 
>> >> >> > * Do we want the QMP command to delete existing snapshots with
>> >> >> >   conflicting tag / ID, like HMP savevm does?  Or do we want it to fail
>> >> >> >   the transaction?
>> >> >> 
>> >> >> The intent is for the QMP commands to operate exclusively on
>> >> >> 'tags', and never consider "ID".
>> >> >
>> >> > I forgot that even HMP ignores "ID" now and works exclusively in terms
>> >> > of tags since:
>> >> >
>> >> >
>> >> >   commit 6ca080453ea403959ccde661030ca16264acc181
>> >> >   Author: Daniel Henrique Barboza <danielhb413@gmail.com>
>> >> >   Date:   Wed Nov 7 11:09:58 2018 -0200
>> >> >
>> >> >     block/snapshot.c: eliminate use of ID input in snapshot operations
>> >> 
>> >> Almost a year after I sent the memo I quoted.  It's an incompatible
>> >> change, but nobody complained, and I'm glad we got this issue out of the
>> >> way.
>> >
>> > FWIW, I would have ignored any complaint about incompatible changes in
>> > HMP. It's not supposed to be a stable API, but UI.
>> 
>> The iffy part is actually the loss of ability to access snapshots that
>> lack a name.  Complaints about that would have been valid, I think.
>> Fortunately, there have been none.
>
> 'loadvm ""' should do the trick for these.

As long as you have at most one.

>                                            The same way as you have to
> use it with 'savevm' to create them in non-prehistoric versions of QEMU.
> We stopped creating snapshots with empty names by default in 0.14, so
> they are probably not very relevant any more. (Versioned machine types
> go back "only" to 1.0, so good luck loading a snapshot from an older
> version. And I wouldn't bet money either on a 1.0 snapshot still working
> with that machine type...)

No argument.


Re: [PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Markus Armbruster 3 years, 8 months ago
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Wed, Aug 26, 2020 at 05:52:06PM +0200, Markus Armbruster wrote:
>> Sorry for taking so long to reply.
>> 
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > A followup to:
>> >
>> >  v1: https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg00866.html
>> >
>> > When QMP was first introduced some 10+ years ago now, the snapshot
>> > related commands (savevm/loadvm/delvm) were not converted. This was
>> > primarily because their implementation causes blocking of the thread
>> > running the monitor commands. This was (and still is) considered
>> > undesirable behaviour both in HMP and QMP.
>> 
>> One of several reasons.
>> 
>> > In theory someone was supposed to fix this flaw at some point in the
>> > past 10 years and bring them into the QMP world. Sadly, thus far it
>> > hasn't happened as people always had more important things to work
>> > on. Enterprise apps were much more interested in external snapshots
>> > than internal snapshots as they have many more features.
>> 
>> Several attempts have been made to bring the functionality to QMP.
>> Sadly, they went nowhere.
>> 
>> I posted an analysis of the issues in reply to one of the more serious
>> attempts:
>> 
>>     Message-ID: <87lh7l783q.fsf@blackfin.pond.sub.org>
>>     https://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg03593.html
>> 
>> I'd like to hear your take on it.  I append the relevant part for your
>> convenience.  Perhaps your code is already close to what I describe
>> there.  I'm interested in where it falls short.
>> 
>> > Meanwhile users still want to use internal snapshots as there is
>> > a certainly simplicity in having everything self-contained in one
>> > image, even though it has limitations. Thus the apps that end up
>> > executing the savevm/loadvm/delvm via the "human-monitor-command"
>> > QMP command.
>> >
>> > IOW, the problematic blocking behaviour that was one of the reasons
>> > for not having savevm/loadvm/delvm in QMP is experienced by applications
>> > regardless. By not portting the commands to QMP due to one design flaw,
>> > we've forced apps and users to suffer from other design flaws of HMP (
>> > bad error reporting, strong type checking of args, no introspection) for
>> > an additional 10 years. This feels rather sub-optimal :-(
>> >
>> > In practice users don't appear to care strongly about the fact that these
>> > commands block the VM while they run. I might have seen one bug report
>> > about it, but it certainly isn't something that comes up as a frequent
>> > topic except among us QEMU maintainers. Users do care about having
>> > access to the snapshot feature.
>> >
>> > Where I am seeing frequent complaints is wrt the use of OVMF combined
>> > with snapshots which has some serious pain points. This is getting worse
>> > as the push to ditch legacy BIOS in favour of UEFI gain momentum both
>> > across OS vendors and mgmt apps. Solving it requires new parameters to
>> > the commands, but doing this in HMP is super unappealing.
>> >
>> > After 10 years, I think it is time for us to be a little pragmatic about
>> > our handling of snapshots commands. My desire is that libvirt should never
>> > use "human-monitor-command" under any circumstances, because of the
>> > inherant flaws in HMP as a protocol for machine consumption.
>> >
>> > Thus in this series I'm proposing a fairly direct mapping of the existing
>> > HMP commands for savevm/loadvm/delvm into QMP as a first step. This does
>> > not solve the blocking thread problem, but it does put in a place a
>> > design using the jobs framework which can facilitate solving it later.
>> > It does also solve the error reporting, type checking and introspection
>> > problems inherant to HMP. So we're winning on 3 out of the 4 problems,
>> > and pushed apps to a QMP design that will let us solve the last
>> > remaining problem.
>> >
>> > With a QMP variant, we reasonably deal with the problems related to OVMF:
>> >
>> >  - The logic to pick which disk to store the vmstate in is not
>> >    satisfactory.
>> >
>> >    The first block driver state cannot be assumed to be the root disk
>> >    image, it might be OVMF varstore and we don't want to store vmstate
>> >    in there.
>> 
>> Yes, this is one of the issues.  Glad you're addressing it.
>> 
>> >  - The logic to decide which disks must be snapshotted is hardwired
>> >    to all disks which are writable
>> >
>> >    Again with OVMF there might be a writable varstore, but this can be
>> >    raw rather than qcow2 format, and thus unable to be snapshotted.
>> >    While users might wish to snapshot their varstore, in some/many/most
>> >    cases it is entirely unnecessary. Users are blocked from snapshotting
>> >    their VM though due to this varstore.
>> 
>> Another one.  Glad again.
>> 
>> > These are solved by adding two parameters to the commands. The first is
>> > a block device node name that identifies the image to store vmstate in,
>> > and the second is a list of node names to include for the snapshots.
>> > If the list of nodes isn't given, it falls back to the historical
>> > behaviour of using all disks matching some undocumented criteria.
>> >
>> > In the block code I've only dealt with node names for block devices, as
>> > IIUC, this is all that libvirt should need in the -blockdev world it now
>> > lives in. IOW, I've made no attempt to cope with people wanting to use
>> > these QMP commands in combination with -drive args.
>> >
>> > I've done some minimal work in libvirt to start to make use of the new
>> > commands to validate their functionality, but this isn't finished yet.
>> >
>> > My ultimate goal is to make the GNOME Boxes maintainer happy again by
>> > having internal snapshots work with OVMF:
>> >
>> >   https://gitlab.gnome.org/GNOME/gnome-boxes/-/commit/c486da262f6566326fbcb5ef45c5f64048f16a6e
>> >
>> > HELP NEEDED:  this series starts to implement the approach that Kevin
>> > suggested wrto use of generic jobs.
>> 
>> Does this mean you're trying to use the jobs infrastructure?
>
> Yes, this is working now.
>
>
>> Relevant part of Message-ID: <87lh7l783q.fsf@blackfin.pond.sub.org>
>> 
>> If we can't make a sane QMP interface, I'd rather have no QMP interface.
>
> I strongly disagree with this. This absolutist position is why we
> have made zero progress in 10+ years, leaving mgmt apps suffering
> with HMP passthrough, as described above. 
>
> This is a prime example of perfect being the enemy of good.

You seem to use a different version of 'sane' than I do :)

For me, a 'sane QMP interface' does not have to be perfect.  It should

* solve a problem people have (check)

* be non-magical (clearly feasible; just a bit more work than "slap QMP
  around the existing --- and less than sane --- HMP commands")

* be as general as we can make it, given the resources (this is a matter
  of thinking through alternatives and picking a reasonable one)

* allow compatible evolution towards a more perfect interface (this is a
  matter of thinking ahead, to reduce the risk of boxing ourselves into
  a corner)

>> However, I believe we *can* make a sane QMP interface if we put in the
>> design work.
>
> We should make a credible attempt at QMP design, but if perfection
> isn't practical, we should none the less do *something* in QMP, even
> if we find we need to deprecate & replace it later.

All the prior iterations failed at "attempt", i.e. long before coming
anywhere near "credible".

>> The design work must start with a review of what we're trying to
>> accomplish, and how to fit it into the rest of the system.  Here's my
>> attempt.  Since my knowledge on snapshots is rather superficial, I'm
>> cc'ing Kevin for additional snapshot expertise.  Kevin, please correct
>> me when I talk nonsense.  I'm further cc'ing Eric and Peter for the
>> management layer perspective.
>
> The things I'm trying to accomplish as listed in the text above.
> Primarily this is about fixing the ability to snapshot guests where
> QEMU's heuristics for picking block devices are broken, i.e. explicitly
> list the disks to snapshot and which to store vmstate in.
>
> Converting from HMP to QMP is essentially an enabler, because adding
> new args to existing savevm/loadvm HMP commands is just too horrible
> to contemplate, as mgmt apps have no sane way to probe HMP. That's
> what QMP is for.
>
> Essentially the goal is to fix the inability to use internal snapshots
> with UEFI based VMs that is causing immediate pain for mgmt apps due
> to increased need to support UEFI.

I think I understand your goal.  Now try to understand mine: I want to
keep QMP supportable.  I'm willing to accept less than perfect
additions, but not when it's less than perfect just because people can't
be bothered to think about any other solution than the first that comes
to mind, in this case a 1:1 port of the existing HMP commands.

You've clearly put in some thought.  That's why I'm not sending you the
canned "the exact same thing has been tried before, I said no then
because of $reasons, the reasons haven't changed, and neither has my no"
letter.

>> A point-in-time snapshot of a system consists of a snapshot of its
>> machine state and snapshots of its storage.  All the snapshots need to
>> be made at the same point in time for the result to be consistent.
>> 
>> Snapshots of read-only storage carry no information and are commonly
>> omitted.
>> 
>> Isolated storage snapshots can make sense, but snapshotting the machine
>> state without also snapshotting the machine's storage doesn't sound
>> useful to me.
>> 
>> Both storage and machine state snapshots come in two flavours: internal
>> and external.
>> 
>> External ones can be made with any block backend, but internal storage
>> snapshots work only with certain formats, notably qcow2.  QMP supports
>> both kinds of storage snapshots.
>> 
>> Both kinds of storage snapshots need exclusive access while they work.
>> They're relatively quick, but the delay could be noticable for large
>> internal snapshots, and perhaps for external snapshots on really slow
>> storage.
>> 
>> Internal machine state snapshots are currently only available via HMP's
>> savevm, which integrates internal machine state and storage snapshots.
>> This is non-live, i.e. the guest is stopped while the snapshot gets
>> saved.  I figure we could make it live if we really wanted to.  Another
>> instance of the emerging background job concept.
>> 
>> On the implementation level, QCOW2 can't currently store a machine state
>> snapshot without also storing a storage snapshot.  I guess we could
>> change this if we really wanted to.
>> 
>> External machine state snapshots are basically migrate to file.
>> Supported by QMP.
>> 
>> Live migration to file is possible, but currently wastes space, because
>> memory dirtied during migration gets saved multiple times.  Could be
>> fixed either by making migration update previously saved memory instead
>> of appending (beware, random I/O), or by compacting the file afterwards.
>> 
>> Non-live migration to file doesn't waste space that way.
>> 
>> To take multiple *consistent* snapshots, you have to bundle them up in a
>> transaction.  Transactions currently support only *storage* snapshots,
>> though.
>> 
>> Work-around for external machine state snapshot: migrate to file
>> (possibly live), leaving the guest stopped on completion, take storage
>> snapshots, resume guest.
>> 
>> You can combine internal and external storage snapshots with an external
>> machine state snapshot to get a mixed system snapshot.
>> 
>> You currently can't do that with an internal machine state snapshot: the
>> only way to take one is HMP savevm, which insists on internally
>> snapshotting all writable storage, and doesn't transact together with
>> external storage snapshots.
>> 
>> Except for the case "purely internal snapshot with just one writable
>> storage device", a system snapshot consists of multiple parts stored in
>> separate files.  Tying the parts together is a management problem.  QEMU
>> provides rudimentary management for purely internal snapshots, but it's
>> flawed: missing storage isn't detected, and additional storage can creep
>> in if snapshot tags or IDs happen to match.  I guess managing the parts
>> is better left to the management layer.
>> 
>> I figure a fully general QMP interface would let you perform a system
>> snapshot by combining storage snapshots of either kind with either kind
>> of machine state snapshot.
>> 
>> We already have most of the building blocks: we can take external and
>> internal storage snapshots, and combine them in transactions.
>> 
>> What's missing is transactionable machine state snapshots.
>> 
>> We know how to work around it for external machine state snapshots (see
>> above), but a transaction-based solution might be nicer.
>> 
>> Any solution for internal machine state snapshots in QMP should at least
>> try to fit into this.  Some restrictions are okay.  For instance, we
>> probably need to restrict internal machine state snapshots to piggyback
>> on an internal storage snapshot for now, so we don't have to dig up
>> QCOW2 just to get QMP support.
>
> From the POV of practicality, making a design that unifies internal
> and external snapshots is something I'm considering out of scope.
> It increases the design time burden, as well as implementation burden.
> On my side, improving internal snapshots is a "spare time" project,
> not something I can justify spending weeks or months on.

I'm not demanding a solution that unifies internal and external
snapshots.  I'm asking for a bit of serious thought on an interface that
could compatibly evolve there.  Hours, not weeks or months.

> My goal is to implement something that is achievable in a short
> amount of time that gets us out of the hole we've been in for 10
> years. Minimal refactoring of the internal snapshot code aside
> from fixing the critical limitations we have today around choice
> of disks to snapshot.
>
> If someone later wants to come up with a grand unified design
> for everything that's fine, we can deprecate the new QMP commands
> I'm proposing now.

Failing at coming up with an interface that has a reasonable chance to
be future-proof is okay.

Not even trying is not okay.

Having not tried for 10+ years doesn't make it okay.

I believe I already did much of the design work back in 2016.  Now I'd
like you to engage with it a bit deeper than just "perfect is the enemy
of good; now take my patches, please".

Specifically, I'd like you to think about a monolithic snapshot command
(internal snapshots only by design) vs. a transaction of individual
snapshot commands (design is not restricted to internal snapshots, but
we may want to accept implementation restrictions).

We already have transactionable individual storage snapshot commands.
What's missing is a transactionable machine state snapshot command.

One restriction I'd readily accept at this time is "the machine state
snapshot must write to a QCOW2 that is also internally snapshot in the
same transaction".

Now explain to me why this is impractical.

>> We can talk about more convenient interfaces for common special cases,
>> but I feel we need to design for the general case.  We don't have to
>> implement the completely general case right away, though.  As long as we
>> know where we want to go, incremental steps towards that goal are fine.
>> 
>> Can we create a transactionable QMP command to take an internal machine
>> state snapshot?
>> 
>> This would be like HMP savevm with the following differences:
>> 
>> * Separate parameters for tag and ID.  I'll have none of this
>>   overloading nonsense in QMP.
>> 
>> * Specify the destination block device.  I'll have none of this "pick a
>>   device in some magic, undocumented way" in QMP.
>> 
>> * Leave alone the other block devices.  Adding transaction actions to
>>   snapshot all the writable block devices to get a full system snapshot
>>   is the user's responsibility.
>> 
>> Limitations:
>> 
>> * No live internal machine snapshot, yet.
>> 
>> * The storage device taking the internal snapshot must also be
>>   internally snapshot for now.  In fact, the command does both
>>   (tolerable wart).
>> 
>> Open questions:
>> 
>> * Do we want the QMP command to delete existing snapshots with
>>   conflicting tag / ID, like HMP savevm does?  Or do we want it to fail
>>   the transaction?
>
> The intent is for the QMP commands to operate exclusively on
> 'tags', and never consider "ID".
>
>> * Do we need transactions for switching to a system snapshot, too?
>> 
>> Opinions?
>
>
>
> Regards,
> Daniel


Re: [PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Daniel P. Berrangé 3 years, 8 months ago
On Thu, Aug 27, 2020 at 01:04:43PM +0200, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Wed, Aug 26, 2020 at 05:52:06PM +0200, Markus Armbruster wrote:
> > From the POV of practicality, making a design that unifies internal
> > and external snapshots is something I'm considering out of scope.
> > It increases the design time burden, as well as implementation burden.
> > On my side, improving internal snapshots is a "spare time" project,
> > not something I can justify spending weeks or months on.
> 
> I'm not demanding a solution that unifies internal and external
> snapshots.  I'm asking for a bit of serious thought on an interface that
> could compatibly evolve there.  Hours, not weeks or months.
> 
> > My goal is to implement something that is achievable in a short
> > amount of time that gets us out of the hole we've been in for 10
> > years. Minimal refactoring of the internal snapshot code aside
> > from fixing the critical limitations we have today around choice
> > of disks to snapshot.
> >
> > If someone later wants to come up with a grand unified design
> > for everything that's fine, we can deprecate the new QMP commands
> > I'm proposing now.
> 
> Failing at coming up with an interface that has a reasonable chance to
> be future-proof is okay.
> 
> Not even trying is not okay.

This was raised in my v1 posting:

  https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01346.html

but the conclusion was that it was a non-trivial amount of extra
implementation work.


> Specifically, I'd like you to think about monolothic snapshot command
> (internal snapshots only by design) vs. transaction of individual
> snapshot commands (design is not restricted to internal snapshots, but
> we may want to accept implementation restrictions).
> 
> We already have transactionable individual storage snapshot commands.
> What's missing is a transactionable machine state snapshot command.

At a high level I consider what I've proposed as being higher level
syntax sugar vs a more generic future impl based on multiple commands
for snapshotting disk & state. I don't think I'd claim that it will
evolve to become the design you're suggesting here, as they are designs
from different levels in the stack. IOW, I think they would ultimately
just exist in parallel. I don't think that's a real problem from a
maint POV, as the large burden from the monolithic snapshot code is
not the HMP/QMP interface, but rather the guts of the internal
impl in the migration/savevm.c and block/snapshot.c files. That code
will exist for as long as the HMP commands exist, and adding a QMP
interface doesn't make that situation worse unless we were intending
to drop the existing HMP commands. If we did change our minds though,
we can just deprecate the QMP command at any time we like.


> One restriction I'd readily accept at this time is "the machine state
> snapshot must write to a QCOW2 that is also internally snapshot in the
> same transaction".
> 
> Now explain to me why this is impractical.

The issues were described by Kevin here:

  https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg02057.html

Assuming the migration impl is actually possible to solve, there is
still the question of actually writing it. That's a non-trivial
amount of work someone has to find time for.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Markus Armbruster 3 years, 8 months ago
Daniel P. Berrangé <berrange@redhat.com> writes:

> On Thu, Aug 27, 2020 at 01:04:43PM +0200, Markus Armbruster wrote:
>> Daniel P. Berrangé <berrange@redhat.com> writes:
>> 
>> > On Wed, Aug 26, 2020 at 05:52:06PM +0200, Markus Armbruster wrote:
>> > From the POV of practicality, making a design that unifies internal
>> > and external snapshots is something I'm considering out of scope.
>> > It increases the design time burden, as well as implementation burden.
>> > On my side, improving internal snapshots is a "spare time" project,
>> > not something I can justify spending weeks or months on.
>> 
>> I'm not demanding a solution that unifies internal and external
>> snapshots.  I'm asking for a bit of serious thought on an interface that
>> could compatibly evolve there.  Hours, not weeks or months.
>> 
>> > My goal is to implement something that is achievable in a short
>> > amount of time that gets us out of the hole we've been in for 10
>> > years. Minimal refactoring of the internal snapshot code aside
>> > from fixing the critical limitations we have today around choice
>> > of disks to snapshot.
>> >
>> > If someone later wants to come up with a grand unified design
>> > for everything that's fine, we can deprecate the new QMP commands
>> > I'm proposing now.
>> 
>> Failing at coming up with an interface that has a reasonable chance to
>> be future-proof is okay.
>> 
>> Not even trying is not okay.
>
> This was raised in my v1 posting:
>
>   https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01346.html
>
> but the conclusion was that it was a non-trivial amount of extra
> implementation work

Thanks for the pointer.  I've now read that review thread.

>> Specifically, I'd like you to think about monolothic snapshot command
>> (internal snapshots only by design) vs. transaction of individual
>> snapshot commands (design is not restricted to internal snapshots, but
>> we may want to accept implementation restrictions).
>> 
>> We already have transactionable individual storage snapshot commands.
>> What's missing is a transactionable machine state snapshot command.
>
> At a high level I consider what I've proposed as being higher level
> syntax sugar vs a more generic future impl based on multiple commands
> for snapshotting disk & state. I don't think I'd claim that it will
> evolve to become the design you're suggesting here, as they are designs
> from different levels in the stack. IOW, I think the would ultimately
> just exist in parallel. I don't think that's a real problem from a
> maint POV, as the large burden from the monolithic snapshot code is
> not the HMP/QMP interface, but rather the guts of the internal
> impl in the migration/savevm.c and block/snapshot.c files. That code
> will exist for as long as the HMP commands exist, and adding a QMP
> interface doesn't make that situation worse unless we were intending
> to drop the existing HMP commands. If we did change our minds though,
> we can just deprecate the QMP command at any time we like.
>
>
>> One restriction I'd readily accept at this time is "the machine state
>> snapshot must write to a QCOW2 that is also internally snapshot in the
>> same transaction".
>> 
>> Now explain to me why this is impractical.
>
> The issues were described by Kevin here:
>
>   https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg02057.html
>
> Assuming the migration impl is actually possible to solve, there is
> still the question of actually writing it. That's a non-trivial
> amount of work someone has to find time for.

Kevin explained how the transactionable machine state snapshot command
should be made non-blocking: post-copy.

I don't dispute that creating such a post-copy snapshot is a non-trivial
task.  It is out of reach for you and me.  I didn't actually ask for it,
though.

You argue that providing a blocking snapshot in QMP is better than
nothing, and good enough for quite a few applications.  I agree!  I
blocked prior attempts at porting HMP's savevm/loadvm to QMP not because
they were blocking, but because they stuck to the HMP interface, and the
HMP interface is bonkers.  I would accept the restriction "snapshotting
machine state is blocking, i.e. it stops the machine."  As I wrote in
2016, "Limitations: No live internal machine snapshot, yet."

Aside: unless I'm mistaken, taking an internal block device snapshot is
also blocking, but unlike taking a machine state snapshot, it's fast
enough for the blocking not to matter.  That's the "sync" in
blockdev-snapshot-internal-sync.

I asked you to consider the interface design I proposed back in 2016.
You point out above that your interface is more high-level, and could be
turned into sugar for a low level interface.

If true, this means your proposal doesn't box us into a corner, which is
good.

Let me elaborate a bit on the desugaring, just to make sure we're on the
same page.  Please correct me where I'm talking nonsense.

snapshot-save creates a job that snapshots a set of block devices and the
machine state.  The snapshots are consistent, i.e. they are all taken at
the same point in time.  The block device snapshots are all internal.
The machine state snapshot is saved together with one of the (internal)
block device snapshots.

This is basically a transaction of blockdev-snapshot-internal-sync
(which exists) plus machine-snapshot-internal-sync (which doesn't exist)
wrapped in a job.

Likewise for snapshot-load, except there not even the command for block
snapshots exists.

I doubt creating the (transactionable, but blocking) low-level commands
is actually out of reach.  It's a matter of adding interfaces to parts
of the code you already got working.

I'm not demanding you do that, though.  As I said, my chief concerns are
keeping "bonkers" out of QMP, and not boxing us in needlessly.


Re: [PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Daniel P. Berrangé 3 years, 8 months ago
On Tue, Sep 01, 2020 at 03:22:24PM +0200, Markus Armbruster wrote:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Thu, Aug 27, 2020 at 01:04:43PM +0200, Markus Armbruster wrote:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> >> 
> >> > On Wed, Aug 26, 2020 at 05:52:06PM +0200, Markus Armbruster wrote:
> >> > From the POV of practicality, making a design that unifies internal
> >> > and external snapshots is something I'm considering out of scope.
> >> > It increases the design time burden, as well as implementation burden.
> >> > On my side, improving internal snapshots is a "spare time" project,
> >> > not something I can justify spending weeks or months on.
> >> 
> >> I'm not demanding a solution that unifies internal and external
> >> snapshots.  I'm asking for a bit of serious thought on an interface that
> >> could compatibly evolve there.  Hours, not weeks or months.
> >> 
> >> > My goal is to implement something that is achievable in a short
> >> > amount of time that gets us out of the hole we've been in for 10
> >> > years. Minimal refactoring of the internal snapshot code aside
> >> > from fixing the critical limitations we have today around choice
> >> > of disks to snapshot.
> >> >
> >> > If someone later wants to come up with a grand unified design
> >> > for everything that's fine, we can deprecate the new QMP commands
> >> > I'm proposing now.
> >> 
> >> Failing at coming up with an interface that has a reasonable chance to
> >> be future-proof is okay.
> >> 
> >> Not even trying is not okay.
> >
> > This was raised in my v1 posting:
> >
> >   https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01346.html
> >
> > but the conclusion was that it was a non-trivial amount of extra
> > implementation work
> 
> Thanks for the pointer.  I've now read that review thread.
> 
> >> Specifically, I'd like you to think about monolothic snapshot command
> >> (internal snapshots only by design) vs. transaction of individual
> >> snapshot commands (design is not restricted to internal snapshots, but
> >> we may want to accept implementation restrictions).
> >> 
> >> We already have transactionable individual storage snapshot commands.
> >> What's missing is a transactionable machine state snapshot command.
> >
> > At a high level I consider what I've proposed as being higher level
> > syntax sugar vs a more generic future impl based on multiple commands
> > for snapshotting disk & state. I don't think I'd claim that it will
> > evolve to become the design you're suggesting here, as they are designs
> > from different levels in the stack. IOW, I think the would ultimately
> > just exist in parallel. I don't think that's a real problem from a
> > maint POV, as the large burden from the monolithic snapshot code is
> > not the HMP/QMP interface, but rather the guts of the internal
> > impl in the migration/savevm.c and block/snapshot.c files. That code
> > will exist for as long as the HMP commands exist, and adding a QMP
> > interface doesn't make that situation worse unless we were intending
> > to drop the existing HMP commands. If we did change our minds though,
> > we can just deprecate the QMP command at any time we like.
> >
> >
> >> One restriction I'd readily accept at this time is "the machine state
> >> snapshot must write to a QCOW2 that is also internally snapshot in the
> >> same transaction".
> >> 
> >> Now explain to me why this is impractical.
> >
> > The issues were described by Kevin here:
> >
> >   https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg02057.html
> >
> > Assuming the migration impl is actually possible to solve, there is
> > still the question of actually writing it. That's a non-trivial
> > amount of work someone has to find time for.
> 
> Kevin explained how the transactionable machine state snapshot command
> should be made non-blocking: post-copy.
> 
> I don't dispute that creating such a post-copy snapshot is a non-trivial
> task.  It is out of reach for you and me.  I didn't actually ask for it,
> though.
> 
> You argue that providing a blocking snapshot in QMP is better than
> nothing, and good enough for quite a few applications.  I agree!  I
> blocked prior attempts at porting HMP's savevm/loadvm to QMP not because
> they were blocking, but because they stuck to the HMP interface, and the
> HMP interface is bonkers.  I would accept the restriction "snapshotting
> machine state is blocking, i.e. it stops the machine."  As I wrote in
> 2016, "Limitations: No live internal machine snapshot, yet."

FYI, when I documented the new QAPI commands I implemented, I chose to
*not* say that the snapshot is blocking. Instead I said:

  # Applications should not assume that the snapshot load is complete
  # when this command returns. Completion is indicated by the job
  # status. Clients can wait for the JOB_STATUS_CHANGE event. If the
  # job aborts, errors can be obtained via the 'query-jobs' command,
  # though.

The idea was that if we fix these QAPI commands to not block in future,
we minimize the risk of breaking clients by discouraging them from
assuming the impl will always be blocking. IOW, they should assume the
commands are asynchronous right now, even though they are not.
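
To make that concrete, the client flow looks roughly like this
(abbreviated; node names are made up, and the exact field names may
differ from what is in this series):

  -> { "execute": "snapshot-load",
       "arguments": { "job-id": "snapload0",
                      "tag": "my-snap",
                      "vmstate": "disk0",
                      "devices": [ "disk0", "disk1" ] } }
  <- { "return": {} }

  ... intermediate JOB_STATUS_CHANGE events elided ...

  <- { "event": "JOB_STATUS_CHANGE",
       "data": { "id": "snapload0", "status": "concluded" } }
  -> { "execute": "query-jobs" }
  <- { "return": [ { "id": "snapload0", "type": "snapshot-load",
                     "status": "concluded",
                     "current-progress": 1, "total-progress": 1 } ] }
  -> { "execute": "job-dismiss", "arguments": { "id": "snapload0" } }
  <- { "return": {} }

The immediate "return" only means the job was created; whether the load
actually succeeded is only known once the job reaches a terminal state.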

> Aside: unless I'm mistaken, taking an internal block device snapshot is
> also blocking, but unlike taking a machine state snapshot, it's fast
> enough for the blocking not to matter.  That's the "sync" in
> blockdev-snapshot-internal-sync.


> Let me elaborate a bit on the desugaring, just to make sure we're on the
> same page.  Please correct me where I'm talking nonsense.
> 
> snapshot-save creates job that snapshots a set of block devices and the
> machine state.  The snapshots are consistent, i.e. they are all taken at
> the same point in time.  The block device snapshots are all internal.
> The machine state snapshot is saved together with one of the (internal)
> block device snapshots.

Yep

> This is basically a transaction of blockdev-snapshot-internal-sync
> (which exists) plus machine-snapshot-internal-sync (which doesn't exist)
> wrapped in a job.

Yes, except it isn't clear to me whether you can separate out the
functionality into two separate commands. It might be necessary
to save the vmstate at the same time as the disk snapshot. In such
a case, instead of machine-snapshot-internal-sync, we might end up
having an "include vmstate" bool option to blockdev-snapshot-internal-sync.
Either way though, we'd end up with a series of commands inside a
transaction.
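
For illustration only (the "vmstate" flag and node names are
hypothetical), that variant might look like:

  { "execute": "transaction",
    "arguments": {
      "actions": [
        { "type": "blockdev-snapshot-internal-sync",
          "data": { "device": "disk0", "name": "snap1", "vmstate": true } },
        { "type": "blockdev-snapshot-internal-sync",
          "data": { "device": "disk1", "name": "snap1" } }
      ] } }

where "disk0" would carry both the vmstate and its disk snapshot, while
"disk1" gets only a disk snapshot.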

> Likewise for snapshot-load, except there not even the command for block
> snapshots exists.
> 
> I doubt creating the (transactionable, but blocking) low-level commands
> is actually out of reach.  It's a matter of adding interfaces to parts
> of the code you already got working.

If we're splitting it up into one command per disk, and another command
for vmstate, then it will require refactoring the current migration/savevm.c
and block/snapshot.c code AFAICT, because it's all written around the idea
of processing all disks at the same time.

> I'm not demanding you do that, though.  As I said, my chief concerns are
> keeping "bonkers" out of QMP, and not boxing us in needlessly.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


Re: [PATCH v2 (BROKEN) 0/6] migration: bring improved savevm/loadvm/delvm to QMP
Posted by Kevin Wolf 3 years, 8 months ago
Am 01.09.2020 um 15:22 hat Markus Armbruster geschrieben:
> Daniel P. Berrangé <berrange@redhat.com> writes:
> 
> > On Thu, Aug 27, 2020 at 01:04:43PM +0200, Markus Armbruster wrote:
> >> Daniel P. Berrangé <berrange@redhat.com> writes:
> >> 
> >> > On Wed, Aug 26, 2020 at 05:52:06PM +0200, Markus Armbruster wrote:
> >> > From the POV of practicality, making a design that unifies internal
> >> > and external snapshots is something I'm considering out of scope.
> >> > It increases the design time burden, as well as implementation burden.
> >> > On my side, improving internal snapshots is a "spare time" project,
> >> > not something I can justify spending weeks or months on.
> >> 
> >> I'm not demanding a solution that unifies internal and external
> >> snapshots.  I'm asking for a bit of serious thought on an interface that
> >> could compatibly evolve there.  Hours, not weeks or months.
> >> 
> >> > My goal is to implement something that is achievable in a short
> >> > amount of time that gets us out of the hole we've been in for 10
> >> > years. Minimal refactoring of the internal snapshot code aside
> >> > from fixing the critical limitations we have today around choice
> >> > of disks to snapshot.
> >> >
> >> > If someone later wants to come up with a grand unified design
> >> > for everything that's fine, we can deprecate the new QMP commands
> >> > I'm proposing now.
> >> 
> >> Failing at coming up with an interface that has a reasonable chance to
> >> be future-proof is okay.
> >> 
> >> Not even trying is not okay.
> >
> > This was raised in my v1 posting:
> >
> >   https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg01346.html
> >
> > but the conclusion was that it was a non-trivial amount of extra
> > implementation work
> 
> Thanks for the pointer.  I've now read that review thread.
> 
> >> Specifically, I'd like you to think about monolothic snapshot command
> >> (internal snapshots only by design) vs. transaction of individual
> >> snapshot commands (design is not restricted to internal snapshots, but
> >> we may want to accept implementation restrictions).
> >> 
> >> We already have transactionable individual storage snapshot commands.
> >> What's missing is a transactionable machine state snapshot command.
> >
> > At a high level I consider what I've proposed as being higher level
> > syntax sugar vs a more generic future impl based on multiple commands
> > for snapshotting disk & state. I don't think I'd claim that it will
> > evolve to become the design you're suggesting here, as they are designs
> > from different levels in the stack. IOW, I think the would ultimately
> > just exist in parallel. I don't think that's a real problem from a
> > maint POV, as the large burden from the monolithic snapshot code is
> > not the HMP/QMP interface, but rather the guts of the internal
> > impl in the migration/savevm.c and block/snapshot.c files. That code
> > will exist for as long as the HMP commands exist, and adding a QMP
> > interface doesn't make that situation worse unless we were intending
> > to drop the existing HMP commands. If we did change our minds though,
> > we can just deprecate the QMP command at any time we like.
> >
> >
> >> One restriction I'd readily accept at this time is "the machine state
> >> snapshot must write to a QCOW2 that is also internally snapshot in the
> >> same transaction".
> >> 
> >> Now explain to me why this is impractical.
> >
> > The issues were described by Kevin here:
> >
> >   https://lists.gnu.org/archive/html/qemu-devel/2020-07/msg02057.html
> >
> > Assuming the migration impl is actually possible to solve, there is
> > still the question of actually writing it. That's a non-trivial
> > amount of work someone has to find time for.
> 
> Kevin explained how the transactionable machine state snapshot command
> should be made non-blocking: post-copy.
> 
> I don't dispute that creating such a post-copy snapshot is a non-trivial
> task.  It is out of reach for you and me.  I didn't actually ask for it,
> though.
> 
> You argue that providing a blocking snapshot in QMP is better than
> nothing, and good enough for quite a few applications.  I agree!  I
> blocked prior attempts at porting HMP's savevm/loadvm to QMP not because
> they were blocking, but because they stuck to the HMP interface, and the
> HMP interface is bonkers.  I would accept the restriction "snapshotting
> machine state is blocking, i.e. it stops the machine."  As I wrote in
> 2016, "Limitations: No live internal machine snapshot, yet."
> 
> Aside: unless I'm mistaken, taking an internal block device snapshot is
> also blocking, but unlike taking a machine state snapshot, it's fast
> enough for the blocking not to matter.  That's the "sync" in
> blockdev-snapshot-internal-sync.
> 
> I asked you to consider the interface design I proposed back in 2016.
> You point out above that your interface is more high-level, and could be
> turned into sugar for a low level interface.
> 
> If true, this means your proposal doesn't box us into a corner, which is
> good.
> 
> Let me elaborate a bit on the desugaring, just to make sure we're on the
> same page.  Please correct me where I'm talking nonsense.
> 
> snapshot-save creates job that snapshots a set of block devices and the
> machine state.  The snapshots are consistent, i.e. they are all taken at
> the same point in time.  The block device snapshots are all internal.
> The machine state snapshot is saved together with one of the (internal)
> block device snapshots.
> 
> This is basically a transaction of blockdev-snapshot-internal-sync
> (which exists) plus machine-snapshot-internal-sync (which doesn't exist)
> wrapped in a job.
> 
> Likweise for snapshot-load, except there not even the command for block
> snapshots exists.
> 
> I doubt creating the (transactionable, but blocking) low-level commands
> is actually out of reach.  It's a matter of adding interfaces to parts
> of the code you already got working.
> 
> I'm not demanding you do that, though.  As I said, my chief concerns are
> keeping "bonkers" out of QMP, and not boxing us in needlessly.

I doubt this is as easy as it may seem at the first sight. To remind
everyone, the way internal snapshots with VM state work today is:

1. Write VM state to VM state area inside the image (in qcow2, this is
   essentially the same as disk content, except at offsets higher than
   the image size).

2. Create the qcow2 snapshot, which means that the current disk content
   (including the VM state at higher offsets) becomes COW and the
   snapshotted copy becomes read-only.

3. Discard the VM state in the active layer again, we don't need it
   there.

The implication is that either 1. has to be completed before 2. can
happen, or that 2. must be able to write into an already taken snapshot
rather than to the active layer (which in turn implies that the snapshot
must have completed).

So a naive implementation with a transaction might not give the right
result. It's not just independent operations, but some ordering between
them is required.

I feel having a single block job that covers both parts gives us more
flexibility in how we want to synchronise writing the VM state and
taking the disk snapshot - or in your words, avoids boxing us into a
corner.

Kevin