[PATCH V3 00/22] Live Update
Posted by Steve Sistare 2 years, 11 months ago
Provide the cprsave and cprload commands for live update.  These save and
restore VM state, with minimal guest pause time, so that qemu may be updated
to a new version in between.

cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
/usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
paused state and waits for the cprload command.

To use the restart mode, qemu must be started with the memfd-alloc option,
which allocates guest ram using memfd_create.  The memfds are saved to
the environment and kept open across exec, after which they are found from
the environment and re-mmap'd.  Hence guest ram is preserved in place,
albeit with new virtual addresses in the qemu process.  The caller resumes
the guest by calling cprload, which loads state from the file.  If the VM
was running at cprsave time, then VM execution resumes.  cprsave supports
any type of guest image and block device, but the caller must not modify
guest block devices between cprsave and cprload.
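
For illustration, a minimal sketch of the mechanism, assuming a single
ram region and an illustrative variable name (QEMU_RAM_FD); the series'
actual helpers differ:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /* Before exec: allocate guest ram from a memfd and publish the fd. */
    int save_ram_fd(size_t size)
    {
        int fd = memfd_create("pc.ram", 0);        /* no MFD_CLOEXEC */
        char buf[16];

        if (fd < 0 || ftruncate(fd, size) < 0) {
            return -1;
        }
        snprintf(buf, sizeof(buf), "%d", fd);
        setenv("QEMU_RAM_FD", buf, 1);             /* survives the exec */
        return fd;
    }

    /* After exec: find the fd in the environment and map it again. */
    void *restore_ram(size_t size)
    {
        const char *val = getenv("QEMU_RAM_FD");
        int fd = val ? atoi(val) : -1;

        if (fd < 0) {
            return NULL;
        }
        /* Same pages as before the exec, at a new virtual address. */
        return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }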

The restart mode supports vfio devices by preserving the vfio container,
group, device, and event descriptors across the qemu re-exec, and by
updating DMA mapping virtual addresses using VFIO_DMA_UNMAP_FLAG_VADDR and
VFIO_DMA_MAP_FLAG_VADDR as defined in https://lore.kernel.org/kvm/1611939252-7240-1-git-send-email-steven.sistare@oracle.com/
and integrated in Linux kernel 5.12.
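
The vaddr update for one dma mapping looks roughly like this (a sketch
using the 5.12 uapi definitions; error handling omitted):

    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    int remap_vaddr(int container, __u64 iova, __u64 size, __u64 new_vaddr)
    {
        struct vfio_iommu_type1_dma_unmap unmap = {
            .argsz = sizeof(unmap),
            .flags = VFIO_DMA_UNMAP_FLAG_VADDR, /* invalidate vaddr only */
            .iova = iova,
            .size = size,
        };
        struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_VADDR,   /* supply the new vaddr */
            .vaddr = new_vaddr,
            .iova = iova,
            .size = size,
        };

        /* The iommu mapping and pinned pages stay intact throughout. */
        if (ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap) < 0) {
            return -1;
        }
        return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    }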

For the reboot mode, cprsave saves state and exits qemu; the caller may then
update the host kernel and system software and reboot.  The caller resumes
the guest by running qemu with the same arguments as the original process
and calling cprload.  To use this mode, guest ram must be mapped to a
persistent shared memory file such as /dev/dax0.0, or to /dev/shm preserved
across reboot by PKRAM as proposed in
https://lore.kernel.org/lkml/1617140178-8773-1-git-send-email-anthony.yznaga@oracle.com.

The reboot mode supports vfio devices if the caller suspends the guest
instead of stopping the VM, such as by issuing guest-suspend-ram to the
qemu guest agent.  The guest drivers' suspend methods flush outstanding
requests and re-initialize the devices, and thus there is no device state
to save and restore.
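
For example, the caller might suspend the guest with a command like the
following before invoking cprsave (the agent socket path is illustrative):

    $ echo '{"execute": "guest-suspend-ram"}' | \
          socat - UNIX-CONNECT:/var/run/qga.sock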

The first patches add helper functions:

  - as_flat_walk
  - qemu_ram_volatile
  - oslib: qemu_clr_cloexec
  - util: env var helpers
  - machine: memfd-alloc option
  - vl: add helper to request re-exec

The next patches implement cprsave and cprload (example invocations below):

  - cpr
  - cpr: QMP interfaces
  - cpr: HMP interfaces
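
For example, the QMP invocations look roughly like this (argument names
here are illustrative; see qapi/cpr.json for the actual schema):

    { "execute": "cprsave",
      "arguments": { "file": "/tmp/qemu.sav", "mode": "restart" } }
    ...
    { "execute": "cprload", "arguments": { "file": "/tmp/qemu.sav" } }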

The next patches add vfio support for the restart mode:

  - pci: export functions for cpr
  - vfio-pci: refactor for cpr
  - vfio-pci: cpr part 1
  - vfio-pci: cpr part 2

The next patches preserve various descriptor-based backend devices across
a cprsave restart:

  - vhost: reset vhost devices upon cprsave
  - hostmem-memfd: cpr support
  - chardev: cpr framework
  - chardev: cpr for simple devices
  - chardev: cpr for pty
  - chardev: cpr for sockets
  - cpr: only-cpr-capable option
  - cpr: maintainers
  - simplify savevm

Here is an example of updating qemu from v4.2.0 to v4.2.1 using
cprsave in restart mode.  The software update is performed while the guest is
running to minimize downtime.

window 1				| window 2
					|
# qemu-system-x86_64 ... 		|
QEMU 4.2.0 monitor - type 'help' ...	|
(qemu) info status			|
VM status: running			|
					| # yum update qemu
(qemu) cprsave /tmp/qemu.sav restart	|
QEMU 4.2.1 monitor - type 'help' ...	|
(qemu) info status			|
VM status: paused (prelaunch)		|
(qemu) cprload /tmp/qemu.sav		|
(qemu) info status			|
VM status: running			|


Here is an example of updating the host kernel using cprsave in reboot mode.

window 1					| window 2
						|
# qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
QEMU 4.2.1 monitor - type 'help' ...		|
(qemu) info status				|
VM status: running				|
						| # yum update kernel-uek
(qemu) cprsave /tmp/qemu.sav reboot		|
						|
# systemctl kexec				|
kexec_core: Starting new kernel			|
...						|
						|
# qemu-system-x86_64 ...mem-path=/dev/dax0.0 ...|
QEMU 4.2.1 monitor - type 'help' ...		|
(qemu) info status				|
VM status: paused (prelaunch)			|
(qemu) cprload /tmp/qemu.sav			|
(qemu) info status				|
VM status: running				|

Changes from V1 to V2:
  - revert vmstate infrastructure changes
  - refactor cpr functions into new files
  - delete MADV_DOEXEC and use memfd + VFIO_DMA_UNMAP_FLAG_SUSPEND to 
    preserve memory.
  - add a framework to filter chardevs that support cpr
  - save and restore vfio eventfds
  - modify cprinfo QMP interface
  - incorporate misc review feedback
  - remove unrelated and unneeded patches
  - refactor all patches into a shorter and easier to review series

Changes from V2 to V3:
  - rebase to qemu 6.0.0
  - use final definition of vfio ioctls (VFIO_DMA_UNMAP_FLAG_VADDR etc)
  - change memfd-alloc to a machine option
  - use existing channel socket function instead of defining new ones
  - close monitor socket during cpr
  - support memory-backend-memfd
  - fix a few unreported bugs

Steve Sistare (18):
  as_flat_walk
  qemu_ram_volatile
  oslib: qemu_clr_cloexec
  util: env var helpers
  machine: memfd-alloc option
  vl: add helper to request re-exec
  cpr
  pci: export functions for cpr
  vfio-pci: refactor for cpr
  vfio-pci: cpr part 1
  vfio-pci: cpr part 2
  hostmem-memfd: cpr support
  chardev: cpr framework
  chardev: cpr for simple devices
  chardev: cpr for pty
  cpr: only-cpr-capable option
  cpr: maintainers
  simplify savevm

Mark Kanda, Steve Sistare (4):
  cpr: QMP interfaces
  cpr: HMP interfaces
  vhost: reset vhost devices upon cprsave
  chardev: cpr for sockets

 MAINTAINERS                   |  11 +++
 backends/hostmem-memfd.c      |  21 +++--
 chardev/char-mux.c            |   1 +
 chardev/char-null.c           |   1 +
 chardev/char-pty.c            |  15 ++-
 chardev/char-serial.c         |   1 +
 chardev/char-socket.c         |  35 +++++++
 chardev/char-stdio.c          |   8 ++
 chardev/char.c                |  41 +++++++-
 gdbstub.c                     |   1 +
 hmp-commands.hx               |  44 +++++++++
 hw/core/machine.c             |  19 ++++
 hw/pci/msi.c                  |   4 +
 hw/pci/msix.c                 |  20 ++--
 hw/pci/pci.c                  |   7 +-
 hw/vfio/common.c              |  68 +++++++++++++-
 hw/vfio/cpr.c                 | 131 ++++++++++++++++++++++++++
 hw/vfio/meson.build           |   1 +
 hw/vfio/pci.c                 | 214 ++++++++++++++++++++++++++++++++++++++----
 hw/vfio/trace-events          |   1 +
 hw/virtio/vhost.c             |  11 +++
 include/chardev/char.h        |   6 ++
 include/exec/memory.h         |  25 +++++
 include/hw/boards.h           |   1 +
 include/hw/pci/msix.h         |   5 +
 include/hw/pci/pci.h          |   2 +
 include/hw/vfio/vfio-common.h |   8 ++
 include/hw/virtio/vhost.h     |   1 +
 include/migration/cpr.h       |  17 ++++
 include/monitor/hmp.h         |   3 +
 include/qemu/env.h            |  23 +++++
 include/qemu/osdep.h          |   1 +
 include/sysemu/runstate.h     |   2 +
 include/sysemu/sysemu.h       |   2 +
 linux-headers/linux/vfio.h    |  27 ++++++
 migration/cpr.c               | 200 +++++++++++++++++++++++++++++++++++++++
 migration/meson.build         |   1 +
 migration/migration.c         |   5 +
 migration/savevm.c            |  21 ++---
 migration/savevm.h            |   2 +
 monitor/hmp-cmds.c            |  48 ++++++++++
 monitor/hmp.c                 |   3 +
 monitor/qmp-cmds.c            |  31 ++++++
 monitor/qmp.c                 |   3 +
 qapi/char.json                |   5 +-
 qapi/cpr.json                 |  76 +++++++++++++++
 qapi/meson.build              |   1 +
 qapi/qapi-schema.json         |   1 +
 qemu-options.hx               |  39 +++++++-
 softmmu/globals.c             |   2 +
 softmmu/memory.c              |  48 ++++++++++
 softmmu/physmem.c             |  49 ++++++++--
 softmmu/runstate.c            |  49 +++++++++-
 softmmu/vl.c                  |  21 ++++-
 stubs/cpr.c                   |   3 +
 stubs/meson.build             |   1 +
 trace-events                  |   1 +
 util/env.c                    |  99 +++++++++++++++++++
 util/meson.build              |   1 +
 util/oslib-posix.c            |   9 ++
 util/oslib-win32.c            |   4 +
 util/qemu-config.c            |   4 +
 62 files changed, 1431 insertions(+), 74 deletions(-)
 create mode 100644 hw/vfio/cpr.c
 create mode 100644 include/migration/cpr.h
 create mode 100644 include/qemu/env.h
 create mode 100644 migration/cpr.c
 create mode 100644 qapi/cpr.json
 create mode 100644 stubs/cpr.c
 create mode 100644 util/env.c

-- 
1.8.3.1


Re: [PATCH V3 00/22] Live Update
Posted by no-reply@patchew.org 2 years, 11 months ago
Patchew URL: https://patchew.org/QEMU/1620390320-301716-1-git-send-email-steven.sistare@oracle.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 1620390320-301716-1-git-send-email-steven.sistare@oracle.com
Subject: [PATCH V3 00/22] Live Update

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]         patchew/1620390320-301716-1-git-send-email-steven.sistare@oracle.com -> patchew/1620390320-301716-1-git-send-email-steven.sistare@oracle.com
 - [tag update]      patchew/20210504124140.1100346-1-linux@roeck-us.net -> patchew/20210504124140.1100346-1-linux@roeck-us.net
 - [tag update]      patchew/20210506185641.284821-1-dgilbert@redhat.com -> patchew/20210506185641.284821-1-dgilbert@redhat.com
 - [tag update]      patchew/20210506193341.140141-1-lvivier@redhat.com -> patchew/20210506193341.140141-1-lvivier@redhat.com
 - [tag update]      patchew/20210506194358.3925-1-peter.maydell@linaro.org -> patchew/20210506194358.3925-1-peter.maydell@linaro.org
Switched to a new branch 'test'
8c778e6 simplify savevm
aca4f09 cpr: maintainers
697f8d0 cpr: only-cpr-capable option
0a8c20e chardev: cpr for sockets
cb270f4 chardev: cpr for pty
279230e chardev: cpr for simple devices
b122cfa chardev: cpr framework
6596676 hostmem-memfd: cpr support
8cb6348 vhost: reset vhost devices upon cprsave
e3ae86d vfio-pci: cpr part 2
02c628d vfio-pci: cpr part 1
d93623c vfio-pci: refactor for cpr
bc63b3e pci: export functions for cpr
2b10bdd cpr: HMP interfaces
29bc20a cpr: QMP interfaces
3f84e6c cpr
0a32588 vl: add helper to request re-exec
466b4cf machine: memfd-alloc option
50c3e84 util: env var helpers
76c3550 oslib: qemu_clr_cloexec
d819bd4 qemu_ram_volatile
c466ddf as_flat_walk

=== OUTPUT BEGIN ===
1/22 Checking commit c466ddfd2209 (as_flat_walk)
2/22 Checking commit d819bd4dcc09 (qemu_ram_volatile)
3/22 Checking commit 76c3550a677b (oslib: qemu_clr_cloexec)
4/22 Checking commit 50c3e84cf5a6 (util: env var helpers)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#19: 
new file mode 100644

ERROR: consider using qemu_strtol in preference to strtol
#72: FILE: util/env.c:20:
+        res = strtol(val, 0, 10);

total: 1 errors, 1 warnings, 129 lines checked

Patch 4/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

5/22 Checking commit 466b4cf4ce8c (machine: memfd-alloc option)
6/22 Checking commit 0a32588de76e (vl: add helper to request re-exec)
7/22 Checking commit 3f84e6c38bd6 (cpr)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#55: 
new file mode 100644

total: 0 errors, 1 warnings, 314 lines checked

Patch 7/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
8/22 Checking commit 29bc20ab5870 (cpr: QMP interfaces)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#81: 
new file mode 100644

total: 0 errors, 1 warnings, 136 lines checked

Patch 8/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
9/22 Checking commit 2b10bdd5edb3 (cpr: HMP interfaces)
10/22 Checking commit bc63b3edc621 (pci: export functions for cpr)
11/22 Checking commit d93623c4da4d (vfio-pci: refactor for cpr)
12/22 Checking commit 02c628d50b57 (vfio-pci: cpr part 1)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#271: 
new file mode 100644

total: 0 errors, 1 warnings, 566 lines checked

Patch 12/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
13/22 Checking commit e3ae86d2076c (vfio-pci: cpr part 2)
14/22 Checking commit 8cb6348c8cff (vhost: reset vhost devices upon cprsave)
15/22 Checking commit 65966769fa93 (hostmem-memfd: cpr support)
16/22 Checking commit b122cfa96106 (chardev: cpr framework)
17/22 Checking commit 279230e03a78 (chardev: cpr for simple devices)
18/22 Checking commit cb270f49693f (chardev: cpr for pty)
19/22 Checking commit 0a8c20e0a8d4 (chardev: cpr for sockets)
20/22 Checking commit 697f8d021f43 (cpr: only-cpr-capable option)
Use of uninitialized value $acpi_testexpected in string eq at ./scripts/checkpatch.pl line 1529.
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#200: 
new file mode 100644

total: 0 errors, 1 warnings, 133 lines checked

Patch 20/22 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
21/22 Checking commit aca4f092c865 (cpr: maintainers)
22/22 Checking commit 8c778e6f284c (simplify savevm)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/1620390320-301716-1-git-send-email-steven.sistare@oracle.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com
Re: [PATCH V3 00/22] Live Update
Posted by Steven Sistare 2 years, 11 months ago
I will add to MAINTAINERS incrementally instead of at the end to make checkpatch happy.

I will use qemu_strtol even though I thought the message "consider using qemu_strtol"
was giving me a choice.

You can't fight The Man when the man is a robot.

- Steve

On 5/7/2021 9:00 AM, no-reply@patchew.org wrote:
> Patchew URL: https://patchew.org/QEMU/1620390320-301716-1-git-send-email-steven.sistare@oracle.com/
> 
> Hi,
> 
> This series seems to have some coding style problems. See output below for
> more information:
> 
> [...]
> 

Re: [PATCH V3 00/22] Live Update
Posted by Steven Sistare 2 years, 11 months ago
Hi Michael, Marcel,
  I hope you have time to review the pci and vfio-pci related patches in this
series.  They are an essential part of the live update functionality.  The
first 2 patches are straightforward, just exposing functions for use in vfio.
The last 2 patches are more substantial.

  - pci: export functions for cpr
  - vfio-pci: refactor for cpr
  - vfio-pci: cpr part 1
  - vfio-pci: cpr part 2

- Steve

On 5/7/2021 8:24 AM, Steve Sistare wrote:
> Provide the cprsave and cprload commands for live update.  These save and
> restore VM state, with minimal guest pause time, so that qemu may be updated
> to a new version in between.
> 
> [...]
> 

Re: [PATCH V3 00/22] Live Update
Posted by Steven Sistare 2 years, 10 months ago
Hi Michael,
  Alex has reviewed the vfio-pci patches.  If you could give me a thumbs-up or a 
needs-work on "pci: export functions for cpr", I would appreciate it.  Thanks!

[PATCH V3 10/22] pci: export functions for cpr
https://lore.kernel.org/qemu-devel/1620390320-301716-11-git-send-email-steven.sistare@oracle.com

- Steve

On 5/19/2021 12:43 PM, Steven Sistare wrote:
> Hi Michael, Marcel,
>   I hope you have time to review the pci and vfio-pci related patches in this
> series.  They are an essential part of the live update functionality.  The
> first 2 patches are straightforward, just exposing functions for use in vfio.
> The last 2 patches are more substantial.
> 
>   - pci: export functions for cpr
>   - vfio-pci: refactor for cpr
>   - vfio-pci: cpr part 1
>   - vfio-pci: cpr part 2
> 
> - Steve
> 
> On 5/7/2021 8:24 AM, Steve Sistare wrote:
>> Provide the cprsave and cprload commands for live update.  These save and
>> restore VM state, with minimal guest pause time, so that qemu may be updated
>> to a new version in between.
>>
>> [...]
>>

Re: [PATCH V3 00/22] Live Update
Posted by Stefan Hajnoczi 2 years, 11 months ago
On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> Provide the cprsave and cprload commands for live update.  These save and
> restore VM state, with minimal guest pause time, so that qemu may be updated
> to a new version in between.
> 
> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> paused state and waits for the cprload command.

I think cprsave/cprload could be generalized by using QMP to stash the
file descriptors. The 'getfd' QMP command already exists and QEMU code
already opens fds passed using this mechanism.

I haven't checked but it may be possible to drop some patches by reusing
QEMU's monitor file descriptor passing since the code already knows how
to open from 'getfd' fds.

The reason why using QMP is interesting is because it eliminates the
need for execve(2). QEMU may be unable to execute a program due to
chroot, seccomp, etc.

QMP would enable cprsave/cprload to work both with and without
execve(2).
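
For example, after the client passes an fd over the QMP unix socket
with SCM_RIGHTS, something like:

  { "execute": "getfd", "arguments": { "fdname": "ram" } }

names the fd, and code paths that call monitor_fd_param() can then
open "ram" by name (the fd name here is just an example).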

One tricky thing with this approach might be startup ordering: how to
get fds via the QMP monitor in the new process before processing the
entire command-line.

Stefan
Re: [PATCH V3 00/22] Live Update
Posted by Steven Sistare 2 years, 11 months ago
On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>> Provide the cprsave and cprload commands for live update.  These save and
>> restore VM state, with minimal guest pause time, so that qemu may be updated
>> to a new version in between.
>>
>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>> paused state and waits for the cprload command.
> 
> I think cprsave/cprload could be generalized by using QMP to stash the
> file descriptors. The 'getfd' QMP command already exists and QEMU code
> already opens fds passed using this mechanism.
> 
> I haven't checked but it may be possible to drop some patches by reusing
> QEMU's monitor file descriptor passing since the code already knows how
> to open from 'getfd' fds.
> 
> The reason why using QMP is interesting is because it eliminates the
> need for execve(2). QEMU may be unable to execute a program due to
> chroot, seccomp, etc.
> 
> QMP would enable cprsave/cprload to work both with and without
> execve(2).
> 
> One tricky thing with this approach might be startup ordering: how to
> get fds via the QMP monitor in the new process before processing the
> entire command-line.

Early on I experimented with a similar approach.  Old qemu passed descriptors to an
escrow process and exited; new qemu started and retrieved the descriptors from escrow.
vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
I suspect my recent vfio extensions would smooth the rough edges.

However, the main issue is that guest ram must be backed by named shared memory, and
we would need to add code to support shared memory for all the secondary memory objects.
That makes it less interesting for us at this time; we care about updating legacy qemu 
instances with anonymous guest memory.

Having said all that, this would be an interesting project, just not the one I want to 
push now.  In the future we could add a new cprsave mode to support it in a backward
compatible manner.

- Steve

Re: [PATCH V3 00/22] Live Update
Posted by Stefan Hajnoczi 2 years, 11 months ago
On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> > On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >> Provide the cprsave and cprload commands for live update.  These save and
> >> restore VM state, with minimal guest pause time, so that qemu may be updated
> >> to a new version in between.
> >>
> >> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> >> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> >> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> >> paused state and waits for the cprload command.
> > 
> > I think cprsave/cprload could be generalized by using QMP to stash the
> > file descriptors. The 'getfd' QMP command already exists and QEMU code
> > already opens fds passed using this mechanism.
> > 
> > I haven't checked but it may be possible to drop some patches by reusing
> > QEMU's monitor file descriptor passing since the code already knows how
> > to open from 'getfd' fds.
> > 
> > The reason why using QMP is interesting is because it eliminates the
> > need for execve(2). QEMU may be unable to execute a program due to
> > chroot, seccomp, etc.
> > 
> > QMP would enable cprsave/cprload to work both with and without
> > execve(2).
> > 
> > One tricky thing with this approach might be startup ordering: how to
> > get fds via the QMP monitor in the new process before processing the
> > entire command-line.
> 
> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> I suspect my recent vfio extensions would smooth the rough edges.

I wonder about the reason for VFIO's pid limitation, maybe because it
pins pages from the original process?

Is this VFIO pid limitation the main reason why you chose to make QEMU
execve(2) the new binary?

> However, the main issue is that guest ram must be backed by named shared memory, and
> we would need to add code to support shared memory for all the secondary memory objects.
> That makes it less interesting for us at this time; we care about updating legacy qemu 
> instances with anonymous guest memory.

Thanks for explaining this more in the other sub-thread. The secondary
memory objects you mentioned are relatively small so I don't think
saving them in the traditional way is a problem.

Two approaches for zero-copy memory migration fit into QEMU's existing
migration infrastructure:

- Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
  etc) so they are not saved into the savevm file. The existing --object
  memory-backend-file syntax can be used.

- Extending the live migration protocol to detect when file descriptor
  passing is available (i.e. UNIX domain socket migration) and using
  that for memory-backend-* objects that have fds.

Either of these approaches would handle RAM with existing savevm/migrate
commands.
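
For example (paths and sizes illustrative):

  qemu-system-x86_64 ... \
    -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/vm0,share=on \
    -machine memory-backend=ram0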

The remaining issue is how to migrate VFIO and other file descriptors
that cannot be reopened by the new process. As mentioned, QEMU already
has file descriptor passing support in the QMP monitor and support for
opening passed file descriptors (see qemu_open_internal(),
monitor_fd_param(), and socket_get_fd()).

The advantage of integrating live update functionality into the existing
savevm/migrate commands is that it will work in more use cases with
less code duplication/maintenance/bitrot prevention than the
special-case cprsave command in this patch series.

Maybe there is a fundamental technical reason why live update needs to
be different from QEMU's existing migration commands but I haven't
figured it out yet.

Stefan
Re: [PATCH V3 00/22] Live Update
Posted by Steven Sistare 2 years, 11 months ago
On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>>>> Provide the cprsave and cprload commands for live update.  These save and
>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
>>>> to a new version in between.
>>>>
>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>>>> paused state and waits for the cprload command.
>>>
>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>> already opens fds passed using this mechanism.
>>>
>>> I haven't checked but it may be possible to drop some patches by reusing
>>> QEMU's monitor file descriptor passing since the code already knows how
>>> to open from 'getfd' fds.
>>>
>>> The reason why using QMP is interesting is because it eliminates the
>>> need for execve(2). QEMU may be unable to execute a program due to
>>> chroot, seccomp, etc.
>>>
>>> QMP would enable cprsave/cprload to work both with and without
>>> execve(2).
>>>
>>> One tricky thing with this approach might be startup ordering: how to
>>> get fds via the QMP monitor in the new process before processing the
>>> entire command-line.
>>
>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
>> I suspect my recent vfio extensions would smooth the rough edges.
> 
> I wonder about the reason for VFIO's pid limitation, maybe because it
> pins pages from the original process?

The dma unmap code verifies that the requesting task is the same as the task that mapped
the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
to fix locked memory accounting, which is associated with the mm of the original task.

> Is this VFIO pid limitation the main reason why you chose to make QEMU
> execve(2) the new binary?

That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
I was working against vfio rather than with it.

Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
without exec, I still need the exec option.

>> However, the main issue is that guest ram must be backed by named shared memory, and
>> we would need to add code to support shared memory for all the secondary memory objects.
>> That makes it less interesting for us at this time; we care about updating legacy qemu 
>> instances with anonymous guest memory.
> 
> Thanks for explaining this more in the other sub-thread. The secondary
> memory objects you mentioned are relatively small so I don't think
> saving them in the traditional way is a problem.
> 
> Two approaches for zero-copy memory migration fit into QEMU's existing
> migration infrastructure:
> 
> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>   etc) so they are not saved into the savevm file. The existing --object
>   memory-backend-file syntax can be used.
> 
> - Extending the live migration protocol to detect when file descriptor
>   passing is available (i.e. UNIX domain socket migration) and using
>   that for memory-backend-* objects that have fds.
> 
> Either of these approaches would handle RAM with existing savevm/migrate
> commands.

Yes, but the vfio issues would still need to be solved, and we would need new
command line options to back existing and future secondary memory objects with 
named shared memory.
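
For main guest RAM the existing syntax suffices, roughly:

    -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on \
    -machine memory-backend=ram0

but an equivalent knob would be needed for each of the other objects.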

> The remaining issue is how to migrate VFIO and other file descriptors
> that cannot be reopened by the new process. As mentioned, QEMU already
> has file descriptor passing support in the QMP monitor and support for
> opening passed file descriptors (see qemu_open_internal(),
> monitor_fd_param(), and socket_get_fd()).
> 
> The advantage of integrating live update functionality into the existing
> savevm/migrate commands is that it will work in more use cases with
> less code duplication and maintenance, and less bitrot, than the
> special-case cprsave command in this patch series.
> 
> Maybe there is a fundamental technical reason why live update needs to
> be different from QEMU's existing migration commands but I haven't
> figured it out yet.

vfio and anonymous memory.

Regarding code duplication, I did consider whether to extend the migration
syntax and implementation versus creating something new.  Those functions
handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
use case, and the cpr functions handle state that is n/a for the migration case.
I judged that handling both in the same functions would be less readable and
maintainable.  After feedback during the V1 review, I simplified the cprsave
code by calling qemu_save_device_state, as Xen does, thus eliminating any
interaction with the migration code.
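
In sketch form the simplified save path is roughly the following; this is
not the literal patch, and it assumes the usual QIOChannel plumbing:

    #include "qemu/osdep.h"
    #include "io/channel-file.h"
    #include "migration/qemu-file-channel.h"
    #include "migration/qemu-file.h"
    #include "migration/savevm.h"
    #include "sysemu/runstate.h"

    static int cpr_save_state(const char *file, Error **errp)
    {
        QIOChannelFile *ioc;
        QEMUFile *f;
        int ret;

        vm_stop(RUN_STATE_SAVE_VM);              /* pause the guest */
        ioc = qio_channel_file_new_path(file, O_CREAT | O_WRONLY | O_TRUNC,
                                        0600, errp);
        if (!ioc) {
            return -1;
        }
        f = qemu_fopen_channel_output(QIO_CHANNEL(ioc));
        object_unref(OBJECT(ioc));               /* f holds a reference */
        ret = qemu_save_device_state(f);         /* device state only, no RAM */
        qemu_fclose(f);
        return ret;
    }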

Regarding bit rot, I still need to add a cpr test to the test suite, when the 
review is more complete and folks agree on the final form of the functionality.

I do like the idea of supporting update without exec, but as a future project, 
and not at the expense of dropping update with exec.

- Steve

Re: [PATCH V3 00/22] Live Update
Posted by Stefan Hajnoczi 2 years, 11 months ago
On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> > On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> >> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> >>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >>>> Provide the cprsave and cprload commands for live update.  These save and
> >>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> >>>> to a new version in between.
> >>>>
> >>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> >>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> >>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> >>>> paused state and waits for the cprload command.
> >>>
> >>> I think cprsave/cprload could be generalized by using QMP to stash the
> >>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> >>> already opens fds passed using this mechanism.
> >>>
> >>> I haven't checked but it may be possible to drop some patches by reusing
> >>> QEMU's monitor file descriptor passing since the code already knows how
> >>> to open from 'getfd' fds.
> >>>
> >>> The reason why using QMP is interesting is because it eliminates the
> >>> need for execve(2). QEMU may be unable to execute a program due to
> >>> chroot, seccomp, etc.
> >>>
> >>> QMP would enable cprsave/cprload to work both with and without
> >>> execve(2).
> >>>
> >>> One tricky thing with this approach might be startup ordering: how to
> >>> get fds via the QMP monitor in the new process before processing the
> >>> entire command-line.
> >>
> >> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> >> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> >> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> >> I suspect my recent vfio extensions would smooth the rough edges.
> > 
> > I wonder about the reason for VFIO's pid limitation, maybe because it
> > pins pages from the original process?
> 
> The dma unmap code verifies that the requesting task is the same as the task that mapped
> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> to fix locked memory accounting, which is associated with the mm of the original task.
> 
> > Is this VFIO pid limitation the main reason why you chose to make QEMU
> > execve(2) the new binary?
> 
> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> I was working against vfio rather than with it.
> 
> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> without exec, I still need the exec option.
> 
> >> However, the main issue is that guest ram must be backed by named shared memory, and
> >> we would need to add code to support shared memory for all the secondary memory objects.
> >> That makes it less interesting for us at this time; we care about updating legacy qemu 
> >> instances with anonymous guest memory.
> > 
> > Thanks for explaining this more in the other sub-thread. The secondary
> > memory objects you mentioned are relatively small so I don't think
> > saving them in the traditional way is a problem.
> > 
> > Two approaches for zero-copy memory migration fit into QEMU's existing
> > migration infrastructure:
> > 
> > - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> >   etc) so they are not saved into the savevm file. The existing --object
> >   memory-backend-file syntax can be used.
> > 
> > - Extending the live migration protocol to detect when file descriptor
> >   passing is available (i.e. UNIX domain socket migration) and using
> >   that for memory-backend-* objects that have fds.
> > 
> > Either of these approaches would handle RAM with existing savevm/migrate
> > commands.
> 
> Yes, but the vfio issues would still need to be solved, and we would need new
> command line options to back existing and future secondary memory objects with 
> named shared memory.
> 
> > The remaining issue is how to migrate VFIO and other file descriptors
> > that cannot be reopened by the new process. As mentioned, QEMU already
> > has file descriptor passing support in the QMP monitor and support for
> > opening passed file descriptors (see qemu_open_internal(),
> > monitor_fd_param(), and socket_get_fd()).
> > 
> > The advantage of integrating live update functionality into the existing
> > savevm/migrate commands is that it will work in more use cases with
> > less code duplication and maintenance, and less bitrot, than the
> > special-case cprsave command in this patch series.
> > 
> > Maybe there is a fundamental technical reason why live update needs to
> > be different from QEMU's existing migration commands but I haven't
> > figured it out yet.
> 
> vfio and anonymous memory.
> 
> Regarding code duplication, I did consider whether to extend the migration
> syntax and implementation versus creating something new.  Those functions
> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
> use case, and the cpr functions handle state that is n/a for the migration case.
> I judged that handling both in the same functions would be less readable and
> maintainable.  After feedback during the V1 review, I simplified the cprsave
> code by calling qemu_save_device_state, as Xen does, thus eliminating any
> interaction with the migration code.
> 
> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
> review is more complete and folks agree on the final form of the functionality.
> 
> I do like the idea of supporting update without exec, but as a future project, 
> and not at the expense of dropping update with exec.

Alex: We're discussing how to live update QEMU while VFIO devices are
running. This patch series introduces monitor commands that call
execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
file descriptors. This way live update is transparent to VFIO but it
won't work if a sandboxed QEMU process is forbidden to call execve(2).
What are your thoughts on 1) the execve(2) approach and 2) extending
VFIO to allow running devices to be attached to a different process so
that execve(2) is not necessary?

Steven: Do you know if cpr will work with Intel's upcoming Shared
Virtual Addressing? I'm worried that execve(2) may be a short-term
solution that works around VFIO's current limitations but even execve(2)
may stop working in the future as IOMMUs and DMA approaches change.

Stefan
Re: [PATCH V3 00/22] Live Update
Posted by Alex Williamson 2 years, 11 months ago
On Mon, 17 May 2021 12:40:43 +0100
Stefan Hajnoczi <stefanha@redhat.com> wrote:

> On Fri, May 14, 2021 at 11:15:18AM -0400, Steven Sistare wrote:
> > On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:  
> > > On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:  
> > >> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:  
> > >>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:  
> > >>>> Provide the cprsave and cprload commands for live update.  These save and
> > >>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> > >>>> to a new version in between.
> > >>>>
> > >>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> > >>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> > >>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> > >>>> paused state and waits for the cprload command.  
> > >>>
> > >>> I think cprsave/cprload could be generalized by using QMP to stash the
> > >>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> > >>> already opens fds passed using this mechanism.
> > >>>
> > >>> I haven't checked but it may be possible to drop some patches by reusing
> > >>> QEMU's monitor file descriptor passing since the code already knows how
> > >>> to open from 'getfd' fds.
> > >>>
> > >>> The reason why using QMP is interesting is because it eliminates the
> > >>> need for execve(2). QEMU may be unable to execute a program due to
> > >>> chroot, seccomp, etc.
> > >>>
> > >>> QMP would enable cprsave/cprload to work both with and without
> > >>> execve(2).
> > >>>
> > >>> One tricky thing with this approach might be startup ordering: how to
> > >>> get fds via the QMP monitor in the new process before processing the
> > >>> entire command-line.  
> > >>
> > >> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> > >> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> > >> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> > >> I suspect my recent vfio extensions would smooth the rough edges.  
> > > 
> > > I wonder about the reason for VFIO's pid limitation, maybe because it
> > > pins pages from the original process?  
> > 
> > The dma unmap code verifies that the requesting task is the same as the task that mapped
> > the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> > to fix locked memory accounting, which is associated with the mm of the original task.
> >   
> > > Is this VFIO pid limitation the main reason why you chose to make QEMU
> > > execve(2) the new binary?  
> > 
> > That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> > errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> > but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> > diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> > I was working against vfio rather than with it.
> > 
> > Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> > code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> > without exec, I still need the exec option.
> >   
> > >> However, the main issue is that guest ram must be backed by named shared memory, and
> > >> we would need to add code to support shared memory for all the secondary memory objects.
> > >> That makes it less interesting for us at this time; we care about updating legacy qemu 
> > >> instances with anonymous guest memory.  
> > > 
> > > Thanks for explaining this more in the other sub-thread. The secondary
> > > memory objects you mentioned are relatively small so I don't think
> > > saving them in the traditional way is a problem.
> > > 
> > > Two approaches for zero-copy memory migration fit into QEMU's existing
> > > migration infrastructure:
> > > 
> > > - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> > >   etc) so they are not saved into the savevm file. The existing --object
> > >   memory-backend-file syntax can be used.
> > > 
> > > - Extending the live migration protocol to detect when file descriptor
> > >   passing is available (i.e. UNIX domain socket migration) and using
> > >   that for memory-backend-* objects that have fds.
> > > 
> > > Either of these approaches would handle RAM with existing savevm/migrate
> > > commands.  
> > 
> > Yes, but the vfio issues would still need to be solved, and we would need new
> > command line options to back existing and future secondary memory objects with 
> > named shared memory.
> >   
> > > The remaining issue is how to migrate VFIO and other file descriptors
> > > that cannot be reopened by the new process. As mentioned, QEMU already
> > > has file descriptor passing support in the QMP monitor and support for
> > > opening passed file descriptors (see qemu_open_internal(),
> > > monitor_fd_param(), and socket_get_fd()).
> > > 
> > > The advantage of integrating live update functionality into the existing
> > > savevm/migrate commands is that it will work in more use cases with
> > > less code duplication and maintenance, and less bitrot, than the
> > > special-case cprsave command in this patch series.
> > > 
> > > Maybe there is a fundamental technical reason why live update needs to
> > > be different from QEMU's existing migration commands but I haven't
> > > figured it out yet.  
> > 
> > vfio and anonymous memory.
> > 
> > Regarding code duplication, I did consider whether to extend the migration
> > syntax and implementation versus creating something new.  Those functions
> > handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
> > use case, and the cpr functions handle state that is n/a for the migration case.
> > I judged that handling both in the same functions would be less readable and
> > maintainable.  After feedback during the V1 review, I simplified the cprsave
> > code by calling qemu_save_device_state, as Xen does, thus eliminating any
> > interaction with the migration code.
> > 
> > Regarding bit rot, I still need to add a cpr test to the test suite, when the 
> > review is more complete and folks agree on the final form of the functionality.
> > 
> > I do like the idea of supporting update without exec, but as a future project, 
> > and not at the expense of dropping update with exec.  
> 
> Alex: We're discussing how to live update QEMU while VFIO devices are
> running. This patch series introduces monitor commands that call
> execve(2) to run the new QEMU binary and inherit the memory/vfio/etc
> file descriptors. This way live update is transparent to VFIO but it
> won't work if a sandboxed QEMU process is forbidden to call execve(2).
> What are your thoughts on 1) the execve(2) approach and 2) extending
> VFIO to allow running devices to be attached to a different process so
> that execve(2) is not necessary?

Tracking processes is largely to support page pinning; we need to
support asynchronous page pinning to handle requests from mdev drivers,
and we need to make sure pinned page accounting is tracked to the same
process.  If userspace can "pay" for locked pages from one process on
mapping, then "credit" them to another process on unmap, that seems
fairly exploitable.  We'd need some way to transfer the locked memory
accounting or handle it outside of vfio.  Thanks,

Alex


Re: [PATCH V3 00/22] Live Update
Posted by Dr. David Alan Gilbert 2 years, 11 months ago
* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> > On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> >> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> >>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >>>> Provide the cprsave and cprload commands for live update.  These save and
> >>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> >>>> to a new version in between.
> >>>>
> >>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> >>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> >>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> >>>> paused state and waits for the cprload command.
> >>>
> >>> I think cprsave/cprload could be generalized by using QMP to stash the
> >>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> >>> already opens fds passed using this mechanism.
> >>>
> >>> I haven't checked but it may be possible to drop some patches by reusing
> >>> QEMU's monitor file descriptor passing since the code already knows how
> >>> to open from 'getfd' fds.
> >>>
> >>> The reason why using QMP is interesting is because it eliminates the
> >>> need for execve(2). QEMU may be unable to execute a program due to
> >>> chroot, seccomp, etc.
> >>>
> >>> QMP would enable cprsave/cprload to work both with and without
> >>> execve(2).
> >>>
> >>> One tricky thing with this approach might be startup ordering: how to
> >>> get fds via the QMP monitor in the new process before processing the
> >>> entire command-line.
> >>
> >> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> >> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> >> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> >> I suspect my recent vfio extensions would smooth the rough edges.
> > 
> > I wonder about the reason for VFIO's pid limitation, maybe because it
> > pins pages from the original process?
> 
> The dma unmap code verifies that the requesting task is the same as the task that mapped
> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> to fix locked memory accounting, which is associated with the mm of the original task.

> > Is this VFIO pid limitation the main reason why you chose to make QEMU
> > execve(2) the new binary?
> 
> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> I was working against vfio rather than with it.

OK the weirdness of vfio helps explain a bit about why you're doing it
this way; can you help separate some difference between restart and
reboot for me though:

In 'reboot' mode, where the guest must do suspend in its drivers, how
many of these vfio requirements are needed?  I guess the memfd use
for the anonymous areas isn't any use for reboot mode.

You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
vfio still care about the currently-anonymous areas?

> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> without exec, I still need the exec option.

Can you explain what that code injection mechanism is for those of us
who didn't see that?

Dave

> >> However, the main issue is that guest ram must be backed by named shared memory, and
> >> we would need to add code to support shared memory for all the secondary memory objects.
> >> That makes it less interesting for us at this time; we care about updating legacy qemu 
> >> instances with anonymous guest memory.
> > 
> > Thanks for explaining this more in the other sub-thread. The secondary
> > memory objects you mentioned are relatively small so I don't think
> > saving them in the traditional way is a problem.
> > 
> > Two approaches for zero-copy memory migration fit into QEMU's existing
> > migration infrastructure:
> > 
> > - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> >   etc) so they are not saved into the savevm file. The existing --object
> >   memory-backend-file syntax can be used.
> > 
> > - Extending the live migration protocol to detect when file descriptor
> >   passing is available (i.e. UNIX domain socket migration) and using
> >   that for memory-backend-* objects that have fds.
> > 
> > Either of these approaches would handle RAM with existing savevm/migrate
> > commands.
> 
> Yes, but the vfio issues would still need to be solved, and we would need new
> command line options to back existing and future secondary memory objects with 
> named shared memory.
> 
> > The remaining issue is how to migrate VFIO and other file descriptors
> > that cannot be reopened by the new process. As mentioned, QEMU already
> > has file descriptor passing support in the QMP monitor and support for
> > opening passed file descriptors (see qemu_open_internal(),
> > monitor_fd_param(), and socket_get_fd()).
> > 
> > The advantage of integrating live update functionality into the existing
> > savevm/migrate commands is that it will work in more use cases with
> > less code duplication and maintenance, and less bitrot, than the
> > special-case cprsave command in this patch series.
> > 
> > Maybe there is a fundamental technical reason why live update needs to
> > be different from QEMU's existing migration commands but I haven't
> > figured it out yet.
> 
> vfio and anonymous memory.
> 
> Regarding code duplication, I did consider whether to extend the migration
> syntax and implementation versus creating something new.  Those functions
> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
> use case, and the cpr functions handle state that is n/a for the migration case.
> I judged that handling both in the same functions would be less readable and
> maintainable.  After feedback during the V1 review, I simplified the cprsave
> code by calling qemu_save_device_state, as Xen does, thus eliminating any
> interaction with the migration code.
> 
> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
> review is more complete and folks agree on the final form of the functionality.
> 
> I do like the idea of supporting update without exec, but as a future project, 
> and not at the expense of dropping update with exec.
> 
> - Steve
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


Re: [PATCH V3 00/22] Live Update
Posted by Steven Sistare 2 years, 11 months ago
On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>>>>>> Provide the cprsave and cprload commands for live update.  These save and
>>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
>>>>>> to a new version in between.
>>>>>>
>>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>>>>>> paused state and waits for the cprload command.
>>>>>
>>>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>>>> already opens fds passed using this mechanism.
>>>>>
>>>>> I haven't checked but it may be possible to drop some patches by reusing
>>>>> QEMU's monitor file descriptor passing since the code already knows how
>>>>> to open from 'getfd' fds.
>>>>>
>>>>> The reason why using QMP is interesting is because it eliminates the
>>>>> need for execve(2). QEMU may be unable to execute a program due to
>>>>> chroot, seccomp, etc.
>>>>>
>>>>> QMP would enable cprsave/cprload to work both with and without
>>>>> execve(2).
>>>>>
>>>>> One tricky thing with this approach might be startup ordering: how to
>>>>> get fds via the QMP monitor in the new process before processing the
>>>>> entire command-line.
>>>>
>>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
>>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
>>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
>>>> I suspect my recent vfio extensions would smooth the rough edges.
>>>
>>> I wonder about the reason for VFIO's pid limitation, maybe because it
>>> pins pages from the original process?
>>
>> The dma unmap code verifies that the requesting task is the same as the task that mapped
>> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
>> to fix locked memory accounting, which is associated with the mm of the original task.
> 
>>> Is this VFIO pid limitation the main reason why you chose to make QEMU
>>> execve(2) the new binary?
>>
>> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
>> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
>> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
>> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
>> I was working against vfio rather than with it.
> 
> OK the weirdness of vfio helps explain a bit about why you're doing it
> this way; can you help separate some difference between restart and
> reboot for me though:
> 
> In 'reboot' mode, where the guest must do suspend in its drivers, how
> many of these vfio requirements are needed?  I guess the memfd use
> for the anonymous areas isn't any use for reboot mode.

Correct.  For reboot no special vfio support or fiddling is needed.

> You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
> vfio still care about the currently-anonymous areas?

Yes, for restart mode.  The physical pages behind the anonymous memory remain pinned and 
are targets for ongoing DMA.  Post-exec qemu needs a way to find those same pages.
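
A minimal sketch of the idea (fragments, not the series' code; assumes
fd and ram_size are carried across the exec, eg via the environment):

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    /* old qemu, at startup: back guest RAM with a memfd and leave
     * MFD_CLOEXEC off, so the fd survives the exec */
    int fd = memfd_create("pc.ram", 0);
    ftruncate(fd, ram_size);
    void *hva = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    /* new qemu, after exec: map the inherited fd again; this lands on
     * the same physical pages that vfio pinned before the exec */
    void *new_hva = mmap(NULL, ram_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);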

>> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
>> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
>> without exec, I still need the exec option.
> 
> Can you explain what that code injection mechanism is for those of us
> who didn't see that?

Sure.  Here is slide 12 from the talk.  It relies on mmap(MADV_DOEXEC) which was not
accepted upstream.

-----------------------------------------------------------------------------
Legacy Live Update

 * Update legacy qemu process to latest version
   - Inject code into legacy qemu process to perform cprsave: vmsave.so
     . Access qemu data structures and globals
       - eg ram_list, savevm_state, chardevs, vhost_devices
       - dlopen does not resolve them, must get addresses via symbol lookup.
     . Delete some vmstate handlers, register new ones (eg vfio)
     . Call MADV_DOEXEC on guest memory. Find devices, preserve fd
 * Hot patch a monitor function to dlopen vmsave.so, call entry point
   - write patch to /proc/pid/mem
   - Call the monitor function via monitor socket
 * Send cprload to update qemu
 * vmsave.so has binary dependency on qemu data structures and variables
   - Build vmsave-ver.so per legacy version
   - Indexed by qemu's gcc build-id

-----------------------------------------------------------------------------
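
For the hot-patch step the mechanics are ordinary Linux process surgery;
a minimal sketch with illustrative names (addr is found by symbol lookup,
stub is the injected dlopen trampoline):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Overwrite len bytes of text at addr in process pid.  pwrite on
     * /proc/pid/mem bypasses page protections, like a debugger poke. */
    static int patch_text(pid_t pid, off_t addr, const void *stub, size_t len)
    {
        char path[64];
        int fd;
        ssize_t n;

        snprintf(path, sizeof(path), "/proc/%d/mem", pid);
        fd = open(path, O_RDWR);
        if (fd < 0) {
            return -1;
        }
        n = pwrite(fd, stub, len, addr);
        close(fd);
        return n == (ssize_t)len ? 0 : -1;
    }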

- Steve
 
>>>> However, the main issue is that guest ram must be backed by named shared memory, and
>>>> we would need to add code to support shared memory for all the secondary memory objects.
>>>> That makes it less interesting for us at this time; we care about updating legacy qemu 
>>>> instances with anonymous guest memory.
>>>
>>> Thanks for explaining this more in the other sub-thread. The secondary
>>> memory objects you mentioned are relatively small so I don't think
>>> saving them in the traditional way is a problem.
>>>
>>> Two approaches for zero-copy memory migration fit into QEMU's existing
>>> migration infrastructure:
>>>
>>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>>>   etc) so they are not saved into the savevm file. The existing --object
>>>   memory-backend-file syntax can be used.
>>>
>>> - Extending the live migration protocol to detect when file descriptor
>>>   passing is available (i.e. UNIX domain socket migration) and using
>>>   that for memory-backend-* objects that have fds.
>>>
>>> Either of these approaches would handle RAM with existing savevm/migrate
>>> commands.
>>
>> Yes, but the vfio issues would still need to be solved, and we would need new
>> command line options to back existing and future secondary memory objects with 
>> named shared memory.
>>
>>> The remaining issue is how to migrate VFIO and other file descriptors
>>> that cannot be reopened by the new process. As mentioned, QEMU already
>>> has file descriptor passing support in the QMP monitor and support for
>>> opening passed file descriptors (see qemu_open_internal(),
>>> monitor_fd_param(), and socket_get_fd()).
>>>
>>> The advantage of integrating live update functionality into the existing
>>> savevm/migrate commands is that it will work in more use cases with
>>> less code duplication and maintenance, and less bitrot, than the
>>> special-case cprsave command in this patch series.
>>>
>>> Maybe there is a fundamental technical reason why live update needs to
>>> be different from QEMU's existing migration commands but I haven't
>>> figured it out yet.
>>
>> vfio and anonymous memory.
>>
>> Regarding code duplication, I did consider whether to extend the migration
>> syntax and implementation versus creating something new.  Those functions
>> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
>> use case, and the cpr functions handle state that is n/a for the migration case.
>> I judged that handling both in the same functions would be less readable and
>> maintainable.  After feedback during the V1 review, I simplified the cprsave
>> code by calling qemu_save_device_state, as Xen does, thus eliminating any
>> interaction with the migration code.
>>
>> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
>> review is more complete and folks agree on the final form of the functionality.
>>
>> I do like the idea of supporting update without exec, but as a future project, 
>> and not at the expense of dropping update with exec.
>>
>> - Steve
>>

Re: [PATCH V3 00/22] Live Update
Posted by Dr. David Alan Gilbert 2 years, 11 months ago
* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:
> > * Steven Sistare (steven.sistare@oracle.com) wrote:
> >> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
> >>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
> >>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
> >>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
> >>>>>> Provide the cprsave and cprload commands for live update.  These save and
> >>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> >>>>>> to a new version in between.
> >>>>>>
> >>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> >>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> >>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> >>>>>> paused state and waits for the cprload command.
> >>>>>
> >>>>> I think cprsave/cprload could be generalized by using QMP to stash the
> >>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> >>>>> already opens fds passed using this mechanism.
> >>>>>
> >>>>> I haven't checked but it may be possible to drop some patches by reusing
> >>>>> QEMU's monitor file descriptor passing since the code already knows how
> >>>>> to open from 'getfd' fds.
> >>>>>
> >>>>> The reason why using QMP is interesting is because it eliminates the
> >>>>> need for execve(2). QEMU may be unable to execute a program due to
> >>>>> chroot, seccomp, etc.
> >>>>>
> >>>>> QMP would enable cprsave/cprload to work both with and without
> >>>>> execve(2).
> >>>>>
> >>>>> One tricky thing with this approach might be startup ordering: how to
> >>>>> get fds via the QMP monitor in the new process before processing the
> >>>>> entire command-line.
> >>>>
> >>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> >>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> >>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> >>>> I suspect my recent vfio extensions would smooth the rough edges.
> >>>
> >>> I wonder about the reason for VFIO's pid limitation, maybe because it
> >>> pins pages from the original process?
> >>
> >> The dma unmap code verifies that the requesting task is the same as the task that mapped
> >> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> >> to fix locked memory accounting, which is associated with the mm of the original task.
> > 
> >>> Is this VFIO pid limitation the main reason why you chose to make QEMU
> >>> execve(2) the new binary?
> >>
> >> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> >> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> >> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> >> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> >> I was working against vfio rather than with it.
> > 
> > OK the weirdness of vfio helps explain a bit about why you're doing it
> > this way; can you help separate some difference between restart and
> > reboot for me though:
> > 
> > In 'reboot' mode, where the guest must do suspend in its drivers, how
> > many of these vfio requirements are needed?  I guess the memfd use
> > for the anonymous areas isn't any use for reboot mode.
> 
> Correct.  For reboot no special vfio support or fiddling is needed.
> 
> > You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
> > vfio still care about the currently-anonymous areas?
> 
> Yes, for restart mode.  The physical pages behind the anonymous memory remain pinned and 
> are targets for ongoing DMA.  Post-exec qemu needs a way to find those same pages.

Is it possible with vfio to map it into multiple processes
simultaneously or does it have to only be one at a time?
Are you saying that you have no way to shut off DMA, and thus you can
never know it's safe to terminate the source process?

> >> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> >> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> >> without exec, I still need the exec option.
> > 
> > Can you explain what that code injection mechanism is for those of us
> > who didn't see that?
> 
> Sure.  Here is slide 12 from the talk.  It relies on mmap(MADV_DOEXEC) which was not
> accepted upstream.

In this series, without MADV_DOEXEC, how do you guarantee the same HVA
in source and destination - or is that not necessary?

> -----------------------------------------------------------------------------
> Legacy Live Update
> 
>  * Update legacy qemu process to latest version
>    - Inject code into legacy qemu process to perform cprsave: vmsave.so
>      . Access qemu data structures and globals
>        - eg ram_list, savevm_state, chardevs, vhost_devices
>        - dlopen does not resolve them, must get addresses via symbol lookup.
>      . Delete some vmstate handlers, register new ones (eg vfio)
>      . Call MADV_DOEXEC on guest memory. Find devices, preserve fd
>  * Hot patch a monitor function to dlopen vmsave.so, call entry point
>    - write patch to /proc/pid/mem
>    - Call the monitor function via monitor socket
>  * Send cprload to update qemu
>  * vmsave.so has binary dependency on qemu data structures and variables
>    - Build vmsave-ver.so per legacy version
>    - Indexed by qemu's gcc build-id
> 
> -----------------------------------------------------------------------------

That's hairy!
At that point isn't it easier to recompile a patched qemu against the
original sources and ptrace something in to mmap the new qemu?

Dave

> - Steve
>  
> >>>> However, the main issue is that guest ram must be backed by named shared memory, and
> >>>> we would need to add code to support shared memory for all the secondary memory objects.
> >>>> That makes it less interesting for us at this time; we care about updating legacy qemu 
> >>>> instances with anonymous guest memory.
> >>>
> >>> Thanks for explaining this more in the other sub-thread. The secondary
> >>> memory objects you mentioned are relatively small so I don't think
> >>> saving them in the traditional way is a problem.
> >>>
> >>> Two approaches for zero-copy memory migration fit into QEMU's existing
> >>> migration infrastructure:
> >>>
> >>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
> >>>   etc) so they are not saved into the savevm file. The existing --object
> >>>   memory-backend-file syntax can be used.
> >>>
> >>> - Extending the live migration protocol to detect when file descriptor
> >>>   passing is available (i.e. UNIX domain socket migration) and using
> >>>   that for memory-backend-* objects that have fds.
> >>>
> >>> Either of these approaches would handle RAM with existing savevm/migrate
> >>> commands.
> >>
> >> Yes, but the vfio issues would still need to be solved, and we would need new
> >> command line options to back existing and future secondary memory objects with 
> >> named shared memory.
> >>
> >>> The remaining issue is how to migrate VFIO and other file descriptors
> >>> that cannot be reopened by the new process. As mentioned, QEMU already
> >>> has file descriptor passing support in the QMP monitor and support for
> >>> opening passed file descriptors (see qemu_open_internal(),
> >>> monitor_fd_param(), and socket_get_fd()).
> >>>
> >>> The advantage of integrating live update functionality into the existing
> >>> savevm/migrate commands is that it will work in more use cases with
> > >>> less code duplication and maintenance, and less bitrot, than the
> >>> special-case cprsave command in this patch series.
> >>>
> >>> Maybe there is a fundamental technical reason why live update needs to
> >>> be different from QEMU's existing migration commands but I haven't
> >>> figured it out yet.
> >>
> >> vfio and anonymous memory.
> >>
> >> Regarding code duplication, I did consider whether to extend the migration
> >> syntax and implementation versus creating something new.  Those functions
> >> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
> >> use case, and the cpr functions handle state that is n/a for the migration case.
> >> I judged that handling both in the same functions would be less readable and
> >> maintainable.  After feedback during the V1 review, I simplified the cprsave
> >> code by calling qemu_save_device_state, as Xen does, thus eliminating any
> >> interaction with the migration code.
> >>
> >> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
> >> review is more complete and folks agree on the final form of the functionality.
> >>
> >> I do like the idea of supporting update without exec, but as a future project, 
> >> and not at the expense of dropping update with exec.
> >>
> >> - Steve
> >>
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


Re: [PATCH V3 00/22] Live Update
Posted by Alex Williamson 2 years, 11 months ago
On Tue, 18 May 2021 20:23:25 +0100
"Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

> * Steven Sistare (steven.sistare@oracle.com) wrote:
> > On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:  
> > > * Steven Sistare (steven.sistare@oracle.com) wrote:  
> > >> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:  
> > >>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:  
> > >>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:  
> > >>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:  
> > >>>>>> Provide the cprsave and cprload commands for live update.  These save and
> > >>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
> > >>>>>> to a new version in between.
> > >>>>>>
> > >>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
> > >>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
> > >>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
> > >>>>>> paused state and waits for the cprload command.  
> > >>>>>
> > >>>>> I think cprsave/cprload could be generalized by using QMP to stash the
> > >>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
> > >>>>> already opens fds passed using this mechanism.
> > >>>>>
> > >>>>> I haven't checked but it may be possible to drop some patches by reusing
> > >>>>> QEMU's monitor file descriptor passing since the code already knows how
> > >>>>> to open from 'getfd' fds.
> > >>>>>
> > >>>>> The reason why using QMP is interesting is because it eliminates the
> > >>>>> need for execve(2). QEMU may be unable to execute a program due to
> > >>>>> chroot, seccomp, etc.
> > >>>>>
> > >>>>> QMP would enable cprsave/cprload to work both with and without
> > >>>>> execve(2).
> > >>>>>
> > >>>>> One tricky thing with this approach might be startup ordering: how to
> > >>>>> get fds via the QMP monitor in the new process before processing the
> > >>>>> entire command-line.  
> > >>>>
> > >>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
> > >>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
> > >>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
> > >>>> I suspect my recent vfio extensions would smooth the rough edges.  
> > >>>
> > >>> I wonder about the reason for VFIO's pid limitation, maybe because it
> > >>> pins pages from the original process?  
> > >>
> > >> The dma unmap code verifies that the requesting task is the same as the task that mapped
> > >> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
> > >> to fix locked memory accounting, which is associated with the mm of the original task.  
> > >   
> > >>> Is this VFIO pid limitation the main reason why you chose to make QEMU
> > >>> execve(2) the new binary?  
> > >>
> > >> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
> > >> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
> > >> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
> > >> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
> > >> I was working against vfio rather than with it.  
> > > 
> > > OK the weirdness of vfio helps explain a bit about why you're doing it
> > > this way; can you help separate some difference between restart and
> > > reboot for me though:
> > > 
> > > In 'reboot' mode, where the guest must do suspend in its drivers, how
> > > many of these vfio requirements are needed?  I guess the memfd use
> > > for the anonymous areas isn't any use for reboot mode.  
> > 
> > Correct.  For reboot no special vfio support or fiddling is needed.
> >   
> > > You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
> > > vfio still care about the currently-anonymous areas?  
> > 
> > Yes, for restart mode.  The physical pages behind the anonymous memory remain pinned and 
> > are targets for ongoing DMA.  Post-exec qemu needs a way to find those same pages.  
> 
> Is it possible with vfio to map it into multiple processes
> simultaneously or does it have to only be one at a time?

The IOMMU maps an IOVA to a physical address; what Steve is saying is
that the mapping persists across the restart.  A given IOVA can only map to
a specific physical address, so mapping into multiple processes doesn't
make any sense.  The two processes need to map the same IOVA to the
same HPA, only the HVA is allowed to change.

> Are you saying that you have no way to shut off DMA, and thus you can
> never know it's safe to terminate the source process?

Stopping DMA, eg. disabling PCI bus master, would be not only visible
in the behavior of the device, but likely detrimental.  You'd need
driver or device participation to some extent to make this seamless.

> > >> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
> > >> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
> > >> without exec, I still need the exec option.  
> > > 
> > > Can you explain what that code injection mechanism is for those of us
> > > who didn't see that?  
> > 
> > Sure.  Here is slide 12 from the talk.  It relies on mmap(MADV_DOEXEC) which was not
> > accepted upstream.  
> 
> In this series, without MADV_DOEXEC, how do you guarantee the same HVA
> in source and destination - or is that not necessary?

It's not necessary, the HVA is used to establish the IOVA to HPA
mapping for the IOMMU.  We have patches upstream that suspend (block)
that translation for the window when the HVA is invalid and resume when
it becomes valid.  It's expected that the new HVA is equivalent to the
old HVA and that the user can only hurt themselves should they violate
this, i.e. they can still only map+pin memory they own, so at worst they
create a bad translation for their own device.  Thanks,

Alex


Re: [PATCH V3 00/22] Live Update
Posted by Steven Sistare 2 years, 11 months ago
On 5/18/2021 3:23 PM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/18/2021 5:57 AM, Dr. David Alan Gilbert wrote:
>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>> On 5/14/2021 7:53 AM, Stefan Hajnoczi wrote:
>>>>> On Thu, May 13, 2021 at 04:21:15PM -0400, Steven Sistare wrote:
>>>>>> On 5/12/2021 12:42 PM, Stefan Hajnoczi wrote:
>>>>>>> On Fri, May 07, 2021 at 05:24:58AM -0700, Steve Sistare wrote:
>>>>>>>> Provide the cprsave and cprload commands for live update.  These save and
>>>>>>>> restore VM state, with minimal guest pause time, so that qemu may be updated
>>>>>>>> to a new version in between.
>>>>>>>>
>>>>>>>> cprsave stops the VM and saves vmstate to an ordinary file.  It supports two
>>>>>>>> modes: restart and reboot.  For restart, cprsave exec's the qemu binary (or
>>>>>>>> /usr/bin/qemu-exec if it exists) with the same argv.  qemu restarts in a
>>>>>>>> paused state and waits for the cprload command.
>>>>>>>
>>>>>>> I think cprsave/cprload could be generalized by using QMP to stash the
>>>>>>> file descriptors. The 'getfd' QMP command already exists and QEMU code
>>>>>>> already opens fds passed using this mechanism.
>>>>>>>
>>>>>>> I haven't checked but it may be possible to drop some patches by reusing
>>>>>>> QEMU's monitor file descriptor passing since the code already knows how
>>>>>>> to open from 'getfd' fds.
>>>>>>>
>>>>>>> The reason why using QMP is interesting is because it eliminates the
>>>>>>> need for execve(2). QEMU may be unable to execute a program due to
>>>>>>> chroot, seccomp, etc.
>>>>>>>
>>>>>>> QMP would enable cprsave/cprload to work both with and without
>>>>>>> execve(2).
>>>>>>>
>>>>>>> One tricky thing with this approach might be startup ordering: how to
>>>>>>> get fds via the QMP monitor in the new process before processing the
>>>>>>> entire command-line.
>>>>>>
>>>>>> Early on I experimented with a similar approach.  Old qemu passed descriptors to an
>>>>>> escrow process and exited; new qemu started and retrieved the descriptors from escrow.
>>>>>> vfio mostly worked after I hacked the kernel to suppress the original-pid owner check.
>>>>>> I suspect my recent vfio extensions would smooth the rough edges.
>>>>>
>>>>> I wonder about the reason for VFIO's pid limitation, maybe because it
>>>>> pins pages from the original process?
>>>>
>>>> The dma unmap code verifies that the requesting task is the same as the task that mapped
>>>> the pages.  We could add an ioctl that passes ownership to a new task.  We would also need
>>>> to fix locked memory accounting, which is associated with the mm of the original task.
>>>
>>>>> Is this VFIO pid limitation the main reason why you chose to make QEMU
>>>>> execve(2) the new binary?
>>>>
>>>> That is one.  Plus, re-attaching to named shared memory for pc.ram causes the vfio conflict
>>>> errors I mentioned in the previous email.  We would need to suppress redundant dma map calls,
>>>> but allow legitimate dma maps and unmaps in response to the ongoing address space changes and
>>>> diff callbacks caused by some drivers. It would be messy and fragile. In general, it felt like 
>>>> I was working against vfio rather than with it.
>>>
>>> OK the weirdness of vfio helps explain a bit about why you're doing it
>>> this way; can you help separate some difference between restart and
>>> reboot for me though:
>>>
>>> In 'reboot' mode, where the guest must do suspend in its drivers, how
>>> many of these vfio requirements are needed?  I guess the memfd use
>>> for the anonymous areas isn't any use for reboot mode.
>>
>> Correct.  For reboot no special vfio support or fiddling is needed.
>>
>>> You mention cprsave calls VFIO_DMA_UNMAP_FLAG_VADDR - after that does
>>> vfio still care about the currently-anonymous areas?
>>
>> Yes, for restart mode.  The physical pages behind the anonymous memory remain pinned and 
>> are targets for ongoing DMA.  Post-exec qemu needs a way to find those same pages.
> 
> Is it possible with vfio to map it into multiple processes
> simultaneously or does it have to only be one at a time?
> Are you saying that you have no way to shut off DMA, and thus you can
> never know it's safe to terminate the source process?
> 
>>>> Another big reason is a requirement to preserve anonymous memory for legacy qemu updates (via
>>>> code injection which I briefly mentioned in KVM forum).  If we extend cpr to allow updates 
>>>> without exec, I still need the exec option.
>>>
>>> Can you explain what that code injection mechanism is for those of us
>>> who didn't see that?
>>
>> Sure.  Here is slide 12 from the talk.  It relies on mmap(MADV_DOEXEC) which was not
>> accepted upstream.
> 
> In this series, without MADV_DOEXEC, how do you guarantee the same HVA
> in source and destination - or is that not necessary?

Not necessary.  We can safely change the HVA using the new vfio ioctls.
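
In sketch form, using the flag names this series adds to
linux-headers/linux/vfio.h (variable names illustrative):

    /* old process, before exec: invalidate the vaddr; the IOVA->HPA
     * mapping and its pinned pages remain intact */
    struct vfio_iommu_type1_dma_unmap unmap = {
        .argsz = sizeof(unmap),
        .flags = VFIO_DMA_UNMAP_FLAG_VADDR,
        .iova  = iova,
        .size  = size,
    };
    ioctl(container_fd, VFIO_IOMMU_UNMAP_DMA, &unmap);

    /* new process, after exec: register the new, equivalent vaddr for
     * the same iova range, and translation can resume */
    struct vfio_iommu_type1_dma_map map = {
        .argsz = sizeof(map),
        .flags = VFIO_DMA_MAP_FLAG_VADDR,
        .vaddr = (__u64)(uintptr_t)new_hva,
        .iova  = iova,
        .size  = size,
    };
    ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);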

>> -----------------------------------------------------------------------------
>> Legacy Live Update
>>
>>  * Update legacy qemu process to latest version
>>    - Inject code into legacy qemu process to perform cprsave: vmsave.so
>>      . Access qemu data structures and globals
>>        - eg ram_list, savevm_state, chardevs, vhost_devices
>>        - dlopen does not resolve them, must get addresses via symbol lookup.
>>      . Delete some vmstate handlers, register new ones (eg vfio)
>>      . Call MADV_DOEXEC on guest memory. Find devices, preserve fd
>>  * Hot patch a monitor function to dlopen vmsave.so, call entry point
>>    - write patch to /proc/pid/mem
>>    - Call the monitor function via monitor socket
>>  * Send cprload to update qemu
>>  * vmsave.so has binary dependency on qemu data structures and variables
>>    - Build vmsave-ver.so per legacy version
>>    - Indexed by qemu's gcc build-id
>>
>> -----------------------------------------------------------------------------
> 
> That's hairy!
> At that point isn't it easier to recompile a patched qemu against the
> original sources and ptrace something in to mmap the new qemu?

That could work, but safely capturing all the threads and forcing them to jump to the
mmap'd qemu is hard.

- Steve

>>>>>> However, the main issue is that guest ram must be backed by named shared memory, and
>>>>>> we would need to add code to support shared memory for all the secondary memory objects.
>>>>>> That makes it less interesting for us at this time; we care about updating legacy qemu 
>>>>>> instances with anonymous guest memory.
>>>>>
>>>>> Thanks for explaining this more in the other sub-thread. The secondary
>>>>> memory objects you mentioned are relatively small so I don't think
>>>>> saving them in the traditional way is a problem.
>>>>>
>>>>> Two approaches for zero-copy memory migration fit into QEMU's existing
>>>>> migration infrastructure:
>>>>>
>>>>> - Marking RAM blocks that are backed by named memory (tmpfs, hugetlbfs,
>>>>>   etc) so they are not saved into the savevm file. The existing --object
>>>>>   memory-backend-file syntax can be used.
>>>>>
>>>>> - Extending the live migration protocol to detect when file descriptor
>>>>>   passing is available (i.e. UNIX domain socket migration) and using
>>>>>   that for memory-backend-* objects that have fds.
>>>>>
>>>>> Either of these approaches would handle RAM with existing savevm/migrate
>>>>> commands.
>>>>
>>>> Yes, but the vfio issues would still need to be solved, and we would need new
>>>> command line options to back existing and future secondary memory objects with 
>>>> named shared memory.
>>>>
>>>>> The remaining issue is how to migrate VFIO and other file descriptors
>>>>> that cannot be reopened by the new process. As mentioned, QEMU already
>>>>> has file descriptor passing support in the QMP monitor and support for
>>>>> opening passed file descriptors (see qemu_open_internal(),
>>>>> monitor_fd_param(), and socket_get_fd()).
>>>>>
>>>>> The advantage of integrating live update functionality into the existing
>>>>> savevm/migrate commands is that it will work in more use cases with
>>>>> less code duplication and maintenance, and less bitrot, than the
>>>>> special-case cprsave command in this patch series.
>>>>>
>>>>> Maybe there is a fundamental technical reason why live update needs to
>>>>> be different from QEMU's existing migration commands but I haven't
>>>>> figured it out yet.
>>>>
>>>> vfio and anonymous memory.
>>>>
>>>> Regarding code duplication, I did consider whether to extend the migration
>>>> syntax and implementation versus creating something new.  Those functions
>>>> handle stuff like bdrv snapshot, aio, and migration which are n/a for the cpr
>>>> use case, and the cpr functions handle state that is n/a for the migration case.
>>>> I judged that handling both in the same functions would be less readable and
>>>> maintainable.  After feedback during the V1 review, I simplified the cprsave
>>>> code by calling qemu_save_device_state, as Xen does, thus eliminating any
>>>> interaction with the migration code.
>>>>
>>>> Regarding bit rot, I still need to add a cpr test to the test suite, when the 
>>>> review is more complete and folks agree on the final form of the functionality.
>>>>
>>>> I do like the idea of supporting update without exec, but as a future project, 
>>>> and not at the expense of dropping update with exec.
>>>>
>>>> - Steve
>>>>
>>

Re: [PATCH V3 00/22] Live Update [reboot]
Posted by Dr. David Alan Gilbert 2 years, 11 months ago
Hi Steven,
  I'd like to split the discussion into reboot and restart,
so I can make sure I understand them individually.

So reboot mode;
Can you explain which parts of this series are needed for reboot mode?
I've managed to do a kexec-based reboot on qemu with the current qemu,
albeit with no vfio devices. My current understanding is that for doing
reboot with vfio we just need some way of getting migrate to send the
metadata associated with vfio devices if the guest is in S3.

Is there something I'm missing and which you have in this series?

Dave

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


Re: [PATCH V3 00/22] Live Update [reboot]
Posted by Steven Sistare 2 years, 11 months ago
On 5/20/2021 9:00 AM, Dr. David Alan Gilbert wrote:
> Hi Steven,
>   I'd like to split the discussion into reboot and restart,
> so I can make sure I understand them individually.
> 
> So reboot mode;
> Can you explain which parts of this series are needed for reboot mode?
> I've managed to do a kexec-based reboot on qemu with the current qemu,
> albeit with no vfio devices. My current understanding is that for doing
> reboot with vfio we just need some way of getting migrate to send the
> metadata associated with vfio devices if the guest is in S3.
> 
> Is there something I'm missing and which you have in this series?

You are correct: this series has little special code for reboot mode, but it does allow
reboot and restart to be handled similarly, which simplifies the management layer because
the same calls are performed for each mode.

For vfio in reboot mode, prior to sending cprsave, the manager sends the guest-suspend-ram
command to the qemu guest agent.  This flushes requests and brings the guest devices to a
reset state, so there is no vfio metadata to save.  Reboot mode does not call vfio_cprsave.

There are a few unique patches to support reboot mode.  One is qemu_ram_volatile, which
is a sanity check that the writable ram blocks are backed by some form of shared memory.
Plus there are a few fragments in the "cpr" patch that handle the suspended state that
is induced by guest-suspend-ram.  See qemu_system_start_on_wake_request() and instances
of RUN_STATE_SUSPENDED in migration/cpr.c.
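
As a rough sketch of that check - assuming the existing
qemu_ram_foreach_block() and qemu_ram_is_shared() helpers; the patch
itself may differ in detail:

  #include "qemu/osdep.h"
  #include "exec/cpu-common.h"
  #include "qapi/error.h"

  /* Stop the iteration (non-zero) at the first block that is not backed
   * by shared memory.  Read-only blocks (ROM) would also need to be
   * exempted here. */
  static int volatile_ram_block(RAMBlock *rb, void *opaque)
  {
      Error **errp = opaque;

      if (!qemu_ram_is_shared(rb)) {
          error_setg(errp, "RAM block %s is volatile (not shared)",
                     qemu_ram_get_idstr(rb));
          return 1;
      }
      return 0;
  }

  /* Return non-zero, with errp set, if guest RAM would be lost across
   * a reboot. */
  int qemu_ram_volatile(Error **errp)
  {
      return qemu_ram_foreach_block(volatile_ram_block, errp);
  }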

- Steve

Re: [PATCH V3 00/22] Live Update [restart]
Posted by Dr. David Alan Gilbert 2 years, 11 months ago
On the 'restart' branch of questions; can you explain,
other than the passing of the fd's, why the outgoing side of
qemu's 'migrate exec:' doesn't work for you?

Dave

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


Re: [PATCH V3 00/22] Live Update [restart]
Posted by Steven Sistare 2 years, 11 months ago
On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> On the 'restart' branch of questions; can you explain,
> other than the passing of the fd's, why the outgoing side of
> qemu's 'migrate exec:' doesn't work for you?

I'm not sure what I should describe.  Can you be more specific?
Do you mean: can we add the cpr specific bits to the migrate exec code?

- Steve

Re: [PATCH V3 00/22] Live Update [restart]
Posted by Dr. David Alan Gilbert 2 years, 11 months ago
* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> > On the 'restart' branch of questions; can you explain,
> > other than the passing of the fd's, why the outgoing side of
> > qemu's 'migrate exec:' doesn't work for you?
> 
> I'm not sure what I should describe.  Can you be more specific?
> Do you mean: can we add the cpr specific bits to the migrate exec code?

Yes; if possible I'd prefer to just keep the one exec mechanism.
It has an advantage of letting you specify the new command line; that
avoids the problems I'd pointed out with needing to change the command
line if a hotplug had happened.  It also means we only need one chunk of
exec code.

Dave

> - Steve
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


Re: [PATCH V3 00/22] Live Update [restart]
Posted by Steven Sistare 2 years, 10 months ago
On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
>>> On the 'restart' branch of questions; can you explain,
>>> other than the passing of the fd's, why the outgoing side of
>>> qemu's 'migrate exec:' doesn't work for you?
>>
>> I'm not sure what I should describe.  Can you be more specific?
>> Do you mean: can we add the cpr specific bits to the migrate exec code?
> 
> Yes; if possible I'd prefer to just keep the one exec mechanism.
> It has an advantage of letting you specify the new command line; that
> avoids the problems I'd pointed out with needing to change the command
> line if a hotplug had happened.  It also means we only need one chunk of
> exec code.

How/where would you specify a new command line?  Are you picturing the usual migration
setup where you start a second qemu with its own arguments, plus a migrate_incoming
option or command?  That does not work for live update restart; the old qemu must exec
the new qemu.  Or something else?

We could shoehorn cpr restart into the migrate exec path by defining a new migration 
capability that the client would set before issuing the migrate command.  However, the
implementation code would be sprinkled with conditionals to suppress migrate-only bits
and call cpr-only bits.  IMO that would be less maintainable than having a separate
cprsave function.  Currently cprsave does not duplicate any migration functionality.
cprsave calls qemu_save_device_state() which is used by xen.

- Steve


Re: [PATCH V3 00/22] Live Update [restart]
Posted by Dr. David Alan Gilbert 2 years, 10 months ago
* Steven Sistare (steven.sistare@oracle.com) wrote:
> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
> > * Steven Sistare (steven.sistare@oracle.com) wrote:
> >> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
> >>> On the 'restart' branch of questions; can you explain,
> >>> other than the passing of the fd's, why the outgoing side of
> >>> qemu's 'migrate exec:' doesn't work for you?
> >>
> >> I'm not sure what I should describe.  Can you be more specific?
> >> Do you mean: can we add the cpr specific bits to the migrate exec code?
> > 
> > Yes; if possible I'd prefer to just keep the one exec mechanism.
> > It has an advantage of letting you specify the new command line; that
> > avoids the problems I'd pointed out with needing to change the command
> > line if a hotplug had happened.  It also means we only need one chunk of
> > exec code.
> 
> How/where would you specify a new command line?  Are you picturing the usual migration
> setup where you start a second qemu with its own arguments, plus a migrate_incoming
> option or command?  That does not work for live update restart; the old qemu must exec
> the new qemu.  Or something else?

The existing migration path allows an exec - originally intended to exec
something like a compressor or a store to file rather than a real
migration; i.e. you can do:

  migrate "exec:gzip > mig"

and that will save the migration stream to a compressed file called mig.
Now, I *think* we can already do:

  migrate "exec:path-to-qemu command line parameters -incoming 'hmmmmm'"
(That's probably cleaner via the QMP interface).

I'm not quite sure what I want in the incoming there, but that is
already the source execing the destination qemu - although I think we'd
probably need to check if that's actually via an intermediary.
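
(For reference, the outgoing side of "exec:" really is just
spawn-and-pipe; migration/exec.c amounts to roughly the following,
lightly trimmed:)

  /* Spawn the command and stream the migration data to its stdin. */
  void exec_start_outgoing_migration(MigrationState *s, const char *command,
                                     Error **errp)
  {
      QIOChannel *ioc;
      const char *argv[] = { "/bin/sh", "-c", command, NULL };

      ioc = QIO_CHANNEL(qio_channel_command_new_spawn(argv, O_RDWR, errp));
      if (!ioc) {
          return;
      }
      qio_channel_set_name(ioc, "migration-exec-outgoing");
      migration_channel_connect(s, ioc, NULL, NULL);
      object_unref(OBJECT(ioc));
  }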

> We could shoehorn cpr restart into the migrate exec path by defining a new migration 
> capability that the client would set before issuing the migrate command.  However, the
> implementation code would be sprinkled with conditionals to suppress migrate-only bits
> and call cpr-only bits.  IMO that would be less maintainable than having a separate
> cprsave function.  Currently cprsave does not duplicate any migration functionality.
> cprsave calls qemu_save_device_state() which is used by xen.

To me it feels like cprsave in particular is replicating more code.

It's also jumping through hoops in places to avoid changing the
command line; that's going to cause more pain for a lot of people - not
just because of the hacks scattered around for that, but because a lot
of people are actually going to need to change the command line even in
a cpr-like case (e.g. due to hotplug, or to something else about the
environment changing, like auth data or the route to storage or
networking).

There are hooks for early parameter parsing, so we can add extra
command line args if we need to; but, for example, the QEMU_START_FREEZE
hack to add -S just isn't needed once you let go of the idea of needing
an identical command line.

Dave

> - Steve
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


Re: [PATCH V3 00/22] Live Update [restart] : code replication
Posted by Steven Sistare 2 years, 10 months ago
On 6/3/2021 3:36 PM, Dr. David Alan Gilbert wrote:
> * Steven Sistare (steven.sistare@oracle.com) wrote:
>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
>>>>> On the 'restart' branch of questions; can you explain,
>>>>> other than the passing of the fd's, why the outgoing side of
>>>>> qemu's 'migrate exec:' doesn't work for you?
>>>>
>>>> I'm not sure what I should describe.  Can you be more specific?
>>>> Do you mean: can we add the cpr specific bits to the migrate exec code?
>>>
>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
>>> It has an advantage of letting you specify the new command line; that
>>> avoids the problems I'd pointed out with needing to change the command
>>> line if a hotplug had happened.  It also means we only need one chunk of
>>> exec code.
>>
>> [...]
> 
> I'm not quite sure what I want in the incoming there, but that is
> already the source execing the destination qemu - although I think we'd
> probably need to check if that's actually via an intermediary.
> 
>> We could shoehorn cpr restart into the migrate exec path by defining a new migration 
>> capability that the client would set before issuing the migrate command.  However, the
>> implementation code would be sprinkled with conditionals to suppress migrate-only bits
>> and call cpr-only bits.  IMO that would be less maintainable than having a separate
>> cprsave function.  Currently cprsave does not duplicate any migration functionality.
>> cprsave calls qemu_save_device_state() which is used by xen.
> 
> To me it feels like cprsave in particular is replicating more code. 

In the attached file I annotated lines of code that have some overlap
with migration code actions.  They include vm control, global_state_store,
and vmstate save, and cover 18 lines of 78 total.  I did not include the
body of qf_file_open because it is also called by xen.

The migration code adds capabilities, parameters, state, status, info,
precopy, postcopy, dirty bitmap, events, notifiers, 6 channel types,
blockers, pause+resume, return path, request-reply commands, throttling, colo,
blocks, phases, iteration, and threading, implemented by 20000+ lines of code.
To me it seems wrong to throw cpr into that mix to avoid adding tens of lines 
of similar code.

- Steve
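
  /* Lines marked ** have some overlap with migration code actions. */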
  void cprsave(const char *file, CprMode mode, Error **errp)
  {
      int ret = 0;
**    QEMUFile *f;
**    int saved_vm_running = runstate_is_running();
      bool restart = (mode == CPR_MODE_RESTART);
      bool reboot = (mode == CPR_MODE_REBOOT);
  
      if (reboot && qemu_ram_volatile(errp)) {
          return;
      }
  
      if (restart && xen_enabled()) {
          error_setg(errp, "xen does not support cprsave restart");
          return;
      }
  
      if (migrate_colo_enabled()) {
          error_setg(errp, "error: cprsave does not support x-colo");
          return;
      }
  
      if (replay_mode != REPLAY_MODE_NONE) {
          error_setg(errp, "error: cprsave does not support replay");
          return;
      }
  
      f = qf_file_open(file, O_CREAT | O_WRONLY | O_TRUNC, 0600, "cprsave", errp);
      if (!f) {
          return;
      }
  
**    ret = global_state_store();
**    if (ret) {
**        error_setg(errp, "Error saving global state");
**        qemu_fclose(f);
**        return;
**    }
      if (runstate_check(RUN_STATE_SUSPENDED)) {
          /* Update timers_state before saving.  Suspend did not do so. */
          cpu_disable_ticks();
      }
**    vm_stop(RUN_STATE_SAVE_VM);
  
      cpr_is_active = true;
**    ret = qemu_save_device_state(f);
**    qemu_fclose(f);
**    if (ret < 0) {
**        error_setg(errp, "Error %d while saving VM state", ret);
**        goto err;
**    }
  
      if (reboot) {
          shutdown_action = SHUTDOWN_ACTION_POWEROFF;
          qemu_system_shutdown_request(SHUTDOWN_CAUSE_GUEST_SHUTDOWN);
      } else if (restart) {
          if (!qemu_chr_cpr_capable(errp)) {
              goto err;
          }
          if (vfio_cprsave(errp)) {
              goto err;
          }
          walkenv(FD_PREFIX, preserve_fd, 0);  /* keep fds recorded in the environment open across exec */
          vhost_dev_reset_all();               /* reset vhost devices prior to exec */
          qemu_term_exit();                    /* restore terminal settings */
          setenv("QEMU_START_FREEZE", "", 1);  /* tell the new qemu to start frozen (-S) */
          qemu_system_exec_request();          /* ask the main loop to exec the new qemu */
      }
      goto done;
  
  err:
**    if (saved_vm_running) {
**        vm_start();
**    }
  done:
      cpr_is_active = false;
      return;
  }
Re: [PATCH V3 00/22] Live Update [restart] : code replication
Posted by Steven Sistare 2 years, 10 months ago
On 6/7/2021 2:08 PM, Steven Sistare wrote:
> On 6/3/2021 3:36 PM, Dr. David Alan Gilbert wrote:
>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>> On 5/24/2021 6:39 AM, Dr. David Alan Gilbert wrote:
>>>> * Steven Sistare (steven.sistare@oracle.com) wrote:
>>>>> On 5/20/2021 9:13 AM, Dr. David Alan Gilbert wrote:
>>>>>> On the 'restart' branch of questions; can you explain,
>>>>>> other than the passing of the fd's, why the outgoing side of
>>>>>> qemu's 'migrate exec:' doesn't work for you?
>>>>>
>>>>> I'm not sure what I should describe.  Can you be more specific?
>>>>> Do you mean: can we add the cpr specific bits to the migrate exec code?
>>>>
>>>> Yes; if possible I'd prefer to just keep the one exec mechanism.
>>>> It has an advantage of letting you specify the new command line; that
>>>> avoids the problems I'd pointed out with needing to change the command
>>>> line if a hotplug had happened.  It also means we only need one chunk of
>>>> exec code.
>>>
>>> [...]
>>
>> I'm not quite sure what I want in the incoming there, but that is
>> already the source execing the destination qemu - although I think we'd
>> probably need to check if that's actually via an intermediary.
>>
>>> We could shoehorn cpr restart into the migrate exec path by defining a new migration 
>>> capability that the client would set before issuing the migrate command.  However, the
>>> implementation code would be sprinkled with conditionals to suppress migrate-only bits
>>> and call cpr-only bits.  IMO that would be less maintainable than having a separate
>>> cprsave function.  Currently cprsave does not duplicate any migration functionality.
>>> cprsave calls qemu_save_device_state() which is used by xen.
>>
>> To me it feels like cprsave in particular is replicating more code. 
> 
> In the attached file I annotated lines of code that have some overlap
> with migration code actions.  They include vm control, global_state_store,
> and vmstate save, and cover 18 lines of 78 total.  I did not include the
> body of qf_file_open because it is also called by xen.
> 
> The migration code adds capabilities, parameters, state, status, info,
> precopy, postcopy, dirty bitmap, events, notifiers, 6 channel types,
> blockers, pause+resume, return path, request-reply commands, throttling, colo,
> blocks, phases, iteration, and threading, implemented by 20000+ lines of code.
> To me it seems wrong to throw cpr into that mix to avoid adding tens of lines 
> of similar code.

Hi David, what is your decision? Will you accept separate cpr commands?
One last point: Xen made a similar choice, adding the xen-save-devices-state
command, which calls qemu_save_device_state() instead of migration_thread().

- Steve