[libvirt RFCv11 00/33] multifd save restore prototype

Claudio Fontana posted 33 patches 1 year, 11 months ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/libvirt tags/patchew/20220607091936.7948-1-cfontana@suse.de
There is a newer version of this series
[libvirt RFCv11 00/33] multifd save restore prototype
Posted by Claudio Fontana 1 year, 11 months ago
This is v11 of the multifd save prototype, which focuses on saving
to a single file instead of requiring multiple separate files for
multifd channels.

This series demonstrates a way to save and restore from a single file
by using interleaved channels of a size equal to the transfer buffer
size (1MB), and relying on UNIX holes to avoid wasting physical disk
space due to channels of different size.
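The interleaved layout described above can be sketched as follows. This is a minimal illustration under assumed names (`slot_offset`, `NCHANNELS` are invented here), not the series' actual code:

```python
import os
import tempfile

BUF = 1024 * 1024   # 1 MiB slots, matching the transfer buffer size
NCHANNELS = 4       # illustrative number of multifd channels

def slot_offset(channel, seq, header_len=0):
    """Offset of the seq-th buffer of a channel. Slots are assigned
    round-robin, so slots never written by shorter channels stay holes."""
    return header_len + (seq * NCHANNELS + channel) * BUF

# Channel 0 writes three buffers, channel 3 only one; the slots in
# between are never touched, so on a filesystem with hole support they
# consume no physical disk space.
path = os.path.join(tempfile.gettempdir(), "multifd-layout-demo.img")
fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
for seq in range(3):
    os.pwrite(fd, b"\x01" * BUF, slot_offset(0, seq))
os.pwrite(fd, b"\x02" * BUF, slot_offset(3, 0))

logical = os.fstat(fd).st_size            # extends to the last written slot
physical = os.fstat(fd).st_blocks * 512   # typically much less than logical
os.close(fd)
os.remove(path)
print(logical, physical)
```

On save, each channel pwrites its buffers into its own slot sequence; on restore, each reading thread preads the same offsets back, which is what lets a single file carry parallel channels.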

KNOWN ISSUES:

a) still applies only to save/restore (no managed save etc)

b) this is not done in QEMU itself, where it could be possible to teach
   QEMU to migrate directly to a file or block device in a
   block-aligned way by altering the migration stream code and the
   state migration code of all devices.


changes from v10:

* virfile: add new API virFileDiskCopyChannel, which extends the
  existing virFileDiskCopy to work with parallel channels in the file.

* drop use of virthread API, use GLIB for threads.

* pass only a single FD to the multifd-helper, which will then open
  additional FDs as required for the multithreaded I/O.

* simplify virQEMUSaveFd API, separating the initialization from the
  addition of extra channels.

* adapt all documentation to mention a single file instead of multiple.

* remove the "Lim" versions of virFileDirectRead and virFileDirectWrite;
  they are not needed.

---

changes from v9:

* exposed virFileDirectAlign

* separated the QEMU_SAVE_VERSION >= 2 change into its own patch

* reworked the write code to add the alignment padding to the
  data_len, making the on disk format compatible when loaded
  from an older libvirt.

* reworked the read code to use direct I/O APIs only for actual
  direct I/O file descriptors, so as to make old images work
  with newer libvirt.

---

changes from v8:

* rebased on master

* reordered patches to add more upstreamable content at the start

* split introduction of virQEMUSaveFd, so the first part is multifd-free

* new virQEMUSaveDataRead as a mirror of virQEMUSaveDataWrite

* introduced virFileDirect API, using it in virFileDisk operations and
  for virQEMUSaveRead and virQEMUSaveWrite

---

changes from v7:

* [ base params API and iohelper refactoring upstreamed ]

* extended the QEMU save image format more, to record the number
  of multifd channels on save. Made the data header struct packed.

* removed --parallel-connections from the restore command, as it is
  now unnecessary thanks to the QEMU save image format extension.

* separate out patches to expose migration_params APIs to saveimage,
  including qemuMigrationParamsSetString, SetCap, SetInt.

* fixed bugs in the ImageOpen patch (missing saveFd init), removed
  some whitespace, and fixed some convoluted code paths for return
  value -3.

---

changes from v6:

* improved error path handling, with error messages and especially
  cancellation of qemu process on error during restore.

* split patches more and reordered them to keep general refactoring
  at the beginning before the --parallel stuff is introduced.

* improved multifd compression support, including adding an enum
  and extending the QEMU save image format to record the compression
  used on save, and pick it up automatically on restore.

---

changes from v4:

* runIO renamed to virFileDiskCopy and rethought arguments

* renamed new APIs from ...ParametersFlags to ...Params

* introduce the new virDomainSaveParams and virDomainRestoreParams
  without any additional parameters, so they can be upstreamed first.

* solved the issue in the gendispatch.pl script generating code that
  was missing the conn parameter.

---

changes from v3:

* reordered series to have all helper-related changes at the start

* solved all reported issues from ninja test, including documentation

* fixed most broken migration capabilities code (still imperfect likely)

* added G_GNUC_UNUSED as needed

* after multifd restore, added what I think were the missing operations:

  qemuProcessRefreshState(),
  qemuProcessStartCPUs() - most importantly,
  virDomainObjSave()

  The domain now starts running after restore without further encouragement

* removed the sleep(10) from the multifd-helper

---

changes from v2:

* added ability to restore the VM from disk using multifd

* fixed the multifd-helper to work in both directions,
  assuming the need to listen for save, and connect for restore.

* fixed a large number of bugs, and probably introduced some :-)

---

Claudio Fontana (33):
  virfile: introduce virFileDirect APIs
  virfile: use virFileDirect API in runIOCopy
  qemu: saveimage: rework image read/write to be O_DIRECT friendly
  qemu: saveimage: assume future formats will also support compression
  virfile: virFileDiskCopy: prepare for O_DIRECT files without wrapper
  qemu: saveimage: introduce virQEMUSaveFd
  qemu: saveimage: convert qemuSaveImageCreate to use virQEMUSaveFd
  qemu: saveimage: convert qemuSaveImageOpen to use virQEMUSaveFd
  tools: prepare doSave to use parameters
  tools: prepare cmdRestore to use parameters
  libvirt: add new VIR_DOMAIN_SAVE_PARALLEL flag and parameter
  qemu: add stub support for VIR_DOMAIN_SAVE_PARALLEL in save
  qemu: add stub support for VIR_DOMAIN_SAVE_PARALLEL in restore
  virfile: add new API virFileDiskCopyChannel
  multifd-helper: new helper for parallel save/restore
  qemu: saveimage: update virQEMUSaveFd struct for parallel save
  qemu: saveimage: wire up saveimage code with the multifd helper
  qemu: capabilities: add multifd to the probed migration capabilities
  qemu: saveimage: add multifd related fields to save format
  qemu: migration_params: add APIs to set Int and Cap
  qemu: migration: implement qemuMigrationSrcToFilesMultiFd for save
  qemu: add parameter to qemuMigrationDstRun to skip waiting
  qemu: implement qemuSaveImageLoadMultiFd for restore
  tools: add parallel parameter to virsh save command
  tools: add parallel parameter to virsh restore command
  qemu: add migration parameter multifd-compression
  libvirt: add new VIR_DOMAIN_SAVE_PARAM_PARALLEL_COMPRESSION
  qemu: saveimage: add parallel compression argument to ImageCreate
  qemu: saveimage: add stub support for multifd compression parameter
  qemu: migration: expose qemuMigrationParamsSetString
  qemu: saveimage: implement multifd-compression in parallel save
  qemu: saveimage: restore compressed parallel images
  tools: add parallel-compression parameter to virsh save command

 docs/manpages/virsh.rst                       |  26 +-
 include/libvirt/libvirt-domain.h              |  24 +
 po/POTFILES                                   |   1 +
 src/libvirt_private.syms                      |   6 +
 src/qemu/qemu_capabilities.c                  |   4 +
 src/qemu/qemu_capabilities.h                  |   2 +
 src/qemu/qemu_driver.c                        | 146 ++--
 src/qemu/qemu_migration.c                     | 160 ++--
 src/qemu/qemu_migration.h                     |  16 +-
 src/qemu/qemu_migration_params.c              |  71 +-
 src/qemu/qemu_migration_params.h              |  15 +
 src/qemu/qemu_process.c                       |   3 +-
 src/qemu/qemu_process.h                       |   5 +-
 src/qemu/qemu_saveimage.c                     | 703 +++++++++++++-----
 src/qemu/qemu_saveimage.h                     |  69 +-
 src/qemu/qemu_snapshot.c                      |   6 +-
 src/util/iohelper.c                           |   3 +
 src/util/meson.build                          |  16 +
 src/util/multifd-helper.c                     | 359 +++++++++
 src/util/virfile.c                            | 391 +++++++---
 src/util/virfile.h                            |  11 +
 .../caps_4.0.0.aarch64.xml                    |   1 +
 .../qemucapabilitiesdata/caps_4.0.0.ppc64.xml |   1 +
 .../caps_4.0.0.riscv32.xml                    |   1 +
 .../caps_4.0.0.riscv64.xml                    |   1 +
 .../qemucapabilitiesdata/caps_4.0.0.s390x.xml |   1 +
 .../caps_4.0.0.x86_64.xml                     |   1 +
 .../caps_4.1.0.x86_64.xml                     |   1 +
 .../caps_4.2.0.aarch64.xml                    |   1 +
 .../qemucapabilitiesdata/caps_4.2.0.ppc64.xml |   1 +
 .../qemucapabilitiesdata/caps_4.2.0.s390x.xml |   1 +
 .../caps_4.2.0.x86_64.xml                     |   1 +
 .../caps_5.0.0.aarch64.xml                    |   2 +
 .../qemucapabilitiesdata/caps_5.0.0.ppc64.xml |   2 +
 .../caps_5.0.0.riscv64.xml                    |   2 +
 .../caps_5.0.0.x86_64.xml                     |   2 +
 .../qemucapabilitiesdata/caps_5.1.0.sparc.xml |   2 +
 .../caps_5.1.0.x86_64.xml                     |   2 +
 .../caps_5.2.0.aarch64.xml                    |   2 +
 .../qemucapabilitiesdata/caps_5.2.0.ppc64.xml |   2 +
 .../caps_5.2.0.riscv64.xml                    |   2 +
 .../qemucapabilitiesdata/caps_5.2.0.s390x.xml |   2 +
 .../caps_5.2.0.x86_64.xml                     |   2 +
 .../caps_6.0.0.aarch64.xml                    |   2 +
 .../qemucapabilitiesdata/caps_6.0.0.s390x.xml |   2 +
 .../caps_6.0.0.x86_64.xml                     |   2 +
 .../caps_6.1.0.x86_64.xml                     |   2 +
 .../caps_6.2.0.aarch64.xml                    |   2 +
 .../qemucapabilitiesdata/caps_6.2.0.ppc64.xml |   2 +
 .../caps_6.2.0.x86_64.xml                     |   2 +
 .../caps_7.0.0.aarch64.xml                    |   2 +
 .../qemucapabilitiesdata/caps_7.0.0.ppc64.xml |   2 +
 .../caps_7.0.0.x86_64.xml                     |   2 +
 .../caps_7.1.0.x86_64.xml                     |   2 +
 tools/virsh-domain.c                          | 101 ++-
 55 files changed, 1748 insertions(+), 445 deletions(-)
 create mode 100644 src/util/multifd-helper.c

-- 
2.26.2
Re: [libvirt RFCv11 00/33] multifd save restore prototype
Posted by Claudio Fontana 6 months, 3 weeks ago
Hello Daniel and all @libvirt,

reviving this old thread for multifd save/restore (for high-performance migration of a large VM to an NVMe disk and vice versa),
now that we have made substantial progress on point b) below :-)

There is a lot of ongoing work in QEMU, spearheaded by Fabiano, to migrate to/from disk in an optimized way, leveraging among other things multifd and file offsets.

A couple of questions come to mind for the libvirt side, though.

In terms of our use case, we would need to trigger these migrations from virsh save, restore, managedsave / start.

1) Can you confirm this is still a good target?
It would seem right from my perspective to hook up save/restore first, and then reuse the same mechanism for managedsave / start.


2) Do we expect to pass filename or file descriptor from libvirt into QEMU?


As is, libvirt today generally passes an already opened file descriptor to QEMU for migrations, roughly:

{"execute": "getfd", "arguments": {"fdname":"migrate"}} (passing the already-open fd from libvirt, e.g. 10)
{"execute": "migrate", "arguments": {"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"}}

Do we want to change libvirt to migrate to a file: URI? Does this have consequences for "labeling" / security sandboxing?

Or would it be better to continue opening the fd in libvirt, writing the libvirt header, and then passing the existing open fd to QEMU, using QMP command "getfd",
followed by "migrate"? In this second case we would need to inform QEMU of the offset into the already open fd.

Internally, the QEMU multifd code just reads and writes using pread, pwrite, so there is in any case just one fd to worry about,
but who should own it, libvirt or QEMU?


3) How do we deal with O_DIRECT? In the prototype we were setting O_DIRECT on the fd from libvirt in response to the user requesting --bypass-cache,
which is needed 99% of the time with large VMs. I think I remember that we plan to write from libvirt normally (without O_DIRECT) and then set the flag later,
but should libvirt or QEMU set the O_DIRECT flag? This likely depends on who owns the fd.


Thank you for your thoughts from the libvirt side and your help in resurrecting this topic,

Claudio



On 6/7/22 11:19, Claudio Fontana wrote:
> [...]
> KNOWN ISSUES:
> 
> a) still applies only to save/restore (no managed save etc)
> 
> b) this is not done in QEMU, where it could be possible to teach
>    QEMU to migrate directly to a file or block device in a
>    block-aligned way by altering the migration stream code and the
>    state migration code of all devices.
> [...]
Re: [libvirt RFCv11 00/33] multifd save restore prototype
Posted by Daniel P. Berrangé 6 months, 3 weeks ago
On Wed, Oct 11, 2023 at 03:46:59PM +0200, Claudio Fontana wrote:
> In terms of our use case, we would need to trigger these migrations from virsh save, restore, managedsave / start.
> 
> 1) Can you confirm this is still a good target?

IIRC the 'dump' command also has a codepath that can exercise
the migrate-to-file logic too.

> It would seem right from my perspective to hook up save/restore first, and then reuse the same mechanism for managedsave / start.

All of save, restore, managedsave, start, dump end up calling
into the same internal helper methods. So once you update these
helpers, you essentially get all the commands converted in one
go.

> 2) Do we expect to pass filename or file descriptor from libvirt into QEMU?
> 
> 
> As is, libvirt today generally passes an already opened file descriptor to QEMU for migrations, roughly:
> 
> {"execute": "getfd", "arguments": {"fdname":"migrate"}} (passing the already open fd from libvirt f.e. 10)
> {"execute": "migrate", "arguments": {"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"}}'
> 
> Do we want to change libvirt to migrate to a file: URI ? Does this have consequence for "labeling" / security sandboxing?
> 
> Or would it be better to continue opening the fd in libvirt, writing the libvirt header, and then passing the existing open fd to QEMU, using QMP command "getfd",
> followed by "migrate"? In this second case we would need to inform QEMU of the offset into the already open fd.

How about both :-)

The current migration 'fd' protocol technically can cope with
any type of FD being passed in. QEMU doesn't try to interpret
the FD type to any significant degree.

The 'file' protocol is explicitly providing a migration transport
supporting random access I/O to storage. As such we can specify
the offset too.

Now the neat trick is that 'file' protocol impl uses
qio_channel_file and this in turn uses qemu_open,
which supports FD passing.

Instead of using 'getfd' though we have to use 'add-fd'.

Anyway, this lets us do FD passing as normal, while also
letting us specify the offset.

 {"execute": "add-fd", "arguments": {"fdset-id": 1}}
 {"execute": "migrate", "arguments": {"detach":true,"blk":false,"inc":false,"uri":"file:/dev/fdset/1,offset=124456"}}

> Internally, the QEMU multifd code just reads and writes using pread, pwrite, so there is in any case just one fd to worry about,
> but who should own it, libvirt or QEMU?

How about both :-)

Libvirt will open the file, in order to write its header.
Then libvirt passes the open FD to QEMU, specifying the
offset, and QEMU does its thing with vmstate, etc and
closes the FD when it's done. libvirt's copy of the FD
is still open, and libvirt can finalize its header and
close the FD.
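That split ownership is easy to sketch with plain pread/pwrite; the header layout and names below are invented for illustration only:

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "save-handoff-demo.img")
HEADER_LEN = 4096   # hypothetical libvirt header size

# "libvirt" opens the file and writes its header up front.
lv_fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
os.pwrite(lv_fd, b"DEMO-SAVE-HEADER".ljust(HEADER_LEN, b"\0"), 0)

# "QEMU" gets its own copy of the fd plus the offset, and preads/pwrites
# the vmstate past the header; no shared file position needs managing.
qemu_fd = os.dup(lv_fd)
os.pwrite(qemu_fd, b"vmstate-bytes", HEADER_LEN)
os.close(qemu_fd)               # QEMU is done and closes its copy

# libvirt's fd is still open: finalize the header, then close.
os.pwrite(lv_fd, b"COMPLETE", HEADER_LEN - 8)
data = os.pread(lv_fd, 13, HEADER_LEN)
os.close(lv_fd)
os.remove(path)
print(data)
```

Because pread/pwrite take explicit offsets, QEMU closing its copy does not disturb libvirt's, and libvirt can rewrite its header at any point afterwards.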

> 3) How do we deal with O_DIRECT? In the prototype we were setting the O_DIRECT on the fd from libvirt in response to the user request for --bypass-cache,
> which is needed 99% of the time with large VMs. I think I remember that we plan to write from libvirt normally (without O_DIRECT) and then set the flag later,
> but should libvirt or QEMU set the O_DIRECT flag? This likely depends on who owns the fd?

For O_DIRECT, the 'file' protocol should gain a new parameter
'bypass_cache: bool'. If this is set to 'true' then QEMU can
set O_DIRECT on the FD it opens or receives from libvirt.

Libvirt probably just has to be careful to unset O_DIRECT
at the end before it finalizes the header.
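For reference, toggling O_DIRECT on an already-open descriptor is a plain fcntl() F_SETFL operation. A hedged sketch of the pattern (O_DIRECT is Linux-specific, and some filesystems such as tmpfs refuse it, hence the fallback):

```python
import fcntl
import os
import tempfile

O_DIRECT = getattr(os, "O_DIRECT", 0)   # Linux-only flag; 0 elsewhere

def set_direct(fd, enable):
    """Set or clear O_DIRECT on fd; returns False if the fs refuses."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    flags = flags | O_DIRECT if enable else flags & ~O_DIRECT
    try:
        fcntl.fcntl(fd, fcntl.F_SETFL, flags)
        return True
    except OSError:
        return False   # e.g. tmpfs does not support O_DIRECT

path = os.path.join(tempfile.gettempdir(), "odirect-demo.img")
fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)

direct_ok = set_direct(fd, True)
# ... bulk vmstate I/O would happen here with aligned buffers ...
set_direct(fd, False)                 # clear before unaligned header I/O
os.pwrite(fd, b"final-header", 0)     # small unaligned write is now safe
hdr = os.pread(fd, 12, 0)
os.close(fd)
os.remove(path)
print(hdr)
```

Libvirt clearing the flag before finalizing its header is exactly the second half of this pattern.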

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [libvirt RFCv11 00/33] multifd save restore prototype
Posted by Claudio Fontana 6 months, 3 weeks ago
Hi Daniel,

thanks for your answer,

On 10/11/23 16:05, Daniel P. Berrangé wrote:
> On Wed, Oct 11, 2023 at 03:46:59PM +0200, Claudio Fontana wrote:
>> In terms of our use case, we would need to trigger these migrations from virsh save, restore, managedsave / start.
>>
>> 1) Can you confirm this is still a good target?
> 
> IIRC the 'dump' command also has a codepath that can exercise
> the migrate-to-file logic too.
> 
>> It would seem right from my perspective to hook up save/restore first, and then reuse the same mechanism for managedsave / start.
> 
> All of save, restore, managedsave, start, dump end up calling
> into the same internal helper methods. So once you update these
> helpers, you essentially get all the commands converted in one
> go.

ok

> 
>> 2) Do we expect to pass filename or file descriptor from libvirt into QEMU?
>>
>>
>> As is, libvirt today generally passes an already opened file descriptor to QEMU for migrations, roughly:
>>
>> {"execute": "getfd", "arguments": {"fdname":"migrate"}} (passing the already open fd from libvirt f.e. 10)
>> {"execute": "migrate", "arguments": {"detach":true,"blk":false,"inc":false,"uri":"fd:migrate"}}'
>>
>> Do we want to change libvirt to migrate to a file: URI ? Does this have consequence for "labeling" / security sandboxing?
>>
>> Or would it be better to continue opening the fd in libvirt, writing the libvirt header, and then passing the existing open fd to QEMU, using QMP command "getfd",
>> followed by "migrate"? In this second case we would need to inform QEMU of the offset into the already open fd.
> 
> How about both :-)
> 
> The current migration 'fd' protocol technically can cope with
> any type of FD being passed on. QEMU doesn't try to interpret
> the FD type right to any significant degree.
> 
> The 'file' protocol is explicitly providing a migration transport
> supporting random access I/O to storage. As such we can specify
> the offset too.
> 
> Now the neat trick is that 'file' protocol impl uses
> qio_channel_file and this in turn uses qemu_open,
> which supports FD passing.

Interesting!

> 
> Instead of using 'getfd' though we have to use 'add-fd'.
> 
> Anyway, this lets us do FD passing as normal, while also
> letting us specify the offset.
> 
>  {"execute": "add-fd", "arguments": {"fdset-id": 1}}
>  {"execute": "migrate", "arguments": {"detach":true,"blk":false,"inc":false,"uri":"file:/dev/fdset/1,offset=124456"}}
> 
>> Internally, the QEMU multifd code just reads and writes using pread, pwrite, so there is in any case just one fd to worry about,
>> but who should own it, libvirt or QEMU?
> 
> How about both :-)

I need to familiarize myself a bit with this; there are pieces I am missing. Can you correct me here?

OPTION 1)

libvirt opens the file and has the FD, writes the header, marks the offset,
then we dup the FD in libvirt for the benefit of QEMU, optionally set the flags of the dup to "O_DIRECT" (the usual case) depending on --bypass-cache,
pass the duped FD to QEMU,
QEMU does all the pread/pwrite on it with the correct offset (since it knows it from the file:// URI optional offset parameter),
then libvirt closes the duped fd
libvirt rewrites the header using the original fd (needed to update the metadata),
libvirt closes the original fd


OPTION 2)

libvirt opens the file and has the FD, writes the header, marks the offset,
then we pass the FD to QEMU,
QEMU dups the FD and sets it as "O_DIRECT" depending on a passed parameter,
QEMU does all the pread/pwrite on it with the correct offset (since it knows it from the file:// URI optional offset parameter),
QEMU closes the duped FD,
libvirt rewrites the header using the original fd (needed to update the metadata),
libvirt closes the original fd
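In rough C (a hypothetical helper with error handling trimmed and a made-up header format, just to make sure I understand it), I picture the OPTION 1) flow on the libvirt side like this:

```c
/* Sketch of OPTION 1, libvirt side. Hypothetical helper; error handling
 * trimmed, HEADER_SIZE and the header contents are made up. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define HEADER_SIZE 4096

int save_with_passed_fd(const char *path, int bypass_cache)
{
    char header[HEADER_SIZE] = "LIBVIRT-SAVE-HEADER-IN-PROGRESS";
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* libvirt writes its header; the payload starts at HEADER_SIZE */
    if (pwrite(fd, header, HEADER_SIZE, 0) != HEADER_SIZE) {
        close(fd);
        return -1;
    }

    /* dup the fd for QEMU, optionally flipping it to O_DIRECT */
    int qemufd = dup(fd);
    if (qemufd < 0) {
        close(fd);
        return -1;
    }
    if (bypass_cache)
        fcntl(qemufd, F_SETFL, fcntl(qemufd, F_GETFL) | O_DIRECT);

    /* qemufd would now be handed over via "add-fd"; QEMU pread/pwrites
     * at HEADER_SIZE onward (the offset= from the file: URI), and once
     * it is done libvirt closes its copy of the duped fd */
    close(qemufd);

    /* dup() shares the open file description, so the O_DIRECT set above
     * is visible on the original fd too: unset it before the cached
     * header rewrite */
    if (bypass_cache)
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) & ~O_DIRECT);

    /* finalize: rewrite the header with the updated metadata */
    memcpy(header, "LIBVIRT-SAVE-HEADER-COMPLETE",
           sizeof("LIBVIRT-SAVE-HEADER-COMPLETE"));
    if (pwrite(fd, header, HEADER_SIZE, 0) != HEADER_SIZE) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

One detail I noticed while writing this down: since dup() shares the open file description, F_SETFL O_DIRECT on the dup affects the original fd as well, so libvirt cannot keep a cached fd and an O_DIRECT fd merely by dup'ing; it has to clear the flag again before finalizing the header.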


I don't remember whether the QEMU changes for the file-offset optimization are already "block friendly", i.e. whether they operate correctly regardless of whether O_DIRECT is set.
I think so; they were designed with O_DIRECT in mind.

So I would tend to see OPTION 1) as more attractive, since QEMU does not need to care about another parameter: whatever bypass-cache choice was made in libvirt is handled in libvirt.

Please correct my understanding where needed, thanks!

Claudio

> 
> Libvirt will open the file, in order to write its header.
> Then libvirt passes the open FD to QEMU, specifying the
> offset, and QEMU does its thing with vmstate, etc and
> closes the FD when it's done. libvirt's copy of the FD
> is still open, and libvirt can finalize its header and
> close the FD.
> 
>> 3) How do we deal with O_DIRECT? In the prototype we were setting the O_DIRECT on the fd from libvirt in response to the user request for --bypass-cache,
>> which is needed 99% of the time with large VMs. I think I remember that we plan to write from libvirt normally (without O_DIRECT) and then set the flag later,
>> but should libvirt or QEMU set the O_DIRECT flag? This likely depends on who owns the fd?
> 
> For O_DIRECT, the 'file' protocol should gain a new parameter
> 'bypass_cache: bool'. If this is set to 'true' then QEMU can
> set O_DIRECT on the FD it opens or receives from libvirt.
> 
> Libvirt probably just has to be careful to unset O_DIRECT
> at the end before it finalizes the header.
> 
> With regards,
> Daniel
Re: [libvirt RFCv11 00/33] multifd save restore prototype
Posted by Daniel P. Berrangé 6 months, 3 weeks ago
On Wed, Oct 11, 2023 at 04:56:12PM +0200, Claudio Fontana wrote:
> 
> On 10/11/23 16:05, Daniel P. Berrangé wrote:
> > 
> > Instead of using 'getfd' though we have to use 'add-fd'.
> > 
> > Anyway, this lets us do FD passing as normal, while also
> > letting us specify the offset.
> > 
> >  {"execute": "add-fd", "arguments": {"fdset-id":"migrate"}}
> >  {"execute": "migrate", "arguments": {"detach":true,"blk":false,"inc":false,"uri":"file:/dev/fdset/migrate,offset=124456"}}'
> > 
> >> Internally, the QEMU multifd code just reads and writes using pread, pwrite, so there is in any case just one fd to worry about,
> >> but who should own it, libvirt or QEMU?
> > 
> > How about both :-)
> 
> I need to familiarize myself a bit with this; there are pieces I am missing. Can you correct me here?
> 
> OPTION 1)
> 
> libvirt opens the file and has the FD, writes the header, marks the offset,
> then we dup the FD in libvirt for the benefit of QEMU, optionally set the flags of the dup to "O_DIRECT" (the usual case) depending on --bypass-cache,
> pass the duped FD to QEMU,
> QEMU does all the pread/pwrite on it with the correct offset (since it knows it from the file:// URI optional offset parameter),
> then libvirt closes the duped fd
> libvirt rewrites the header using the original fd (needed to update the metadata),
> libvirt closes the original fd
> 
> 
> OPTION 2)
> 
> libvirt opens the file and has the FD, writes the header, marks the offset,
> then we pass the FD to QEMU,
> QEMU dups the FD and sets it as "O_DIRECT" depending on a passed parameter,
> QEMU does all the pread/pwrite on it with the correct offset (since it knows it from the file:// URI optional offset parameter),
> QEMU closes the duped FD,
> libvirt rewrites the header using the original fd (needed to update the metadata),
> libvirt closes the original fd
> 
> 
> I don't remember whether the QEMU changes for the file-offset optimization are already "block friendly", i.e. whether they operate correctly regardless of whether O_DIRECT is set.
> I think so; they were designed with O_DIRECT in mind.

The 'file' protocol as it exists currently is not O_DIRECT
capable. It is not writing aligned buffers to aligned offsets
in the file. It is still running the regular old migration
stream format over the file, not taking advantage of it being
random access.

What's needed is the followup "fixed ram" format adaptation.
Use of that format should imply O_DIRECT, so in fact we
don't need an explicit 'bypass_cache' parameter in QAPI,
just a way to ask for the 'fixed ram' format.
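To illustrate the idea (names and layout are made up here, not the real format): with 'fixed ram', every guest page maps to a deterministic, page-aligned file offset, so aligned buffers can always be written to aligned offsets and a dirty page can simply be rewritten in place:

```c
/* Illustration only: "fixed ram" means each guest page has a
 * deterministic, page-aligned file offset. RAM_FILE_BASE and the
 * direct address->offset mapping are made up, not QEMU's layout. */
#include <stdint.h>
#include <sys/types.h>

#define RAM_FILE_BASE 0x100000ULL /* hypothetical start of the RAM section */

off_t fixed_ram_offset(uint64_t guest_addr, uint64_t pagesz)
{
    /* round the guest address down to its page boundary, then place
     * it at a fixed position relative to the RAM section base */
    return (off_t)(RAM_FILE_BASE + (guest_addr & ~(pagesz - 1)));
}
```

Since the offset depends only on the guest address, the file is genuinely random access: multiple writers can pwrite different pages concurrently, and a page dirtied again later just overwrites its old copy.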

> So I would tend to see OPTION 1) as more attractive, since QEMU does not need to care about another parameter: whatever bypass-cache choice was made in libvirt is handled in libvirt.

The 'fixed ram' format will only take care of I/O for the
main RAM blocks which are nicely aligned and can be written
to aligned file offsets. The general device vmstate I/O
probably can't be assumed to be aligned. While we could
futz around with QEMUFile so that it bounce buffers vmstate
to an aligned region and flushes it in page sized chunks
that's probably too much of a pain.

IOW, actually I think what QEMU would likely want to
do is

 1. qemu_open  -> get a FD *without* O_DIRECT set
 2. write some vmstate stuff
 3. turn on O_DIRECT
 4. write RAM in fixed locations
 5. turn off O_DIRECT
 6. write remaining vmstate
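A best-effort sketch of that sequence (illustrative names, not real QEMU code; note a filesystem may refuse O_DIRECT, e.g. tmpfs, so this falls back to cached I/O rather than failing):

```c
/* Sketch of the 6-step sequence: cached vmstate, O_DIRECT for the
 * aligned RAM section, cached again for the trailing vmstate. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static int set_direct(int fd, int on)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, on ? flags | O_DIRECT : flags & ~O_DIRECT);
}

int save_image(const char *path, off_t ram_off, size_t ram_bytes)
{
    /* 1.-2. open *without* O_DIRECT; leading vmstate is arbitrary-length */
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    const char lead[] = "vmstate-lead";
    if (pwrite(fd, lead, sizeof(lead), 0) != sizeof(lead))
        goto err;

    /* 3. turn on O_DIRECT, best-effort */
    int direct = set_direct(fd, 1) == 0;

    /* 4. RAM pages at fixed, aligned offsets from an aligned buffer */
    long pagesz = sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    if (posix_memalign(&buf, pagesz, pagesz))
        goto err;
    memset(buf, 0, pagesz);
    for (off_t off = ram_off; off < ram_off + (off_t)ram_bytes; off += pagesz) {
        if (pwrite(fd, buf, pagesz, off) != pagesz) {
            /* filesystem rejected the direct write: retry cached */
            if (direct && set_direct(fd, 0) == 0 &&
                pwrite(fd, buf, pagesz, off) == pagesz) {
                direct = 0;
                continue;
            }
            free(buf);
            goto err;
        }
    }
    free(buf);

    /* 5.-6. back to cached I/O for the remaining, unaligned vmstate */
    if (direct && set_direct(fd, 0) < 0)
        goto err;
    const char tail[] = "vmstate-tail";
    if (pwrite(fd, tail, sizeof(tail), ram_off + ram_bytes) != sizeof(tail))
        goto err;
    return close(fd);
err:
    close(fd);
    return -1;
}
```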

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt RFCv11 00/33] multifd save restore prototype
Posted by Claudio Fontana 6 months, 1 week ago
On 10/11/23 17:29, Daniel P. Berrangé wrote:
> On Wed, Oct 11, 2023 at 04:56:12PM +0200, Claudio Fontana wrote:
>>
>> On 10/11/23 16:05, Daniel P. Berrangé wrote:
>>>
>>> Instead of using 'getfd' though we have to use 'add-fd'.
>>>
>>> Anyway, this lets us do FD passing as normal, while also
>>> letting us specify the offset.
>>>
>>>  {"execute": "add-fd", "arguments": {"fdset-id":"migrate"}}
>>>  {"execute": "migrate", "arguments": {"detach":true,"blk":false,"inc":false,"uri":"file:/dev/fdset/migrate,offset=124456"}}'


Hi Daniel,

the "add-fd" part is the one that I don't understand at all.

should we actually pass an fd there, as with "getfd", already open on the savevm file?
Something in pseudocode like:

virsh qemu-monitor-command --pass-fds 10 --cmd='{"execute": "add-fd", "arguments": {"fdset-id":10}}' ?

should we use "opaque" instead of "fdset-id" if you want to actually set it to "migrate"?
And how to reference it later?

virsh qemu-monitor-command --cmd='{"execute": "migrate", "arguments": {"detach":true,"blk":false,"inc":false,"uri":"file:/dev/fdset/migrate,offset=124456"}}'

?

"opaque" does not seem to get me a reachable /dev/fdset/migrate though.

I can currently trigger the migration to the URI file:/mnt/nvme/savevm, so that part seems to work fine;
it's the file:/dev/fdset part that I am still unable to glue together.

Thanks for any idea,

Claudio


>>>
>>>> Internally, the QEMU multifd code just reads and writes using pread, pwrite, so there is in any case just one fd to worry about,
>>>> but who should own it, libvirt or QEMU?
>>>
>>> How about both :-)
>>
>> I need to familiarize myself a bit with this; there are pieces I am missing. Can you correct me here?
>>
>> OPTION 1)
>>
>> libvirt opens the file and has the FD, writes the header, marks the offset,
>> then we dup the FD in libvirt for the benefit of QEMU, optionally set the flags of the dup to "O_DIRECT" (the usual case) depending on --bypass-cache,
>> pass the duped FD to QEMU,
>> QEMU does all the pread/pwrite on it with the correct offset (since it knows it from the file:// URI optional offset parameter),
>> then libvirt closes the duped fd
>> libvirt rewrites the header using the original fd (needed to update the metadata),
>> libvirt closes the original fd
>>
>>
>> OPTION 2)
>>
>> libvirt opens the file and has the FD, writes the header, marks the offset,
>> then we pass the FD to QEMU,
>> QEMU dups the FD and sets it as "O_DIRECT" depending on a passed parameter,
>> QEMU does all the pread/pwrite on it with the correct offset (since it knows it from the file:// URI optional offset parameter),
>> QEMU closes the duped FD,
>> libvirt rewrites the header using the original fd (needed to update the metadata),
>> libvirt closes the original fd
>>
>>
>> I don't remember whether the QEMU changes for the file-offset optimization are already "block friendly", i.e. whether they operate correctly regardless of whether O_DIRECT is set.
>> I think so; they were designed with O_DIRECT in mind.
> 
> The 'file' protocol as it exists currently is not O_DIRECT
> capable. It is not writing aligned buffers to aligned offsets
> in the file. It is still running the regular old migration
> stream format over the file, not taking advantage of it being
> random access.
> 
> What's needed is the followup "fixed ram" format adaptation.
> Use of that format should imply O_DIRECT, so in fact we
> don't need an explicit 'bypass_cache' parameter in QAPI,
> just a way to ask for the 'fixed ram' format.
> 
>> So I would tend to see OPTION 1) as more attractive, since QEMU does not need to care about another parameter: whatever bypass-cache choice was made in libvirt is handled in libvirt.
> 
> The 'fixed ram' format will only take care of I/O for the
> main RAM blocks which are nicely aligned and can be written
> to aligned file offsets. The general device vmstate I/O
> probably can't be assumed to be aligned. While we could
> futz around with QEMUFile so that it bounce buffers vmstate
> to an aligned region and flushes it in page sized chunks
> that's probably too much of a pain.
> 
> IOW, actually I think what QEMU would likely want to
> do is
> 
>  1. qemu_open  -> get a FD *without* O_DIRECT set
>  2. write some vmstate stuff
>  3. turn on O_DIRECT
>  4. write RAM in fixed locations
>  5. turn off O_DIRECT
>  6. write remaining vmstate
> 
> With regards,
> Daniel
Re: [libvirt RFCv11 00/33] multifd save restore prototype
Posted by Daniel P. Berrangé 6 months, 1 week ago
On Mon, Oct 23, 2023 at 04:06:53PM +0200, Claudio Fontana wrote:
> On 10/11/23 17:29, Daniel P. Berrangé wrote:
> > On Wed, Oct 11, 2023 at 04:56:12PM +0200, Claudio Fontana wrote:
> >>
> >> On 10/11/23 16:05, Daniel P. Berrangé wrote:
> >>>
> >>> Instead of using 'getfd' though we have to use 'add-fd'.
> >>>
> >>> Anyway, this lets us do FD passing as normal, while also
> >>> letting us specify the offset.
> >>>
> >>>  {"execute": "add-fd", "arguments": {"fdset-id":"migrate"}}
> >>>  {"execute": "migrate", "arguments": {"detach":true,"blk":false,"inc":false,"uri":"file:/dev/fdset/migrate,offset=124456"}}'
> 
> 
> Hi Daniel,
> 
> the "add-fd" part is the one that I don't understand at all.
> 
> should we actually pass an fd there, as with "getfd", already open on the savevm file?
> Something in pseudocode like:
> 
> virsh qemu-monitor-command --pass-fds 10 --cmd='{"execute": "add-fd", "arguments": {"fdset-id":10}}' ?
> 
> should we use "opaque" instead of "fdset-id" if you want to actually set it to "migrate"?
> And how to reference it later?
> 
> virsh qemu-monitor-command --cmd='{"execute": "migrate", "arguments": {"detach":true,"blk":false,"inc":false,"uri":"file:/dev/fdset/migrate,offset=124456"}}'
> 
> ?
> 
> "opaque" does not seem to get me a reachable /dev/fdset/migrate though.
> 
> I can currently trigger the migration to the URI file:/mnt/nvme/savevm, so that part seems to work fine;
> it's the file:/dev/fdset part that I am still unable to glue together.

Sorry, I was misremembering the fdset usage.  Fdsets have to be
referenced via a numeric ID value, not a named set. The numeric IDs
don't match the FD numbers either. So I think it would be:

 virsh qemu-monitor-command --pass-fds 10 --cmd='{"execute": "add-fd", "arguments": {"fdset-id":7}}'
 virsh qemu-monitor-command --cmd='{"execute": "migrate", "arguments": {"detach":true,"blk":false,"inc":false,"uri":"file:/dev/fdset/7,offset=124456"}}'
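What makes the second command work is that qemu_open() special-cases /dev/fdset/<id> paths, resolving them to a dup of a previously registered fd instead of calling open(2). A toy model of that lookup (the table and names here are illustrative, not QEMU's actual monitor fdset code):

```c
/* Toy model of /dev/fdset/<id> resolution: if the "filename" has the
 * magic prefix, dup a registered fd rather than open(2)-ing a path.
 * Illustrative only; QEMU's real fdset handling lives in the monitor. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_FDSETS 16
static int fdsets[MAX_FDSETS]; /* fdset-id -> registered fd (0 = unused here) */

static int fdset_add(int id, int fd)
{
    if (id <= 0 || id >= MAX_FDSETS || fd < 0)
        return -1;
    fdsets[id] = fd;
    return 0;
}

/* qemu_open-like resolution */
static int open_maybe_fdset(const char *path, int oflags)
{
    int id;
    if (sscanf(path, "/dev/fdset/%d", &id) == 1) {
        if (id <= 0 || id >= MAX_FDSETS || fdsets[id] <= 0)
            return -1;
        return dup(fdsets[id]); /* caller owns this duplicate */
    }
    return open(path, oflags, 0600);
}

/* register an fd as set 7, then "open" it via the magic path */
int fdset_demo(const char *backing_path)
{
    int backing = open(backing_path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (backing < 0 || fdset_add(7, backing) < 0)
        return -1;
    int fd = open_maybe_fdset("/dev/fdset/7", O_RDWR);
    if (fd < 0)
        return -1;
    /* writes through the duplicate land in the backing file */
    int ok = pwrite(fd, "hdr", 3, 0) == 3;
    close(fd);
    close(backing);
    return ok ? 0 : -1;
}
```

This is also why the numeric ID, not the passed FD number, appears in the URI: the ID names the set, and the lookup hands back a fresh duplicate of whatever fd was registered under it.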
 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|