[libvirt RFCv8 00/27] multifd save restore prototype

Claudio Fontana posted 27 patches 1 week, 1 day ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/libvirt tags/patchew/20220507134320.6289-1-cfontana@suse.de
docs/manpages/virsh.rst                       |  39 +-
include/libvirt/libvirt-domain.h              |  29 +
po/POTFILES.in                                |   1 +
src/libvirt_private.syms                      |   1 +
src/qemu/qemu_capabilities.c                  |   6 +
src/qemu/qemu_capabilities.h                  |   4 +
src/qemu/qemu_driver.c                        | 140 +++--
src/qemu/qemu_migration.c                     | 160 +++--
src/qemu/qemu_migration.h                     |  16 +-
src/qemu/qemu_migration_params.c              |  71 ++-
src/qemu/qemu_migration_params.h              |  15 +
src/qemu/qemu_process.c                       |   3 +-
src/qemu/qemu_process.h                       |   5 +-
src/qemu/qemu_saveimage.c                     | 557 ++++++++++++++----
src/qemu/qemu_saveimage.h                     |  57 +-
src/qemu/qemu_snapshot.c                      |   6 +-
src/util/meson.build                          |  16 +
src/util/multifd-helper.c                     | 249 ++++++++
src/util/virthread.c                          |   5 +
src/util/virthread.h                          |   1 +
.../caps_4.0.0.aarch64.xml                    |   1 +
.../qemucapabilitiesdata/caps_4.0.0.ppc64.xml |   1 +
.../caps_4.0.0.riscv32.xml                    |   1 +
.../caps_4.0.0.riscv64.xml                    |   1 +
.../qemucapabilitiesdata/caps_4.0.0.s390x.xml |   1 +
.../caps_4.0.0.x86_64.xml                     |   1 +
.../caps_4.1.0.x86_64.xml                     |   1 +
.../caps_4.2.0.aarch64.xml                    |   1 +
.../qemucapabilitiesdata/caps_4.2.0.ppc64.xml |   1 +
.../qemucapabilitiesdata/caps_4.2.0.s390x.xml |   1 +
.../caps_4.2.0.x86_64.xml                     |   1 +
.../caps_5.0.0.aarch64.xml                    |   2 +
.../qemucapabilitiesdata/caps_5.0.0.ppc64.xml |   2 +
.../caps_5.0.0.riscv64.xml                    |   2 +
.../caps_5.0.0.x86_64.xml                     |   2 +
.../qemucapabilitiesdata/caps_5.1.0.sparc.xml |   2 +
.../caps_5.1.0.x86_64.xml                     |   2 +
.../caps_5.2.0.aarch64.xml                    |   2 +
.../qemucapabilitiesdata/caps_5.2.0.ppc64.xml |   2 +
.../caps_5.2.0.riscv64.xml                    |   2 +
.../qemucapabilitiesdata/caps_5.2.0.s390x.xml |   2 +
.../caps_5.2.0.x86_64.xml                     |   2 +
.../caps_6.0.0.aarch64.xml                    |   2 +
.../qemucapabilitiesdata/caps_6.0.0.s390x.xml |   2 +
.../caps_6.0.0.x86_64.xml                     |   2 +
.../caps_6.1.0.x86_64.xml                     |   2 +
.../caps_6.2.0.aarch64.xml                    |   2 +
.../qemucapabilitiesdata/caps_6.2.0.ppc64.xml |   2 +
.../caps_6.2.0.x86_64.xml                     |   2 +
.../caps_7.0.0.aarch64.xml                    |   2 +
.../qemucapabilitiesdata/caps_7.0.0.ppc64.xml |   2 +
.../caps_7.0.0.x86_64.xml                     |   2 +
tools/virsh-domain.c                          | 101 +++-
53 files changed, 1254 insertions(+), 281 deletions(-)
create mode 100644 src/util/multifd-helper.c
[libvirt RFCv8 00/27] multifd save restore prototype
Posted by Claudio Fontana 1 week, 1 day ago
This is v8 of the multifd save prototype, which fixes a few bugs,
adds a few more code splits, and records the number of channels
as well as the compression algorithm, so the restore command is
more user-friendly.

It is now possible to just say:

virsh save mydomain /mnt/saves/mysave --parallel

virsh restore /mnt/saves/mysave --parallel

and things work with the default of 2 channels, no compression.

It is also possible to say of course:

virsh save mydomain /mnt/saves/mysave --parallel
      --parallel-connections 16 --parallel-compression zstd

virsh restore /mnt/saves/mysave --parallel

and things also work fine, due to channels and compression
being stored in the main save file.

---

changes from v7:

* [ base params API and iohelper refactoring upstreamed ]

* extended the QEMU save image format more, to record the nr
  of multifd channels on save. Made the data header struct packed.

* removed --parallel-connections from the restore command, as now
  it is useless due to QEMU save image format extension.

* separate out patches to expose migration_params APIs to saveimage,
  including qemuMigrationParamsSetString, SetCap, SetInt.

* fixed bugs in the ImageOpen patch (missing saveFd init), removed
  some whitespace, and fixed some convoluted code paths for return
  value -3.


Claudio Fontana (27):
  multifd-helper: new helper for parallel save/restore
  tools: prepare doSave to use parameters
  tools: prepare cmdRestore to use parameters
  libvirt: add new VIR_DOMAIN_SAVE_PARALLEL flag and parameter
  qemu: add stub support for VIR_DOMAIN_SAVE_PARALLEL in save
  qemu: add stub support for VIR_DOMAIN_SAVE_PARALLEL in restore
  qemu: saveimage: introduce virQEMUSaveFd
  qemu: saveimage: convert qemuSaveImageCreate to use virQEMUSaveFd
  qemu: saveimage: convert qemuSaveImageOpen to use virQEMUSaveFd
  qemu: saveimage: add virQEMUSaveFd APIs for multifd
  qemu: saveimage: wire up saveimage code with the multifd helper
  qemu: capabilities: add multifd to the probed migration capabilities
  qemu: saveimage: add multifd related fields to save format
  qemu: migration_params: add APIs to set Int and Cap
  qemu: migration: implement qemuMigrationSrcToFilesMultiFd
  qemu: add parameter to qemuMigrationDstRun to skip waiting
  qemu: implement qemuSaveImageLoadMultiFd
  tools: add parallel parameter to virsh save command
  tools: add parallel parameter to virsh restore command
  qemu: add migration parameter multifd-compression
  libvirt: add new VIR_SAVE_PARAM_PARALLEL_COMPRESSION
  qemu: saveimage: add parallel compression argument to ImageCreate
  qemu: saveimage: add stub support for multifd compression parameter
  qemu: migration: expose qemuMigrationParamsSetString
  qemu: saveimage: implement multifd-compression in parallel save
  qemu: saveimage: restore compressed parallel images
  tools: add parallel-compression parameter to virsh save command

 docs/manpages/virsh.rst                       |  39 +-
 include/libvirt/libvirt-domain.h              |  29 +
 po/POTFILES.in                                |   1 +
 src/libvirt_private.syms                      |   1 +
 src/qemu/qemu_capabilities.c                  |   6 +
 src/qemu/qemu_capabilities.h                  |   4 +
 src/qemu/qemu_driver.c                        | 140 +++--
 src/qemu/qemu_migration.c                     | 160 +++--
 src/qemu/qemu_migration.h                     |  16 +-
 src/qemu/qemu_migration_params.c              |  71 ++-
 src/qemu/qemu_migration_params.h              |  15 +
 src/qemu/qemu_process.c                       |   3 +-
 src/qemu/qemu_process.h                       |   5 +-
 src/qemu/qemu_saveimage.c                     | 557 ++++++++++++++----
 src/qemu/qemu_saveimage.h                     |  57 +-
 src/qemu/qemu_snapshot.c                      |   6 +-
 src/util/meson.build                          |  16 +
 src/util/multifd-helper.c                     | 249 ++++++++
 src/util/virthread.c                          |   5 +
 src/util/virthread.h                          |   1 +
 .../caps_4.0.0.aarch64.xml                    |   1 +
 .../qemucapabilitiesdata/caps_4.0.0.ppc64.xml |   1 +
 .../caps_4.0.0.riscv32.xml                    |   1 +
 .../caps_4.0.0.riscv64.xml                    |   1 +
 .../qemucapabilitiesdata/caps_4.0.0.s390x.xml |   1 +
 .../caps_4.0.0.x86_64.xml                     |   1 +
 .../caps_4.1.0.x86_64.xml                     |   1 +
 .../caps_4.2.0.aarch64.xml                    |   1 +
 .../qemucapabilitiesdata/caps_4.2.0.ppc64.xml |   1 +
 .../qemucapabilitiesdata/caps_4.2.0.s390x.xml |   1 +
 .../caps_4.2.0.x86_64.xml                     |   1 +
 .../caps_5.0.0.aarch64.xml                    |   2 +
 .../qemucapabilitiesdata/caps_5.0.0.ppc64.xml |   2 +
 .../caps_5.0.0.riscv64.xml                    |   2 +
 .../caps_5.0.0.x86_64.xml                     |   2 +
 .../qemucapabilitiesdata/caps_5.1.0.sparc.xml |   2 +
 .../caps_5.1.0.x86_64.xml                     |   2 +
 .../caps_5.2.0.aarch64.xml                    |   2 +
 .../qemucapabilitiesdata/caps_5.2.0.ppc64.xml |   2 +
 .../caps_5.2.0.riscv64.xml                    |   2 +
 .../qemucapabilitiesdata/caps_5.2.0.s390x.xml |   2 +
 .../caps_5.2.0.x86_64.xml                     |   2 +
 .../caps_6.0.0.aarch64.xml                    |   2 +
 .../qemucapabilitiesdata/caps_6.0.0.s390x.xml |   2 +
 .../caps_6.0.0.x86_64.xml                     |   2 +
 .../caps_6.1.0.x86_64.xml                     |   2 +
 .../caps_6.2.0.aarch64.xml                    |   2 +
 .../qemucapabilitiesdata/caps_6.2.0.ppc64.xml |   2 +
 .../caps_6.2.0.x86_64.xml                     |   2 +
 .../caps_7.0.0.aarch64.xml                    |   2 +
 .../qemucapabilitiesdata/caps_7.0.0.ppc64.xml |   2 +
 .../caps_7.0.0.x86_64.xml                     |   2 +
 tools/virsh-domain.c                          | 101 +++-
 53 files changed, 1254 insertions(+), 281 deletions(-)
 create mode 100644 src/util/multifd-helper.c

-- 
2.35.3
Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Daniel P. Berrangé 5 days, 16 hours ago
On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
> This is v8 of the multifd save prototype, which fixes a few bugs,
> adds a few more code splits, and records the number of channels
> as well as the compression algorithm, so the restore command is
> more user-friendly.
> 
> It is now possible to just say:
> 
> virsh save mydomain /mnt/saves/mysave --parallel
> 
> virsh restore /mnt/saves/mysave --parallel
> 
> and things work with the default of 2 channels, no compression.
> 
> It is also possible to say of course:
> 
> virsh save mydomain /mnt/saves/mysave --parallel
>       --parallel-connections 16 --parallel-compression zstd
> 
> virsh restore /mnt/saves/mysave --parallel
> 
> and things also work fine, due to channels and compression
> being stored in the main save file.

For the sake of people following along, the above commands will
result in creation of multiple files

  /mnt/saves/mysave
  /mnt/saves/mysave.0
  /mnt/saves/mysave.1
  ....
  /mnt/saves/mysave.n

Where 'n' is the number of threads used.

Overall I'm not very happy with the approach of doing any of this
on the libvirt side.

Backing up, we know that QEMU can directly save to disk faster than
libvirt can. We mitigated alot of that overhead with previous patches
to increase the pipe buffer size, but some still remains due to the
extra copies inherant in handing this off to libvirt.

Using multifd on the libvirt side, IIUC, gets us better performance
than QEMU can manage if doing non-multifd write to file directly,
but we still have the extra copies in there due to the hand off
to libvirt. If QEMU were to be directly capable to writing to
disk with multifd, it should beat us again.

As a result of how we integrate with QEMU multifd, we're taking the
approach of saving the state across multiple files, because it is
easier than trying to get multiple threads writing to the same file.
It could be solved by using file range locking on the save file.
eg a thread can reserve say 500 MB of space, fill it up, and then
reserve another 500 MB, etc, etc. It is a bit tedious though and
won't align nicely. eg a 1 GB huge page, would be 1 GB + a few
bytes of QEMU RAM ave state header.

The other downside of multiple files is that it complicates life
for both libvirt and apps using libvirt. They need to be aware of
multiple files and move them around together. This is not a simple
as it might sound. For example, IIRC OpenStack would upload a save
image state into a glance bucket for later use. Well, now it needs
multiple distinct buckets and keep track of them all. It also means
we're forced to use the same concurrency level when restoring, which
is not neccessarily desirable if the host environment is different
when restoring. ie The original host might have had 8 CPUs, but the
new host might only have 4 available, or vica-verca.


I know it is appealing to do something on the libvirt side, because
it is quicker than getting an enhancement into new QEMU release. We
have been down this route before with the migration support in libvirt
in the past though, when we introduced the tunnelled live migration
in order to workaround QEMU's inability to do TLS encryption. I very
much regret that we ever did this, because tunnelled migration was
inherantly limited, so for example failed to work with multifd,
and failed to work with NBD based disk migration. In the end I did
what I should have done at the beginning and just added TLS support
to QEMU, making tunnelled migration obsolete, except we still have
to carry the code around in libvirt indefinitely due to apps using
it.

So I'm very concerned about not having history repeat itself and
give us a long term burden for  a solution that turns out to be a
evolutionary dead end.

I like the idea of parallel saving, but I really think we need to
implement this directly in QEMU, not libvirt. As previously
mentioned I think QEMU needs to get a 'file' migration protocol,
along with ability to directly map RAM  segments into fixed
positions in the file. The benefits are many

 - It will save & restore faster because we're eliminating data
   copies that libvirt imposes via the iohelper
 
 - It is simple for libvirt & mgmt apps as we still only
   have one file to manage

 - It is space efficient because if a guest dirties a
   memory page, we just overwrite the existing contents
   at the fixed location in the file, instead of appending
   new contents to the file

 - It will restore faster too because we only restore
   each memory page once, due to always overwriting the
   file in-place when the guest dirtied a page during save

 - It can save and restore with differing number of threads,
   and can even dynamically change the number of threads
   in the middle of the save/restore operation 

As David G has pointed out the impl is not trivial on the QEMU
side, but from what I understand of the migration code, it is
certainly viable. Most importantly I think it puts us in a
better position for long term feature enhancements later by
taking the middle man (libvirt) out of the equation, letting
QEMU directly know what medium it is saving/restoring to/from.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Christophe Marie Francois Dupont de Dinechin 5 days, 2 hours ago

> On 10 May 2022, at 20:38, Daniel P. Berrangé <berrange@redhat.com> wrote:
> 
> On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
>> This is v8 of the multifd save prototype, which fixes a few bugs,
>> adds a few more code splits, and records the number of channels
>> as well as the compression algorithm, so the restore command is
>> more user-friendly.
>> 
>> It is now possible to just say:
>> 
>> virsh save mydomain /mnt/saves/mysave --parallel
>> 
>> virsh restore /mnt/saves/mysave --parallel
>> 
>> and things work with the default of 2 channels, no compression.
>> 
>> It is also possible to say of course:
>> 
>> virsh save mydomain /mnt/saves/mysave --parallel
>>      --parallel-connections 16 --parallel-compression zstd
>> 
>> virsh restore /mnt/saves/mysave --parallel
>> 
>> and things also work fine, due to channels and compression
>> being stored in the main save file.
> 
> For the sake of people following along, the above commands will
> result in creation of multiple files
> 
>  /mnt/saves/mysave
>  /mnt/saves/mysave.0
>  /mnt/saves/mysave.1
>  ....
>  /mnt/saves/mysave.n
> 
> Where 'n' is the number of threads used.
> 
> Overall I'm not very happy with the approach of doing any of this
> on the libvirt side.
> 
> Backing up, we know that QEMU can directly save to disk faster than
> libvirt can. We mitigated alot of that overhead with previous patches
> to increase the pipe buffer size, but some still remains due to the
> extra copies inherant in handing this off to libvirt.
> 
> Using multifd on the libvirt side, IIUC, gets us better performance
> than QEMU can manage if doing non-multifd write to file directly,
> but we still have the extra copies in there due to the hand off
> to libvirt. If QEMU were to be directly capable to writing to
> disk with multifd, it should beat us again.
> 
> As a result of how we integrate with QEMU multifd, we're taking the
> approach of saving the state across multiple files, because it is
> easier than trying to get multiple threads writing to the same file.
> It could be solved by using file range locking on the save file.
> eg a thread can reserve say 500 MB of space, fill it up, and then
> reserve another 500 MB, etc, etc. It is a bit tedious though and
> won't align nicely. eg a 1 GB huge page, would be 1 GB + a few
> bytes of QEMU RAM ave state header.

First, I do not understand why you would write things that are
not page-aligned to start with? (As an aside, I don’t know
how any dirty tracking would work if you do not keep things
page-aligned).

Could uffd_register_memory accept a memory range that is
not aligned? If so, when? Should that be specified in the
interface?

Second, instead of creating multiple files, why not write blocks
at a location determined by an variable that you increment using
atomic operations each time you need a new block? If you want to
keep the blocks page-aligned in the file as well (which might help
if you want to mmap the file at some point), then you need to
build a map of the blocks that you tack at the end of the file.

There may be good reasons not to do it that way, of course, but
I am not familiar enough with the problem to know them.

> 
> The other downside of multiple files is that it complicates life
> for both libvirt and apps using libvirt. They need to be aware of
> multiple files and move them around together. This is not a simple
> as it might sound. For example, IIRC OpenStack would upload a save
> image state into a glance bucket for later use. Well, now it needs
> multiple distinct buckets and keep track of them all. It also means
> we're forced to use the same concurrency level when restoring, which
> is not neccessarily desirable if the host environment is different
> when restoring. ie The original host might have had 8 CPUs, but the
> new host might only have 4 available, or vica-verca.
> 
> 
> I know it is appealing to do something on the libvirt side, because
> it is quicker than getting an enhancement into new QEMU release. We
> have been down this route before with the migration support in libvirt
> in the past though, when we introduced the tunnelled live migration
> in order to workaround QEMU's inability to do TLS encryption. I very
> much regret that we ever did this, because tunnelled migration was
> inherantly limited, so for example failed to work with multifd,
> and failed to work with NBD based disk migration. In the end I did
> what I should have done at the beginning and just added TLS support
> to QEMU, making tunnelled migration obsolete, except we still have
> to carry the code around in libvirt indefinitely due to apps using
> it.
> 
> So I'm very concerned about not having history repeat itself and
> give us a long term burden for  a solution that turns out to be a
> evolutionary dead end.
> 
> I like the idea of parallel saving, but I really think we need to
> implement this directly in QEMU, not libvirt. As previously
> mentioned I think QEMU needs to get a 'file' migration protocol,
> along with ability to directly map RAM  segments into fixed
> positions in the file. The benefits are many
> 
> - It will save & restore faster because we're eliminating data
>   copies that libvirt imposes via the iohelper
> 
> - It is simple for libvirt & mgmt apps as we still only
>   have one file to manage
> 
> - It is space efficient because if a guest dirties a
>   memory page, we just overwrite the existing contents
>   at the fixed location in the file, instead of appending
>   new contents to the file
> 
> - It will restore faster too because we only restore
>   each memory page once, due to always overwriting the
>   file in-place when the guest dirtied a page during save
> 
> - It can save and restore with differing number of threads,
>   and can even dynamically change the number of threads
>   in the middle of the save/restore operation 
> 
> As David G has pointed out the impl is not trivial on the QEMU
> side, but from what I understand of the migration code, it is
> certainly viable. Most importantly I think it puts us in a
> better position for long term feature enhancements later by
> taking the middle man (libvirt) out of the equation, letting
> QEMU directly know what medium it is saving/restoring to/from.
> 
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Dr. David Alan Gilbert 3 days, 17 hours ago
* Christophe Marie Francois Dupont de Dinechin (cdupontd@redhat.com) wrote:
> 
> 
> > On 10 May 2022, at 20:38, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > 
> > On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
> >> This is v8 of the multifd save prototype, which fixes a few bugs,
> >> adds a few more code splits, and records the number of channels
> >> as well as the compression algorithm, so the restore command is
> >> more user-friendly.
> >> 
> >> It is now possible to just say:
> >> 
> >> virsh save mydomain /mnt/saves/mysave --parallel
> >> 
> >> virsh restore /mnt/saves/mysave --parallel
> >> 
> >> and things work with the default of 2 channels, no compression.
> >> 
> >> It is also possible to say of course:
> >> 
> >> virsh save mydomain /mnt/saves/mysave --parallel
> >>      --parallel-connections 16 --parallel-compression zstd
> >> 
> >> virsh restore /mnt/saves/mysave --parallel
> >> 
> >> and things also work fine, due to channels and compression
> >> being stored in the main save file.
> > 
> > For the sake of people following along, the above commands will
> > result in creation of multiple files
> > 
> >  /mnt/saves/mysave
> >  /mnt/saves/mysave.0
> >  /mnt/saves/mysave.1
> >  ....
> >  /mnt/saves/mysave.n
> > 
> > Where 'n' is the number of threads used.
> > 
> > Overall I'm not very happy with the approach of doing any of this
> > on the libvirt side.
> > 
> > Backing up, we know that QEMU can directly save to disk faster than
> > libvirt can. We mitigated alot of that overhead with previous patches
> > to increase the pipe buffer size, but some still remains due to the
> > extra copies inherant in handing this off to libvirt.
> > 
> > Using multifd on the libvirt side, IIUC, gets us better performance
> > than QEMU can manage if doing non-multifd write to file directly,
> > but we still have the extra copies in there due to the hand off
> > to libvirt. If QEMU were to be directly capable to writing to
> > disk with multifd, it should beat us again.
> > 
> > As a result of how we integrate with QEMU multifd, we're taking the
> > approach of saving the state across multiple files, because it is
> > easier than trying to get multiple threads writing to the same file.
> > It could be solved by using file range locking on the save file.
> > eg a thread can reserve say 500 MB of space, fill it up, and then
> > reserve another 500 MB, etc, etc. It is a bit tedious though and
> > won't align nicely. eg a 1 GB huge page, would be 1 GB + a few
> > bytes of QEMU RAM ave state header.
> 
> First, I do not understand why you would write things that are
> not page-aligned to start with? (As an aside, I don’t know
> how any dirty tracking would work if you do not keep things
> page-aligned).
> 
> Could uffd_register_memory accept a memory range that is
> not aligned? If so, when? Should that be specified in the
> interface?

This isn't about alignment in RAM, this is about alignment in the file.
We always write chunks of RAM that are aligned but where they end up in
the file is typically not aligned because the file is:
<header> <pages> <header> <pages>

especially in the non-multifd world.

> Second, instead of creating multiple files, why not write blocks
> at a location determined by an variable that you increment using
> atomic operations each time you need a new block? If you want to
> keep the blocks page-aligned in the file as well (which might help
> if you want to mmap the file at some point), then you need to
> build a map of the blocks that you tack at the end of the file.

If you're going to make it that complicated you may as well go back
to the old thing of writing into a qcow2.

Dave

> There may be good reasons not to do it that way, of course, but
> I am not familiar enough with the problem to know them.
> 
> > 
> > The other downside of multiple files is that it complicates life
> > for both libvirt and apps using libvirt. They need to be aware of
> > multiple files and move them around together. This is not a simple
> > as it might sound. For example, IIRC OpenStack would upload a save
> > image state into a glance bucket for later use. Well, now it needs
> > multiple distinct buckets and keep track of them all. It also means
> > we're forced to use the same concurrency level when restoring, which
> > is not neccessarily desirable if the host environment is different
> > when restoring. ie The original host might have had 8 CPUs, but the
> > new host might only have 4 available, or vica-verca.
> > 
> > 
> > I know it is appealing to do something on the libvirt side, because
> > it is quicker than getting an enhancement into new QEMU release. We
> > have been down this route before with the migration support in libvirt
> > in the past though, when we introduced the tunnelled live migration
> > in order to workaround QEMU's inability to do TLS encryption. I very
> > much regret that we ever did this, because tunnelled migration was
> > inherantly limited, so for example failed to work with multifd,
> > and failed to work with NBD based disk migration. In the end I did
> > what I should have done at the beginning and just added TLS support
> > to QEMU, making tunnelled migration obsolete, except we still have
> > to carry the code around in libvirt indefinitely due to apps using
> > it.
> > 
> > So I'm very concerned about not having history repeat itself and
> > give us a long term burden for  a solution that turns out to be a
> > evolutionary dead end.
> > 
> > I like the idea of parallel saving, but I really think we need to
> > implement this directly in QEMU, not libvirt. As previously
> > mentioned I think QEMU needs to get a 'file' migration protocol,
> > along with ability to directly map RAM  segments into fixed
> > positions in the file. The benefits are many
> > 
> > - It will save & restore faster because we're eliminating data
> >   copies that libvirt imposes via the iohelper
> > 
> > - It is simple for libvirt & mgmt apps as we still only
> >   have one file to manage
> > 
> > - It is space efficient because if a guest dirties a
> >   memory page, we just overwrite the existing contents
> >   at the fixed location in the file, instead of appending
> >   new contents to the file
> > 
> > - It will restore faster too because we only restore
> >   each memory page once, due to always overwriting the
> >   file in-place when the guest dirtied a page during save
> > 
> > - It can save and restore with differing number of threads,
> >   and can even dynamically change the number of threads
> >   in the middle of the save/restore operation 
> > 
> > As David G has pointed out the impl is not trivial on the QEMU
> > side, but from what I understand of the migration code, it is
> > certainly viable. Most importantly I think it puts us in a
> > better position for long term feature enhancements later by
> > taking the middle man (libvirt) out of the equation, letting
> > QEMU directly know what medium it is saving/restoring to/from.
> > 
> > 
> > With regards,
> > Daniel
> > -- 
> > |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> > |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> > |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> > 
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Claudio Fontana 4 days, 23 hours ago
On 5/11/22 10:27 AM, Christophe Marie Francois Dupont de Dinechin wrote:
> 
> 
>> On 10 May 2022, at 20:38, Daniel P. Berrangé <berrange@redhat.com> wrote:
>>
>> On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
>>> This is v8 of the multifd save prototype, which fixes a few bugs,
>>> adds a few more code splits, and records the number of channels
>>> as well as the compression algorithm, so the restore command is
>>> more user-friendly.
>>>
>>> It is now possible to just say:
>>>
>>> virsh save mydomain /mnt/saves/mysave --parallel
>>>
>>> virsh restore /mnt/saves/mysave --parallel
>>>
>>> and things work with the default of 2 channels, no compression.
>>>
>>> It is also possible to say of course:
>>>
>>> virsh save mydomain /mnt/saves/mysave --parallel
>>>      --parallel-connections 16 --parallel-compression zstd
>>>
>>> virsh restore /mnt/saves/mysave --parallel
>>>
>>> and things also work fine, due to channels and compression
>>> being stored in the main save file.
>>
>> For the sake of people following along, the above commands will
>> result in creation of multiple files
>>
>>  /mnt/saves/mysave
>>  /mnt/saves/mysave.0
>>  /mnt/saves/mysave.1
>>  ....
>>  /mnt/saves/mysave.n
>>
>> Where 'n' is the number of threads used.
>>
>> Overall I'm not very happy with the approach of doing any of this
>> on the libvirt side.
>>
>> Backing up, we know that QEMU can directly save to disk faster than
>> libvirt can. We mitigated alot of that overhead with previous patches
>> to increase the pipe buffer size, but some still remains due to the
>> extra copies inherant in handing this off to libvirt.
>>
>> Using multifd on the libvirt side, IIUC, gets us better performance
>> than QEMU can manage if doing non-multifd write to file directly,
>> but we still have the extra copies in there due to the hand off
>> to libvirt. If QEMU were to be directly capable to writing to
>> disk with multifd, it should beat us again.
>>
>> As a result of how we integrate with QEMU multifd, we're taking the
>> approach of saving the state across multiple files, because it is
>> easier than trying to get multiple threads writing to the same file.
>> It could be solved by using file range locking on the save file.
>> eg a thread can reserve say 500 MB of space, fill it up, and then
>> reserve another 500 MB, etc, etc. It is a bit tedious though and
>> won't align nicely. eg a 1 GB huge page, would be 1 GB + a few
>> bytes of QEMU RAM ave state header.


I am not familiar enough to know if this approach would work with multifd without breaking
the existing format, maybe David could answer this.

> 
> First, I do not understand why you would write things that are
> not page-aligned to start with? (As an aside, I don’t know
> how any dirty tracking would work if you do not keep things
> page-aligned).

Yes, alignment is one issue I encountered, and that in my view would _still_ need to be solved,
and that is _whatever_ we put inside QEMU in the future,
as it breaks also any attempt to be more efficient (using alternative APIs to read/write etc),

and is the reason why iohelper is still needed in my patchset at all for the main file, causing one extra copy for the main channel.

The libvirt header, including metadata, domain xml etc, that wraps the QEMU VM ends at an arbitrary address, f.e:

00000000: 4c69 6276 6972 7451 656d 7564 5361 7665  LibvirtQemudSave
00000010: 0300 0000 5b13 0100 0100 0000 0000 0000  ....[...........
00000020: 3613 0000 0200 0000 0000 0000 0000 0000  6...............
00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 3c64 6f6d  ............<dom
00000060: 6169 6e20 7479 7065 3d27 6b76 6d27 3e0a  ain type='kvm'>.



000113a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000113b0: 0000 0000 0000 0051 4556 4d00 0000 0307  .......QEVM.....
000113c0: 0000 000d 7063 2d69 3434 3066 782d 362e  ....pc-i440fx-6.
000113d0: 3201 0000 0003 0372 616d 0000 0000 0000  2......ram......
000113e0: 0004 0000 0008 c00c 2004 0670 632e 7261  ........ ..pc.ra
000113f0: 6d00 0000 08c0 0000 0014 2f72 6f6d 4065  m........./rom@e
00011400: 7463 2f61 6370 692f 7461 626c 6573 0000  tc/acpi/tables..
00011410: 0000 0002 0000 0770 632e 6269 6f73 0000  .......pc.bios..
00011420: 0000 0004 0000 1f30 3030 303a 3030 3a30  .......0000:00:0
00011430: 322e 302f 7669 7274 696f 2d6e 6574 2d70  2.0/virtio-net-p
00011440: 6369 2e72 6f6d 0000 0000 0004 0000 0670  ci.rom.........p
00011450: 632e 726f 6d00 0000 0000 0200 0015 2f72  c.rom........./r
00011460: 6f6d 4065 7463 2f74 6162 6c65 2d6c 6f61  om@etc/table-loa
00011470: 6465 7200 0000 0000 0010 0012 2f72 6f6d  der........./rom
00011480: 4065 7463 2f61 6370 692f 7273 6470 0000  @etc/acpi/rsdp..
00011490: 0000 0000 1000 0000 0000 0000 0010 7e00  ..............~.
000114a0: 0000 0302 0000 0003 0000 0000 0000 2002  .............. .
000114b0: 0670 632e 7261 6d00 0000 0000 0000 3022  .pc.ram.......0"


in my view at the minimum we have to start by adding enough padding before starting the QEMU VM (QEVM magic)
to be at a page-aligned address.

I would add one patch to this effect to my prototype, as this should not be very controversial I think.

Regarding migrating the channels to a single file, with the suggestion of Daniel or some other method,
the obvious comment from me is if we have some way to know in advance the size of each channel that would be feasible,
but especially considering compression that seems pretty hard to know beforehand, so some trick is needed.


> 
> Could uffd_register_memory accept a memory range that is
> not aligned? If so, when? Should that be specified in the
> interface?
> 
> Second, instead of creating multiple files, why not write blocks
> at a location determined by an variable that you increment using
> atomic operations each time you need a new block? If you want to
> keep the blocks page-aligned in the file as well (which might help
> if you want to mmap the file at some point), then you need to
> build a map of the blocks that you tack at the end of the file.


Just wanted to throw the simplest idea in the basket,
where we could interleave the file with each channel writing, for example, 4 MB at a time at
a channel-specific offset, and again, starting at a nicely aligned address.

4MB being just an example here, it would need to be determined by the best balance considering nvme architecture and performance,
and could even be a parameter. Storage gurus could advise on this part.

The biggest issue there would be that the main channel does not have the same requirements like the other channels,
so likely we would waste space reserving for the main channel, as the "main" channel is much, much smaller than the others.

(Dave, could the size of the main channel be determined before the transfer roughly? Based on guest ram size?)

So the layout for a --parallel --parallel-connections 2 --parallel-interleave 4MB save _could_ be something like:

libvirt header 0 to 4MB-aligned address (most likely 0 to 4MB)
main channel ( 4MB to 20MB)
channel 0    (20MB to 24MB)
channel 1    (24MB to 28MB)
channel 0    (28MB to 32MB)
channel 1    (32MB to 36MB)
...

for example. The multifd helper could do this and feed the channels properly during save and restore...

Thanks,

CLaudio

> 
> There may be good reasons not to do it that way, of course, but
> I am not familiar enough with the problem to know them.
> 
>>
>> The other downside of multiple files is that it complicates life
>> for both libvirt and apps using libvirt. They need to be aware of
>> multiple files and move them around together. This is not a simple
>> as it might sound. For example, IIRC OpenStack would upload a save
>> image state into a glance bucket for later use. Well, now it needs
>> multiple distinct buckets and keep track of them all. It also means
>> we're forced to use the same concurrency level when restoring, which
>> is not neccessarily desirable if the host environment is different
>> when restoring. ie The original host might have had 8 CPUs, but the
>> new host might only have 4 available, or vica-verca.
>>
>>
>> I know it is appealing to do something on the libvirt side, because
>> it is quicker than getting an enhancement into new QEMU release. We
>> have been down this route before with the migration support in libvirt
>> in the past though, when we introduced the tunnelled live migration
>> in order to workaround QEMU's inability to do TLS encryption. I very
>> much regret that we ever did this, because tunnelled migration was
>> inherantly limited, so for example failed to work with multifd,
>> and failed to work with NBD based disk migration. In the end I did
>> what I should have done at the beginning and just added TLS support
>> to QEMU, making tunnelled migration obsolete, except we still have
>> to carry the code around in libvirt indefinitely due to apps using
>> it.
>>
>> So I'm very concerned about not having history repeat itself and
>> give us a long term burden for  a solution that turns out to be a
>> evolutionary dead end.
>>
>> I like the idea of parallel saving, but I really think we need to
>> implement this directly in QEMU, not libvirt. As previously
>> mentioned I think QEMU needs to get a 'file' migration protocol,
>> along with ability to directly map RAM  segments into fixed
>> positions in the file. The benefits are many
>>
>> - It will save & restore faster because we're eliminating data
>>   copies that libvirt imposes via the iohelper
>>
>> - It is simple for libvirt & mgmt apps as we still only
>>   have one file to manage
>>
>> - It is space efficient because if a guest dirties a
>>   memory page, we just overwrite the existing contents
>>   at the fixed location in the file, instead of appending
>>   new contents to the file
>>
>> - It will restore faster too because we only restore
>>   each memory page once, due to always overwriting the
>>   file in-place when the guest dirtied a page during save
>>
>> - It can save and restore with differing number of threads,
>>   and can even dynamically change the number of threads
>>   in the middle of the save/restore operation 
>>
>> As David G has pointed out the impl is not trivial on the QEMU
>> side, but from what I understand of the migration code, it is
>> certainly viable. Most importantly I think it puts us in a
>> better position for long term feature enhancements later by
>> taking the middle man (libvirt) out of the equation, letting
>> QEMU directly know what medium it is saving/restoring to/from.
>>
>>
>> With regards,
>> Daniel
>> -- 
>> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
>> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
>> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>>
> 
Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Daniel P. Berrangé 4 days, 22 hours ago
On Wed, May 11, 2022 at 01:47:13PM +0200, Claudio Fontana wrote:
> On 5/11/22 10:27 AM, Christophe Marie Francois Dupont de Dinechin wrote:
> > 
> > 
> >> On 10 May 2022, at 20:38, Daniel P. Berrangé <berrange@redhat.com> wrote:
> >>
> >> On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
> >>> This is v8 of the multifd save prototype, which fixes a few bugs,
> >>> adds a few more code splits, and records the number of channels
> >>> as well as the compression algorithm, so the restore command is
> >>> more user-friendly.
> >>>
> >>> It is now possible to just say:
> >>>
> >>> virsh save mydomain /mnt/saves/mysave --parallel
> >>>
> >>> virsh restore /mnt/saves/mysave --parallel
> >>>
> >>> and things work with the default of 2 channels, no compression.
> >>>
> >>> It is also possible to say of course:
> >>>
> >>> virsh save mydomain /mnt/saves/mysave --parallel
> >>>      --parallel-connections 16 --parallel-compression zstd
> >>>
> >>> virsh restore /mnt/saves/mysave --parallel
> >>>
> >>> and things also work fine, due to channels and compression
> >>> being stored in the main save file.
> >>
> >> For the sake of people following along, the above commands will
> >> result in creation of multiple files
> >>
> >>  /mnt/saves/mysave
> >>  /mnt/saves/mysave.0
> >>  /mnt/saves/mysave.1
> >>  ....
> >>  /mnt/saves/mysave.n
> >>
> >> Where 'n' is the number of threads used.
> >>
> >> Overall I'm not very happy with the approach of doing any of this
> >> on the libvirt side.
> >>
> >> Backing up, we know that QEMU can directly save to disk faster than
> >> libvirt can. We mitigated alot of that overhead with previous patches
> >> to increase the pipe buffer size, but some still remains due to the
> >> extra copies inherant in handing this off to libvirt.
> >>
> >> Using multifd on the libvirt side, IIUC, gets us better performance
> >> than QEMU can manage if doing non-multifd write to file directly,
> >> but we still have the extra copies in there due to the hand off
> >> to libvirt. If QEMU were to be directly capable to writing to
> >> disk with multifd, it should beat us again.
> >>
> >> As a result of how we integrate with QEMU multifd, we're taking the
> >> approach of saving the state across multiple files, because it is
> >> easier than trying to get multiple threads writing to the same file.
> >> It could be solved by using file range locking on the save file.
> >> eg a thread can reserve say 500 MB of space, fill it up, and then
> >> reserve another 500 MB, etc, etc. It is a bit tedious though and
> >> won't align nicely. eg a 1 GB huge page, would be 1 GB + a few
> >> bytes of QEMU RAM ave state header.
> 
> 
> I am not familiar enough to know if this approach would work with multifd without breaking
> the existing format, maybe David could answer this.
> 
> > 
> > First, I do not understand why you would write things that are
> > not page-aligned to start with? (As an aside, I don’t know
> > how any dirty tracking would work if you do not keep things
> > page-aligned).
> 
> Yes, alignment is one issue I encountered, and that in my view would _still_ need to be solved,
> and that is _whatever_ we put inside QEMU in the future,
> as it breaks also any attempt to be more efficient (using alternative APIs to read/write etc),
> 
> and is the reason why iohelper is still needed in my patchset at all for the main file, causing one extra copy for the main channel.
> 
> The libvirt header, including metadata, domain xml etc, that wraps the QEMU VM ends at an arbitrary address, f.e:
> 
> 00000000: 4c69 6276 6972 7451 656d 7564 5361 7665  LibvirtQemudSave
> 00000010: 0300 0000 5b13 0100 0100 0000 0000 0000  ....[...........
> 00000020: 3613 0000 0200 0000 0000 0000 0000 0000  6...............
> 00000030: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 00000050: 0000 0000 0000 0000 0000 0000 3c64 6f6d  ............<dom
> 00000060: 6169 6e20 7479 7065 3d27 6b76 6d27 3e0a  ain type='kvm'>.
> 
> 
> 
> 000113a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
> 000113b0: 0000 0000 0000 0051 4556 4d00 0000 0307  .......QEVM.....
> 000113c0: 0000 000d 7063 2d69 3434 3066 782d 362e  ....pc-i440fx-6.
> 000113d0: 3201 0000 0003 0372 616d 0000 0000 0000  2......ram......
> 000113e0: 0004 0000 0008 c00c 2004 0670 632e 7261  ........ ..pc.ra
> 000113f0: 6d00 0000 08c0 0000 0014 2f72 6f6d 4065  m........./rom@e
> 00011400: 7463 2f61 6370 692f 7461 626c 6573 0000  tc/acpi/tables..
> 00011410: 0000 0002 0000 0770 632e 6269 6f73 0000  .......pc.bios..
> 00011420: 0000 0004 0000 1f30 3030 303a 3030 3a30  .......0000:00:0
> 00011430: 322e 302f 7669 7274 696f 2d6e 6574 2d70  2.0/virtio-net-p
> 00011440: 6369 2e72 6f6d 0000 0000 0004 0000 0670  ci.rom.........p
> 00011450: 632e 726f 6d00 0000 0000 0200 0015 2f72  c.rom........./r
> 00011460: 6f6d 4065 7463 2f74 6162 6c65 2d6c 6f61  om@etc/table-loa
> 00011470: 6465 7200 0000 0000 0010 0012 2f72 6f6d  der........./rom
> 00011480: 4065 7463 2f61 6370 692f 7273 6470 0000  @etc/acpi/rsdp..
> 00011490: 0000 0000 1000 0000 0000 0000 0010 7e00  ..............~.
> 000114a0: 0000 0302 0000 0003 0000 0000 0000 2002  .............. .
> 000114b0: 0670 632e 7261 6d00 0000 0000 0000 3022  .pc.ram.......0"
> 
> 
> in my view at the minimum we have to start by adding enough padding before starting the QEMU VM (QEVM magic)
> to be at a page-aligned address.
> 
> I would add one patch to this effect to my prototype, as this should not be very controversial I think.

We already add padding before the QEMU migration stream begins, but 
we're just doing a fixed 64kb. The intent was to allow us to edit
the embedded XML. It could easily round this upto to a sensible
boundary if needed.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Daniel P. Berrangé 5 days, 1 hour ago
On Wed, May 11, 2022 at 10:27:30AM +0200, Christophe Marie Francois Dupont de Dinechin wrote:
> 
> 
> > On 10 May 2022, at 20:38, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > 
> > On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
> >> This is v8 of the multifd save prototype, which fixes a few bugs,
> >> adds a few more code splits, and records the number of channels
> >> as well as the compression algorithm, so the restore command is
> >> more user-friendly.
> >> 
> >> It is now possible to just say:
> >> 
> >> virsh save mydomain /mnt/saves/mysave --parallel
> >> 
> >> virsh restore /mnt/saves/mysave --parallel
> >> 
> >> and things work with the default of 2 channels, no compression.
> >> 
> >> It is also possible to say of course:
> >> 
> >> virsh save mydomain /mnt/saves/mysave --parallel
> >>      --parallel-connections 16 --parallel-compression zstd
> >> 
> >> virsh restore /mnt/saves/mysave --parallel
> >> 
> >> and things also work fine, due to channels and compression
> >> being stored in the main save file.
> > 
> > For the sake of people following along, the above commands will
> > result in creation of multiple files
> > 
> >  /mnt/saves/mysave
> >  /mnt/saves/mysave.0
> >  /mnt/saves/mysave.1
> >  ....
> >  /mnt/saves/mysave.n
> > 
> > Where 'n' is the number of threads used.
> > 
> > Overall I'm not very happy with the approach of doing any of this
> > on the libvirt side.
> > 
> > Backing up, we know that QEMU can directly save to disk faster than
> > libvirt can. We mitigated alot of that overhead with previous patches
> > to increase the pipe buffer size, but some still remains due to the
> > extra copies inherant in handing this off to libvirt.
> > 
> > Using multifd on the libvirt side, IIUC, gets us better performance
> > than QEMU can manage if doing non-multifd write to file directly,
> > but we still have the extra copies in there due to the hand off
> > to libvirt. If QEMU were to be directly capable to writing to
> > disk with multifd, it should beat us again.
> > 
> > As a result of how we integrate with QEMU multifd, we're taking the
> > approach of saving the state across multiple files, because it is
> > easier than trying to get multiple threads writing to the same file.
> > It could be solved by using file range locking on the save file.
> > eg a thread can reserve say 500 MB of space, fill it up, and then
> > reserve another 500 MB, etc, etc. It is a bit tedious though and
> > won't align nicely. eg a 1 GB huge page, would be 1 GB + a few
> > bytes of QEMU RAM ave state header.
> 
> First, I do not understand why you would write things that are
> not page-aligned to start with? (As an aside, I don’t know
> how any dirty tracking would work if you do not keep things
> page-aligned).
> 
> Could uffd_register_memory accept a memory range that is
> not aligned? If so, when? Should that be specified in the
> interface?
> 
> Second, instead of creating multiple files, why not write blocks
> at a location determined by an variable that you increment using
> atomic operations each time you need a new block? If you want to
> keep the blocks page-aligned in the file as well (which might help
> if you want to mmap the file at some point), then you need to
> build a map of the blocks that you tack at the end of the file.
> 
> There may be good reasons not to do it that way, of course, but
> I am not familiar enough with the problem to know them.

This is all because QEMU is not actually writing to a file. From
QEMU's POV it just thinks it is migrating to another QEMU via
a pipe. So the questions of page alignment, position are irrelevant
to QEMU's needs - it just has a stream.

Libvirt is just capturing this raw migration stream and writing
its contents out to a file. The contents of the stream are completely
opaque to libvirt, and we don't want be be unpacking this stream to
do anything more clever. It is better to invest it making QEMU
know that it is writing to a file directly.


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Claudio Fontana 5 days, 3 hours ago
Hi Daniel,

thanks for looking at this,

On 5/10/22 8:38 PM, Daniel P. Berrangé wrote:
> On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
>> This is v8 of the multifd save prototype, which fixes a few bugs,
>> adds a few more code splits, and records the number of channels
>> as well as the compression algorithm, so the restore command is
>> more user-friendly.
>>
>> It is now possible to just say:
>>
>> virsh save mydomain /mnt/saves/mysave --parallel
>>
>> virsh restore /mnt/saves/mysave --parallel
>>
>> and things work with the default of 2 channels, no compression.
>>
>> It is also possible to say of course:
>>
>> virsh save mydomain /mnt/saves/mysave --parallel
>>       --parallel-connections 16 --parallel-compression zstd
>>
>> virsh restore /mnt/saves/mysave --parallel
>>
>> and things also work fine, due to channels and compression
>> being stored in the main save file.
> 
> For the sake of people following along, the above commands will
> result in creation of multiple files
> 
>   /mnt/saves/mysave
>   /mnt/saves/mysave.0

just minor correction, there is no .0

>   /mnt/saves/mysave.1
>   ....
>   /mnt/saves/mysave.n
> 
> Where 'n' is the number of threads used.
> 
> Overall I'm not very happy with the approach of doing any of this
> on the libvirt side.


Ok I understand your concern.

> 
> Backing up, we know that QEMU can directly save to disk faster than
> libvirt can. We mitigated alot of that overhead with previous patches
> to increase the pipe buffer size, but some still remains due to the
> extra copies inherant in handing this off to libvirt.

Right;
still the performance we get is insufficient for the use case we are trying to address,
even without libvirt in the picture.

Instead, with parallel save + compression we can make the numbers add up.
For parallel save using multifd, the overhead of libvirt is negligible.

> 
> Using multifd on the libvirt side, IIUC, gets us better performance
> than QEMU can manage if doing non-multifd write to file directly,
> but we still have the extra copies in there due to the hand off
> to libvirt. If QEMU were to be directly capable to writing to
> disk with multifd, it should beat us again.

Hmm I am thinking about this point, and at first glance I don't think this is 100% accurate;

if we do parallel save like in this series with multifd,
the overhead of libvirt is almost non-existent in my view compared with doing it with qemu only, skipping libvirt,
it is limited to the one iohelper for the main channel (which is the smallest of the transfers),
and maybe this could be removed as well.

This is because even without libvirt in the picture, we are still migrating to a socket, and something needs to
transfer data from that socket to a file. At that point I think both libvirt and a custom made script are in the same position.

> 
> As a result of how we integrate with QEMU multifd, we're taking the
> approach of saving the state across multiple files, because it is
> easier than trying to get multiple threads writing to the same file.
> It could be solved by using file range locking on the save file.
> eg a thread can reserve say 500 MB of space, fill it up, and then
> reserve another 500 MB, etc, etc. It is a bit tedious though and
> won't align nicely. eg a 1 GB huge page, would be 1 GB + a few
> bytes of QEMU RAM ave state header.
> 
> The other downside of multiple files is that it complicates life
> for both libvirt and apps using libvirt. They need to be aware of
> multiple files and move them around together. This is not a simple
> as it might sound. For example, IIRC OpenStack would upload a save
> image state into a glance bucket for later use. Well, now it needs
> multiple distinct buckets and keep track of them all. It also means
> we're forced to use the same concurrency level when restoring, which
> is not neccessarily desirable if the host environment is different
> when restoring. ie The original host might have had 8 CPUs, but the
> new host might only have 4 available, or vica-verca.
> 
> 
> I know it is appealing to do something on the libvirt side, because
> it is quicker than getting an enhancement into new QEMU release. We
> have been down this route before with the migration support in libvirt
> in the past though, when we introduced the tunnelled live migration
> in order to workaround QEMU's inability to do TLS encryption. I very
> much regret that we ever did this, because tunnelled migration was
> inherantly limited, so for example failed to work with multifd,
> and failed to work with NBD based disk migration. In the end I did
> what I should have done at the beginning and just added TLS support
> to QEMU, making tunnelled migration obsolete, except we still have
> to carry the code around in libvirt indefinitely due to apps using
> it.
> 
> So I'm very concerned about not having history repeat itself and
> give us a long term burden for  a solution that turns out to be a
> evolutionary dead end.
> 
> I like the idea of parallel saving, but I really think we need to
> implement this directly in QEMU, not libvirt. As previously
> mentioned I think QEMU needs to get a 'file' migration protocol,
> along with ability to directly map RAM  segments into fixed
> positions in the file. The benefits are many
> 
>  - It will save & restore faster because we're eliminating data
>    copies that libvirt imposes via the iohelper
>  
>  - It is simple for libvirt & mgmt apps as we still only
>    have one file to manage
> 
>  - It is space efficient because if a guest dirties a
>    memory page, we just overwrite the existing contents
>    at the fixed location in the file, instead of appending
>    new contents to the file
> 
>  - It will restore faster too because we only restore
>    each memory page once, due to always overwriting the
>    file in-place when the guest dirtied a page during save
> 
>  - It can save and restore with differing number of threads,
>    and can even dynamically change the number of threads
>    in the middle of the save/restore operation 
> 
> As David G has pointed out the impl is not trivial on the QEMU
> side, but from what I understand of the migration code, it is
> certainly viable. Most importantly I think it puts us in a
> better position for long term feature enhancements later by
> taking the middle man (libvirt) out of the equation, letting
> QEMU directly know what medium it is saving/restoring to/from.
> 
> 
> With regards,
> Daniel
> 

It's probably possible to do this in QEMU, with extensive changes, in my view possibly to the migration stream itself,
to have a more block-friendly, parallel migration stream to a file.

Thanks,

Claudio
Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Daniel P. Berrangé 5 days, 1 hour ago
On Wed, May 11, 2022 at 09:26:10AM +0200, Claudio Fontana wrote:
> Hi Daniel,
> 
> thanks for looking at this,
> 
> On 5/10/22 8:38 PM, Daniel P. Berrangé wrote:
> > On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
> >> This is v8 of the multifd save prototype, which fixes a few bugs,
> >> adds a few more code splits, and records the number of channels
> >> as well as the compression algorithm, so the restore command is
> >> more user-friendly.
> >>
> >> It is now possible to just say:
> >>
> >> virsh save mydomain /mnt/saves/mysave --parallel
> >>
> >> virsh restore /mnt/saves/mysave --parallel
> >>
> >> and things work with the default of 2 channels, no compression.
> >>
> >> It is also possible to say of course:
> >>
> >> virsh save mydomain /mnt/saves/mysave --parallel
> >>       --parallel-connections 16 --parallel-compression zstd
> >>
> >> virsh restore /mnt/saves/mysave --parallel
> >>
> >> and things also work fine, due to channels and compression
> >> being stored in the main save file.
> > 
> > For the sake of people following along, the above commands will
> > result in creation of multiple files
> > 
> >   /mnt/saves/mysave
> >   /mnt/saves/mysave.0
> 
> just minor correction, there is no .0

Heh, off-by-1

> 
> >   /mnt/saves/mysave.1
> >   ....
> >   /mnt/saves/mysave.n
> > 
> > Where 'n' is the number of threads used.
> > 
> > Overall I'm not very happy with the approach of doing any of this
> > on the libvirt side.
> 
> 
> Ok I understand your concern.
> 
> > 
> > Backing up, we know that QEMU can directly save to disk faster than
> > libvirt can. We mitigated alot of that overhead with previous patches
> > to increase the pipe buffer size, but some still remains due to the
> > extra copies inherant in handing this off to libvirt.
> 
> Right;
> still the performance we get is insufficient for the use case we are trying to address,
> even without libvirt in the picture.
> 
> Instead, with parallel save + compression we can make the numbers add up.
> For parallel save using multifd, the overhead of libvirt is negligible.
> 
> > 
> > Using multifd on the libvirt side, IIUC, gets us better performance
> > than QEMU can manage if doing non-multifd write to file directly,
> > but we still have the extra copies in there due to the hand off
> > to libvirt. If QEMU were to be directly capable to writing to
> > disk with multifd, it should beat us again.
> 
> Hmm I am thinking about this point, and at first glance I don't
> think this is 100% accurate;
> 
> if we do parallel save like in this series with multifd,
> the overhead of libvirt is almost non-existent in my view
> compared with doing it with qemu only, skipping libvirt,
> it is limited to the one iohelper for the main channel
> (which is the smallest of the transfers),
> and maybe this could be removed as well.

Libvirt adds overhead due to the multiple data copies in
the save process. Using multifd doesn't get rid of this
overhead, it merely distributes the overhead across many
CPUs. The overall wallclock time is reduced but in aggregate
the CPUs still have the same amount of total work todo
copying data around.

I don't recall the scale of the libvirt overhead that remains
after the pipe buffer optimizations, but whatever is less is
still taking up host CPU time that can be used for other guests.

It also just ocurred to me that currently our save/restore
approach is bypassing all resource limits applied to the
guest. eg block I/O rate limits, CPU affinity controls,
etc, because most of the work is done in the iohelper.
If we had this done in QEMU, then the save/restore process
is confined by the existing CPU affinity / I/o limits
applied to the guest. This mean we would not negatively
impact other co-hosted guests to the same extent.

> This is because even without libvirt in the picture, we
> are still migrating to a socket, and something needs to
> transfer data from that socket to a file. At that point
> I think both libvirt and a custom made script are in the
> same position.

If QEMU had explicit support for a "file" backend, there
would be no socket involved at all. QEMU would be copying
guest RAM directly to a file with no intermediate steps.
If QEMU mmap'd the save state file, then saving of the
guest RAM could even possibly reduce to a mere 'memcpy()'

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Claudio Fontana 4 days, 23 hours ago
On 5/11/22 11:51 AM, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 09:26:10AM +0200, Claudio Fontana wrote:
>> Hi Daniel,
>>
>> thanks for looking at this,
>>
>> On 5/10/22 8:38 PM, Daniel P. Berrangé wrote:
>>> On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
>>>> This is v8 of the multifd save prototype, which fixes a few bugs,
>>>> adds a few more code splits, and records the number of channels
>>>> as well as the compression algorithm, so the restore command is
>>>> more user-friendly.
>>>>
>>>> It is now possible to just say:
>>>>
>>>> virsh save mydomain /mnt/saves/mysave --parallel
>>>>
>>>> virsh restore /mnt/saves/mysave --parallel
>>>>
>>>> and things work with the default of 2 channels, no compression.
>>>>
>>>> It is also possible to say of course:
>>>>
>>>> virsh save mydomain /mnt/saves/mysave --parallel
>>>>       --parallel-connections 16 --parallel-compression zstd
>>>>
>>>> virsh restore /mnt/saves/mysave --parallel
>>>>
>>>> and things also work fine, due to channels and compression
>>>> being stored in the main save file.
>>>
>>> For the sake of people following along, the above commands will
>>> result in creation of multiple files
>>>
>>>   /mnt/saves/mysave
>>>   /mnt/saves/mysave.0
>>
>> just minor correction, there is no .0
> 
> Heh, off-by-1
> 
>>
>>>   /mnt/saves/mysave.1
>>>   ....
>>>   /mnt/saves/mysave.n
>>>
>>> Where 'n' is the number of threads used.
>>>
>>> Overall I'm not very happy with the approach of doing any of this
>>> on the libvirt side.
>>
>>
>> Ok I understand your concern.
>>
>>>
>>> Backing up, we know that QEMU can directly save to disk faster than
>>> libvirt can. We mitigated alot of that overhead with previous patches
>>> to increase the pipe buffer size, but some still remains due to the
>>> extra copies inherant in handing this off to libvirt.
>>
>> Right;
>> still the performance we get is insufficient for the use case we are trying to address,
>> even without libvirt in the picture.
>>
>> Instead, with parallel save + compression we can make the numbers add up.
>> For parallel save using multifd, the overhead of libvirt is negligible.
>>
>>>
>>> Using multifd on the libvirt side, IIUC, gets us better performance
>>> than QEMU can manage if doing non-multifd write to file directly,
>>> but we still have the extra copies in there due to the hand off
>>> to libvirt. If QEMU were to be directly capable to writing to
>>> disk with multifd, it should beat us again.
>>
>> Hmm I am thinking about this point, and at first glance I don't
>> think this is 100% accurate;
>>
>> if we do parallel save like in this series with multifd,
>> the overhead of libvirt is almost non-existent in my view
>> compared with doing it with qemu only, skipping libvirt,
>> it is limited to the one iohelper for the main channel
>> (which is the smallest of the transfers),
>> and maybe this could be removed as well.
> 
> Libvirt adds overhead due to the multiple data copies in
> the save process. Using multifd doesn't get rid of this
> overhead, it merely distributes the overhead across many
> CPUs. The overall wallclock time is reduced but in aggregate
> the CPUs still have the same amount of total work todo
> copying data around.
> 
> I don't recall the scale of the libvirt overhead that remains
> after the pipe buffer optimizations, but whatever is less is
> still taking up host CPU time that can be used for other guests.
> 
> It also just ocurred to me that currently our save/restore
> approach is bypassing all resource limits applied to the
> guest. eg block I/O rate limits, CPU affinity controls,
> etc, because most of the work is done in the iohelper.
> If we had this done in QEMU, then the save/restore process
> is confined by the existing CPU affinity / I/o limits
> applied to the guest. This mean we would not negatively
> impact other co-hosted guests to the same extent.
> 
>> This is because even without libvirt in the picture, we
>> are still migrating to a socket, and something needs to
>> transfer data from that socket to a file. At that point
>> I think both libvirt and a custom made script are in the
>> same position.
> 
> If QEMU had explicit support for a "file" backend, there
> would be no socket involved at all. QEMU would be copying
> guest RAM directly to a file with no intermediate steps.
> If QEMU mmap'd the save state file, then saving of the
> guest RAM could even possibly reduce to a mere 'memcpy()'

Agree, but still, to align with your requirement to have only one file,
libvirt would need to add some padding after the libvirt header and before the QEMU VM starts in the file,
so that the QEMU VM starts at a block-friendly address.

Thanks,

Claudio
Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Daniel P. Berrangé 4 days, 22 hours ago
On Wed, May 11, 2022 at 01:52:05PM +0200, Claudio Fontana wrote:
> On 5/11/22 11:51 AM, Daniel P. Berrangé wrote:
> > On Wed, May 11, 2022 at 09:26:10AM +0200, Claudio Fontana wrote:
> >> Hi Daniel,
> >>
> >> thanks for looking at this,
> >>
> >> On 5/10/22 8:38 PM, Daniel P. Berrangé wrote:
> >>> On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
> >>>> This is v8 of the multifd save prototype, which fixes a few bugs,
> >>>> adds a few more code splits, and records the number of channels
> >>>> as well as the compression algorithm, so the restore command is
> >>>> more user-friendly.
> >>>>
> >>>> It is now possible to just say:
> >>>>
> >>>> virsh save mydomain /mnt/saves/mysave --parallel
> >>>>
> >>>> virsh restore /mnt/saves/mysave --parallel
> >>>>
> >>>> and things work with the default of 2 channels, no compression.
> >>>>
> >>>> It is also possible to say of course:
> >>>>
> >>>> virsh save mydomain /mnt/saves/mysave --parallel
> >>>>       --parallel-connections 16 --parallel-compression zstd
> >>>>
> >>>> virsh restore /mnt/saves/mysave --parallel
> >>>>
> >>>> and things also work fine, due to channels and compression
> >>>> being stored in the main save file.
> >>>
> >>> For the sake of people following along, the above commands will
> >>> result in creation of multiple files
> >>>
> >>>   /mnt/saves/mysave
> >>>   /mnt/saves/mysave.0
> >>
> >> just minor correction, there is no .0
> > 
> > Heh, off-by-1
> > 
> >>
> >>>   /mnt/saves/mysave.1
> >>>   ....
> >>>   /mnt/saves/mysave.n
> >>>
> >>> Where 'n' is the number of threads used.
> >>>
> >>> Overall I'm not very happy with the approach of doing any of this
> >>> on the libvirt side.
> >>
> >>
> >> Ok I understand your concern.
> >>
> >>>
> >>> Backing up, we know that QEMU can directly save to disk faster than
> >>> libvirt can. We mitigated alot of that overhead with previous patches
> >>> to increase the pipe buffer size, but some still remains due to the
> >>> extra copies inherant in handing this off to libvirt.
> >>
> >> Right;
> >> still the performance we get is insufficient for the use case we are trying to address,
> >> even without libvirt in the picture.
> >>
> >> Instead, with parallel save + compression we can make the numbers add up.
> >> For parallel save using multifd, the overhead of libvirt is negligible.
> >>
> >>>
> >>> Using multifd on the libvirt side, IIUC, gets us better performance
> >>> than QEMU can manage if doing non-multifd write to file directly,
> >>> but we still have the extra copies in there due to the hand off
> >>> to libvirt. If QEMU were to be directly capable to writing to
> >>> disk with multifd, it should beat us again.
> >>
> >> Hmm I am thinking about this point, and at first glance I don't
> >> think this is 100% accurate;
> >>
> >> if we do parallel save like in this series with multifd,
> >> the overhead of libvirt is almost non-existent in my view
> >> compared with doing it with qemu only, skipping libvirt,
> >> it is limited to the one iohelper for the main channel
> >> (which is the smallest of the transfers),
> >> and maybe this could be removed as well.
> > 
> > Libvirt adds overhead due to the multiple data copies in
> > the save process. Using multifd doesn't get rid of this
> > overhead, it merely distributes the overhead across many
> > CPUs. The overall wallclock time is reduced but in aggregate
> > the CPUs still have the same amount of total work todo
> > copying data around.
> > 
> > I don't recall the scale of the libvirt overhead that remains
> > after the pipe buffer optimizations, but whatever is less is
> > still taking up host CPU time that can be used for other guests.
> > 
> > It also just ocurred to me that currently our save/restore
> > approach is bypassing all resource limits applied to the
> > guest. eg block I/O rate limits, CPU affinity controls,
> > etc, because most of the work is done in the iohelper.
> > If we had this done in QEMU, then the save/restore process
> > is confined by the existing CPU affinity / I/o limits
> > applied to the guest. This mean we would not negatively
> > impact other co-hosted guests to the same extent.
> > 
> >> This is because even without libvirt in the picture, we
> >> are still migrating to a socket, and something needs to
> >> transfer data from that socket to a file. At that point
> >> I think both libvirt and a custom made script are in the
> >> same position.
> > 
> > If QEMU had explicit support for a "file" backend, there
> > would be no socket involved at all. QEMU would be copying
> > guest RAM directly to a file with no intermediate steps.
> > If QEMU mmap'd the save state file, then saving of the
> > guest RAM could even possibly reduce to a mere 'memcpy()'
> 
> Agree, but still, to align with your requirement to have only one file,
> libvirt would need to add some padding after the libvirt header and before the QEMU VM starts in the file,
> so that the QEMU VM starts at a block-friendly address.

That's trivial, as we already add padding in this place.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Claudio Fontana 4 days, 17 hours ago
On 5/11/22 2:02 PM, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 01:52:05PM +0200, Claudio Fontana wrote:
>> On 5/11/22 11:51 AM, Daniel P. Berrangé wrote:
>>> On Wed, May 11, 2022 at 09:26:10AM +0200, Claudio Fontana wrote:
>>>> Hi Daniel,
>>>>
>>>> thanks for looking at this,
>>>>
>>>> On 5/10/22 8:38 PM, Daniel P. Berrangé wrote:
>>>>> On Sat, May 07, 2022 at 03:42:53PM +0200, Claudio Fontana wrote:
>>>>>> This is v8 of the multifd save prototype, which fixes a few bugs,
>>>>>> adds a few more code splits, and records the number of channels
>>>>>> as well as the compression algorithm, so the restore command is
>>>>>> more user-friendly.
>>>>>>
>>>>>> It is now possible to just say:
>>>>>>
>>>>>> virsh save mydomain /mnt/saves/mysave --parallel
>>>>>>
>>>>>> virsh restore /mnt/saves/mysave --parallel
>>>>>>
>>>>>> and things work with the default of 2 channels, no compression.
>>>>>>
>>>>>> It is also possible to say of course:
>>>>>>
>>>>>> virsh save mydomain /mnt/saves/mysave --parallel
>>>>>>       --parallel-connections 16 --parallel-compression zstd
>>>>>>
>>>>>> virsh restore /mnt/saves/mysave --parallel
>>>>>>
>>>>>> and things also work fine, due to channels and compression
>>>>>> being stored in the main save file.
>>>>>
>>>>> For the sake of people following along, the above commands will
>>>>> result in creation of multiple files
>>>>>
>>>>>   /mnt/saves/mysave
>>>>>   /mnt/saves/mysave.0
>>>>
>>>> just minor correction, there is no .0
>>>
>>> Heh, off-by-1
>>>
>>>>
>>>>>   /mnt/saves/mysave.1
>>>>>   ....
>>>>>   /mnt/saves/mysave.n
>>>>>
>>>>> Where 'n' is the number of threads used.
>>>>>
>>>>> Overall I'm not very happy with the approach of doing any of this
>>>>> on the libvirt side.
>>>>
>>>>
>>>> Ok I understand your concern.
>>>>
>>>>>
>>>>> Backing up, we know that QEMU can directly save to disk faster than
>>>>> libvirt can. We mitigated alot of that overhead with previous patches
>>>>> to increase the pipe buffer size, but some still remains due to the
>>>>> extra copies inherant in handing this off to libvirt.
>>>>
>>>> Right;
>>>> still the performance we get is insufficient for the use case we are trying to address,
>>>> even without libvirt in the picture.
>>>>
>>>> Instead, with parallel save + compression we can make the numbers add up.
>>>> For parallel save using multifd, the overhead of libvirt is negligible.
>>>>
>>>>>
>>>>> Using multifd on the libvirt side, IIUC, gets us better performance
>>>>> than QEMU can manage if doing non-multifd write to file directly,
>>>>> but we still have the extra copies in there due to the hand off
>>>>> to libvirt. If QEMU were to be directly capable to writing to
>>>>> disk with multifd, it should beat us again.
>>>>
>>>> Hmm I am thinking about this point, and at first glance I don't
>>>> think this is 100% accurate;
>>>>
>>>> if we do parallel save like in this series with multifd,
>>>> the overhead of libvirt is almost non-existent in my view
>>>> compared with doing it with qemu only, skipping libvirt,
>>>> it is limited to the one iohelper for the main channel
>>>> (which is the smallest of the transfers),
>>>> and maybe this could be removed as well.
>>>
>>> Libvirt adds overhead due to the multiple data copies in
>>> the save process. Using multifd doesn't get rid of this
>>> overhead, it merely distributes the overhead across many
>>> CPUs. The overall wallclock time is reduced but in aggregate
>>> the CPUs still have the same amount of total work todo
>>> copying data around.
>>>
>>> I don't recall the scale of the libvirt overhead that remains
>>> after the pipe buffer optimizations, but whatever is less is
>>> still taking up host CPU time that can be used for other guests.
>>>
>>> It also just ocurred to me that currently our save/restore
>>> approach is bypassing all resource limits applied to the
>>> guest. eg block I/O rate limits, CPU affinity controls,
>>> etc, because most of the work is done in the iohelper.
>>> If we had this done in QEMU, then the save/restore process
>>> is confined by the existing CPU affinity / I/o limits
>>> applied to the guest. This mean we would not negatively
>>> impact other co-hosted guests to the same extent.
>>>
>>>> This is because even without libvirt in the picture, we
>>>> are still migrating to a socket, and something needs to
>>>> transfer data from that socket to a file. At that point
>>>> I think both libvirt and a custom made script are in the
>>>> same position.
>>>
>>> If QEMU had explicit support for a "file" backend, there
>>> would be no socket involved at all. QEMU would be copying
>>> guest RAM directly to a file with no intermediate steps.
>>> If QEMU mmap'd the save state file, then saving of the
>>> guest RAM could even possibly reduce to a mere 'memcpy()'
>>
>> Agree, but still, to align with your requirement to have only one file,
>> libvirt would need to add some padding after the libvirt header and before the QEMU VM starts in the file,
>> so that the QEMU VM starts at a block-friendly address.
> 
> That's trivial, as we already add padding in this place.
> 

That's great, I love when things are simple.

If indeed we want to remove the copy in libvirt (which will also mean explicitly fsyncing elsewhere, as the iohelper would not be there anymore to do that for us on image creation),
with QEMU having a "file" protocol support for migration,

do we plan to have libvirt and QEMU both open the file for writing concurrently, with QEMU opening O_DIRECT?

The alternative being having libvirt open the file with O_DIRECT, write some libvirt stuff in a new, O_DIRECT-friendly format, and then pass the fd to qemu to migrate to,
and QEMU sending its new O_DIRECT friendly stream there.

In any case, the expectation here is to have a new "file://pathname" or "file:://fdname" as an added feature in QEMU,
where QEMU would write a new O_DIRECT friendly stream directly into the file, taking care of both optional parallelization and compression.

Is that the gist of it? Seems a lot of work, just trying to roughly figure out the boundaries of this.

Thanks,

Claudio
Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Daniel P. Berrangé 4 days, 17 hours ago
On Wed, May 11, 2022 at 07:31:45PM +0200, Claudio Fontana wrote:
> That's great, I love when things are simple.
> 
> If indeed we want to remove the copy in libvirt (which will also mean explicitly fsyncing elsewhere, as the iohelper would not be there anymore to do that for us on image creation),
> with QEMU having a "file" protocol support for migration,
> 
> do we plan to have libvirt and QEMU both open the file for writing concurrently, with QEMU opening O_DIRECT?

For non-libvirt users, I expect QEMU would open the
file directly . For libvirt usage, it is likely
preferrable to pass the pre-opened FD, because that
simplifies file permission handling.

> The alternative being having libvirt open the file with
> O_DIRECT, write some libvirt stuff in a new, O_DIRECT-
> friendly format, and then pass the fd to qemu to migrate to,
> and QEMU sending its new O_DIRECT friendly stream there.

Yep.

> In any case, the expectation here is to have a new
> "file://pathname" or "file:://fdname" as an added feature in QEMU,
> where QEMU would write a new O_DIRECT friendly stream
> directly into the file, taking care of both optional
> parallelization and compression.

I could see several distinct building blocks

  * First a "file:/some/path" migration protocol
    that can just do "normal" I/O, but still writing
    in the traditional migration data stream

  * Modify existing 'fd:' protocol so that it fstat()s
    and passes over to the 'file' protocol handler if
    it sees the FD is not a socket/pipe

  * Add a migration capability "direct-mapped" to
    indicate we want the RAM data written/read directly
    to/from fixed positions in the file, as opposed to
    a stream. Obviously only valid with a sub-set
    of migration protocols (file, and fd: if a seekable
    FD).

  * Add a migration capability "bypass-cache" to
    indicate we want O_DIRECT to bypass host I/O
    cache.  Again limited to some migration protocols 


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Dr. David Alan Gilbert 3 days, 17 hours ago
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Wed, May 11, 2022 at 07:31:45PM +0200, Claudio Fontana wrote:
> > That's great, I love when things are simple.
> > 
> > If indeed we want to remove the copy in libvirt (which will also mean explicitly fsyncing elsewhere, as the iohelper would not be there anymore to do that for us on image creation),
> > with QEMU having a "file" protocol support for migration,
> > 
> > do we plan to have libvirt and QEMU both open the file for writing concurrently, with QEMU opening O_DIRECT?
> 
> For non-libvirt users, I expect QEMU would open the
> file directly . For libvirt usage, it is likely
> preferrable to pass the pre-opened FD, because that
> simplifies file permission handling.
> 
> > The alternative being having libvirt open the file with
> > O_DIRECT, write some libvirt stuff in a new, O_DIRECT-
> > friendly format, and then pass the fd to qemu to migrate to,
> > and QEMU sending its new O_DIRECT friendly stream there.
> 
> Yep.
> 
> > In any case, the expectation here is to have a new
> > "file://pathname" or "file:://fdname" as an added feature in QEMU,
> > where QEMU would write a new O_DIRECT friendly stream
> > directly into the file, taking care of both optional
> > parallelization and compression.
> 
> I could see several distinct building blocks
> 
>   * First a "file:/some/path" migration protocol
>     that can just do "normal" I/O, but still writing
>     in the traditional migration data stream
> 
>   * Modify existing 'fd:' protocol so that it fstat()s
>     and passes over to the 'file' protocol handler if
>     it sees the FD is not a socket/pipe

We used to have that at one point.

>   * Add a migration capability "direct-mapped" to
>     indicate we want the RAM data written/read directly
>     to/from fixed positions in the file, as opposed to
>     a stream. Obviously only valid with a sub-set
>     of migration protocols (file, and fd: if a seekable
>     FD).

This worries me about how you're going to cleanly glue this into the
migration code;  it sounds like what you want it to do is very different
to what it currently does.

Dave

>   * Add a migration capability "bypass-cache" to
>     indicate we want O_DIRECT to bypass host I/O
>     cache.  Again limited to some migration protocols 
> 
> 
> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Daniel P. Berrangé 3 days, 17 hours ago
On Thu, May 12, 2022 at 05:58:46PM +0100, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > On Wed, May 11, 2022 at 07:31:45PM +0200, Claudio Fontana wrote:
> > > That's great, I love when things are simple.
> > > 
> > > If indeed we want to remove the copy in libvirt (which will also mean explicitly fsyncing elsewhere, as the iohelper would not be there anymore to do that for us on image creation),
> > > with QEMU having a "file" protocol support for migration,
> > > 
> > > do we plan to have libvirt and QEMU both open the file for writing concurrently, with QEMU opening O_DIRECT?
> > 
> > For non-libvirt users, I expect QEMU would open the
> > file directly . For libvirt usage, it is likely
> > preferrable to pass the pre-opened FD, because that
> > simplifies file permission handling.
> > 
> > > The alternative being having libvirt open the file with
> > > O_DIRECT, write some libvirt stuff in a new, O_DIRECT-
> > > friendly format, and then pass the fd to qemu to migrate to,
> > > and QEMU sending its new O_DIRECT friendly stream there.
> > 
> > Yep.
> > 
> > > In any case, the expectation here is to have a new
> > > "file://pathname" or "file:://fdname" as an added feature in QEMU,
> > > where QEMU would write a new O_DIRECT friendly stream
> > > directly into the file, taking care of both optional
> > > parallelization and compression.
> > 
> > I could see several distinct building blocks
> > 
> >   * First a "file:/some/path" migration protocol
> >     that can just do "normal" I/O, but still writing
> >     in the traditional migration data stream
> > 
> >   * Modify existing 'fd:' protocol so that it fstat()s
> >     and passes over to the 'file' protocol handler if
> >     it sees the FD is not a socket/pipe
> 
> We used to have that at one point.
> 
> >   * Add a migration capability "direct-mapped" to
> >     indicate we want the RAM data written/read directly
> >     to/from fixed positions in the file, as opposed to
> >     a stream. Obviously only valid with a sub-set
> >     of migration protocols (file, and fd: if a seekable
> >     FD).
> 
> This worries me about how you're going to cleanly glue this into the
> migration code;  it sounds like what you want it to do is very different
> to what it currently does.

I've only investigated it lightly, but I see the key bit of code
is this method which emits the header + ram page content:


static int save_normal_page(RAMState *rs, RAMBlock *block, ram_addr_t offset,
                            uint8_t *buf, bool async)
{
    ram_transferred_add(save_page_header(rs, rs->f, block,
                                         offset | RAM_SAVE_FLAG_PAGE));
    if (async) {
        qemu_put_buffer_async(rs->f, buf, TARGET_PAGE_SIZE,
                              migrate_release_ram() &&
                              migration_in_postcopy());
    } else {
        qemu_put_buffer(rs->f, buf, TARGET_PAGE_SIZE);
    }
    ram_transferred_add(TARGET_PAGE_SIZE);
    ram_counters.normal++;
    return 1;
}


my (perhaps wishful) thinking was that we just have an alternative
impl of this which doesn't save the page header, and puts the
page content at a fixed offset.

I'm fuzzy on how we figure out the right offset - I was hoping
that "RAMState" or "RAMBlock" somehow gives us enough info to figure
out a deterministic mapping to a file location.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Dr. David Alan Gilbert 3 days, 16 hours ago
* Daniel P. Berrangé (berrange@redhat.com) wrote:
> On Thu, May 12, 2022 at 05:58:46PM +0100, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berrange@redhat.com) wrote:
> > > On Wed, May 11, 2022 at 07:31:45PM +0200, Claudio Fontana wrote:
> > > > That's great, I love when things are simple.
> > > > 
> > > > If indeed we want to remove the copy in libvirt (which will also mean explicitly fsyncing elsewhere, as the iohelper would not be there anymore to do that for us on image creation),
> > > > with QEMU having a "file" protocol support for migration,
> > > > 
> > > > do we plan to have libvirt and QEMU both open the file for writing concurrently, with QEMU opening O_DIRECT?
> > > 
> > > For non-libvirt users, I expect QEMU would open the
> > > file directly . For libvirt usage, it is likely
> > > preferrable to pass the pre-opened FD, because that
> > > simplifies file permission handling.
> > > 
> > > > The alternative being having libvirt open the file with
> > > > O_DIRECT, write some libvirt stuff in a new, O_DIRECT-
> > > > friendly format, and then pass the fd to qemu to migrate to,
> > > > and QEMU sending its new O_DIRECT friendly stream there.
> > > 
> > > Yep.
> > > 
> > > > In any case, the expectation here is to have a new
> > > > "file://pathname" or "file:://fdname" as an added feature in QEMU,
> > > > where QEMU would write a new O_DIRECT friendly stream
> > > > directly into the file, taking care of both optional
> > > > parallelization and compression.
> > > 
> > > I could see several distinct building blocks
> > > 
> > >   * First a "file:/some/path" migration protocol
> > >     that can just do "normal" I/O, but still writing
> > >     in the traditional migration data stream
> > > 
> > >   * Modify existing 'fd:' protocol so that it fstat()s
> > >     and passes over to the 'file' protocol handler if
> > >     it sees the FD is not a socket/pipe
> > 
> > We used to have that at one point.
> > 
> > >   * Add a migration capability "direct-mapped" to
> > >     indicate we want the RAM data written/read directly
> > >     to/from fixed positions in the file, as opposed to
> > >     a stream. Obviously only valid with a sub-set
> > >     of migration protocols (file, and fd: if a seekable
> > >     FD).
> > 
> > This worries me about how you're going to cleanly glue this into the
> > migration code;  it sounds like what you want it to do is very different
> > to what it currently does.
> 
> I've only investigated it lightly, but I see the key bit of code
> is this method which emits the header + ram page content:
> 
> 
> static int save_normal_page(RAMState *rs, RAMBlock *block, ram_addr_t offset,
>                             uint8_t *buf, bool async)
> {
>     ram_transferred_add(save_page_header(rs, rs->f, block,
>                                          offset | RAM_SAVE_FLAG_PAGE));
>     if (async) {
>         qemu_put_buffer_async(rs->f, buf, TARGET_PAGE_SIZE,
>                               migrate_release_ram() &&
>                               migration_in_postcopy());
>     } else {
>         qemu_put_buffer(rs->f, buf, TARGET_PAGE_SIZE);
>     }
>     ram_transferred_add(TARGET_PAGE_SIZE);
>     ram_counters.normal++;
>     return 1;
> }
> 
> 
> my (perhaps wishful) thinking was that we just have an alternative
> impl of this which doesn't save the page header, and puts the
> page content at a fixed offset.

Hmm OK, probably can; note I think the multifd is separate code
(and currently much cleaner - which you'd make more complex again).

> I'm fuzzy on how we figure out the right offset - I was hoping
> that "RAMState" or "RAMBlock" somehow gives us enough info to figure
> out a deterministic mapping to a file location.

I think that's probably the ram_addr_t type, RAMBlock->offset + the
index intot he ramblock; that gets you the same thing as the dirty
bitmap (hmm although we don't have a single one of those any more).

Dave

> With regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 
-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Claudio Fontana 3 days, 21 hours ago
On 5/11/22 7:46 PM, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 07:31:45PM +0200, Claudio Fontana wrote:
>> That's great, I love when things are simple.
>>
>> If indeed we want to remove the copy in libvirt (which will also mean explicitly fsyncing elsewhere, as the iohelper would not be there anymore to do that for us on image creation),
>> with QEMU having a "file" protocol support for migration,
>>
>> do we plan to have libvirt and QEMU both open the file for writing concurrently, with QEMU opening O_DIRECT?
> 
> For non-libvirt users, I expect QEMU would open the
> file directly . For libvirt usage, it is likely
> preferrable to pass the pre-opened FD, because that
> simplifies file permission handling.
> 
>> The alternative being having libvirt open the file with
>> O_DIRECT, write some libvirt stuff in a new, O_DIRECT-
>> friendly format, and then pass the fd to qemu to migrate to,
>> and QEMU sending its new O_DIRECT friendly stream there.
> 
> Yep.

Currently I am working on this part, and I can use my prototype to test that the Direct part of the I/O works.
I can split things more to be able to provide more generally useful parts to the series,
that we can use to test out things while the qemu stuff is hopefully also tackled.

Thanks,

Claudio

> 
>> In any case, the expectation here is to have a new
>> "file://pathname" or "file:://fdname" as an added feature in QEMU,
>> where QEMU would write a new O_DIRECT friendly stream
>> directly into the file, taking care of both optional
>> parallelization and compression.
> 
> I could see several distinct building blocks
> 
>   * First a "file:/some/path" migration protocol
>     that can just do "normal" I/O, but still writing
>     in the traditional migration data stream
> 
>   * Modify existing 'fd:' protocol so that it fstat()s
>     and passes over to the 'file' protocol handler if
>     it sees the FD is not a socket/pipe
> 
>   * Add a migration capability "direct-mapped" to
>     indicate we want the RAM data written/read directly
>     to/from fixed positions in the file, as opposed to
>     a stream. Obviously only valid with a sub-set
>     of migration protocols (file, and fd: if a seekable
>     FD).
> 
>   * Add a migration capability "bypass-cache" to
>     indicate we want O_DIRECT to bypass host I/O
>     cache.  Again limited to some migration protocols 
> 
> 
> With regards,
> Daniel
> 
Re: [libvirt RFCv8 00/27] multifd save restore prototype
Posted by Claudio Fontana 3 days, 2 hours ago
On 5/12/22 3:38 PM, Claudio Fontana wrote:
> On 5/11/22 7:46 PM, Daniel P. Berrangé wrote:
>> On Wed, May 11, 2022 at 07:31:45PM +0200, Claudio Fontana wrote:
>>> That's great, I love when things are simple.
>>>
>>> If indeed we want to remove the copy in libvirt (which will also mean explicitly fsyncing elsewhere, as the iohelper would not be there anymore to do that for us on image creation),
>>> with QEMU having a "file" protocol support for migration,
>>>
>>> do we plan to have libvirt and QEMU both open the file for writing concurrently, with QEMU opening O_DIRECT?
>>
>> For non-libvirt users, I expect QEMU would open the
>> file directly . For libvirt usage, it is likely
>> preferrable to pass the pre-opened FD, because that
>> simplifies file permission handling.
>>
>>> The alternative being having libvirt open the file with
>>> O_DIRECT, write some libvirt stuff in a new, O_DIRECT-
>>> friendly format, and then pass the fd to qemu to migrate to,
>>> and QEMU sending its new O_DIRECT friendly stream there.
>>
>> Yep.
> 
> Currently I am working on this part, and I can use my prototype to test that the Direct part of the I/O works.


Just as a follow up and heads up that I am reworking the Image Write and Read code in qemu_saveimage.c
and while doing that I see quite a few things that need improvement, especially missing validations on lengths etc.

Ciao,

Claudio

> I can split things more to be able to provide more generally useful parts to the series,
> that we can use to test out things while the qemu stuff is hopefully also tackled.
> 
> Thanks,
> 
> Claudio
> 
>>
>>> In any case, the expectation here is to have a new
>>> "file://pathname" or "file:://fdname" as an added feature in QEMU,
>>> where QEMU would write a new O_DIRECT friendly stream
>>> directly into the file, taking care of both optional
>>> parallelization and compression.
>>
>> I could see several distinct building blocks
>>
>>   * First a "file:/some/path" migration protocol
>>     that can just do "normal" I/O, but still writing
>>     in the traditional migration data stream
>>
>>   * Modify existing 'fd:' protocol so that it fstat()s
>>     and passes over to the 'file' protocol handler if
>>     it sees the FD is not a socket/pipe
>>
>>   * Add a migration capability "direct-mapped" to
>>     indicate we want the RAM data written/read directly
>>     to/from fixed positions in the file, as opposed to
>>     a stream. Obviously only valid with a sub-set
>>     of migration protocols (file, and fd: if a seekable
>>     FD).
>>
>>   * Add a migration capability "bypass-cache" to
>>     indicate we want O_DIRECT to bypass host I/O
>>     cache.  Again limited to some migration protocols 
>>
>>
>> With regards,
>> Daniel
>>
>