What?

This patch series adds the live migration cpr-exec mode, which allows
the user to update QEMU with minimal guest pause time, by preserving
guest RAM in place, albeit with new virtual addresses in new QEMU, and
by preserving device file descriptors.

The new user-visible interfaces are:
  * cpr-exec (MigMode migration parameter)
  * cpr-exec-command (migration parameter)
  * anon-alloc (command-line option for -machine)

The user sets the mode parameter before invoking the migrate command.
In this mode, the user issues the migrate command to old QEMU, which
stops the VM and saves state to the migration channels.  Old QEMU then
exec's new QEMU, replacing the original process while retaining its PID.
The user specifies the command to exec new QEMU in the migration
parameter cpr-exec-command.  The command must pass all old QEMU arguments
to new QEMU, plus the -incoming option.  Execution resumes in new QEMU.

Memory-backend objects must have the share=on attribute, but
memory-backend-epc is not supported.  The VM must be started with the
'-machine anon-alloc=memfd' option, which allows anonymous memory to be
transferred in place to the new process.

Why?

This mode has less impact on the guest than any other method of updating
in place.  The pause time is much lower, because devices need not be torn
down and recreated, DMA does not need to be drained and quiesced, and
minimal state is copied to new QEMU.  Further, there are no constraints
on the guest.  By contrast, cpr-reboot mode requires the guest to support
S3 suspend-to-ram, and suspending plus resuming vfio devices adds
multiple seconds to the guest pause time.  Lastly, there is no loss of
connectivity to the guest, because chardev descriptors remain open and
connected.

These benefits all derive from the core design principle of this mode,
which is preserving open descriptors.  This approach is very general and
can be used to support a wide variety of devices that do not have
hardware support for live migration, including but not limited to: vfio,
chardev, vhost, vdpa, and iommufd.  Some devices need new kernel software
interfaces to allow a descriptor to be used in a process that did not
originally open it.

In a containerized QEMU environment, cpr-exec reuses an existing QEMU
container and its assigned resources.  By contrast, consider a design in
which a new container is created on the same host as the target of the
CPR operation.  Resources must be reserved for the new container, while
the old container still reserves resources until the operation completes.
Avoiding overcommitment requires extra work in the management layer.
This is one reason why a cloud provider may prefer cpr-exec.  A second
reason is that the container may include agents with their own
connections to the outside world, and such connections remain intact if
the container is reused.

How?

All memory that is mapped by the guest is preserved in place.  Indeed,
it must be, because it may be the target of DMA requests, which are not
quiesced during cpr-exec.  All such memory must be mmap'able in new QEMU.
This is easy for named memory-backend objects, as long as they are mapped
shared, because they are visible in the file system in both old and new
QEMU.  Anonymous memory must be allocated using memfd_create rather than
MAP_ANON, so the memfd's can be sent to new QEMU.  Pages that were locked
in memory for DMA in old QEMU remain locked in new QEMU, because the
descriptor of the device that locked them remains open.
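For readers unfamiliar with memfd: the allocation scheme selected by
'-machine anon-alloc=memfd' can be sketched as follows.  This is an
illustrative sketch, not the code from this series; the function name
and error handling are invented.  It needs glibc 2.27+ for memfd_create.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <unistd.h>

    /* Anonymous guest RAM backed by a memfd instead of MAP_ANON, so
     * the mapping can be recreated by new QEMU after exec.  Plain
     * MAP_ANON memory has no handle that survives exec. */
    static void *alloc_anon_memfd(size_t size, int *fd_out)
    {
        int fd = memfd_create("ram-block", 0);  /* fd names the memory */
        if (fd < 0) {
            return NULL;
        }
        if (ftruncate(fd, size) < 0) {
            close(fd);
            return NULL;
        }
        /* MAP_SHARED: old and new QEMU see the same pages, so DMA
         * targets stay valid while the process image is replaced. */
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return NULL;
        }
        *fd_out = fd;  /* stays open across exec once CLOEXEC is cleared */
        return p;
    }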
cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag
(sketched after this cover letter), and by sending the unique name and
value of each descriptor to new QEMU via CPR state.

For device descriptors, new QEMU reuses the descriptor when creating the
device, rather than opening it again.  The same holds for chardevs.  For
memfd descriptors, new QEMU mmap's the preserved memfd when a ramblock
is created.

CPR state cannot be sent over the normal migration channel, because
devices and backends are created prior to reading the channel, so this
mode sends CPR state over a second migration channel that is not visible
to the user.  New QEMU reads the second channel prior to creating devices
or backends.

The exec itself is trivial.  After writing to the migration channels, the
migration code calls a new main-loop hook to perform the exec.

Example:

In this example, we simply restart the same version of QEMU, but in a
real scenario one would use a new QEMU binary path in cpr-exec-command.

  # qemu-kvm -monitor stdio
    -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on
    -m 4G -machine anon-alloc=memfd ...

  QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running
  (qemu) migrate_set_parameter mode cpr-exec
  (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
  (qemu) migrate -d file:vm.state
  (qemu) QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running

This patch series implements a minimal version of cpr-exec.  Additional
series are ready to be posted to deliver the complete vision described
above, including:
  * vfio
  * chardev
  * vhost
  * blockers
  * hostmem-memfd
  * migration-test cases

Works in progress include:
  * vdpa
  * iommufd
  * cpr-transfer mode

Changes since V1:
  * Dropped precreate and factory patches.  Added CPR state instead.
  * Dropped patches that refactor ramblock allocation.
  * Dropped vmstate_info_void patch (Peter).
  * Dropped patch "seccomp: cpr-exec blocker" (Daniel).
  * Redefined memfd-alloc option as anon-alloc.
  * No longer preserve ramblock fields, except for fd (Peter).
  * Added fd preservation functions in CPR state.
  * Hoisted cpr code out of migrate_fd_cleanup (Fabiano).
  * Revised migration.json docs (Markus).
  * Fixed qtest failures (Fabiano).
  * Renamed SAVEVM_FOREACH macros (Fabiano).
  * Renamed cpr-exec-args as cpr-exec-command (Markus).

The first 6 patches below are foundational and are needed for both
cpr-exec mode and cpr-transfer mode.  The last 5 patches are specific to
cpr-exec and implement the mechanisms for sharing state across exec.
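As referenced above, the CLOEXEC-clearing step is small.  The series adds
a qemu_clear_cloexec helper (see the oslib patch in the shortlog below);
the sketch here shows what such a helper boils down to, but is not the
patch's exact code.

    #include <fcntl.h>

    /* Clear the close-on-exec flag so a descriptor survives exec;
     * cpr-exec does the equivalent for each preserved fd. */
    static int clear_cloexec(int fd)
    {
        int flags = fcntl(fd, F_GETFD);
        if (flags < 0) {
            return -1;
        }
        return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
    }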
Steve Sistare (11):
  machine: alloc-anon option
  migration: cpr-state
  migration: save cpr mode
  migration: stop vm earlier for cpr
  physmem: preserve ram blocks for cpr
  migration: fix mismatched GPAs during cpr
  oslib: qemu_clear_cloexec
  vl: helper to request exec
  migration: cpr-exec-command parameter
  migration: cpr-exec save and load
  migration: cpr-exec mode

 hmp-commands.hx                |   2 +-
 hw/core/machine.c              |  24 +++++
 include/exec/memory.h          |  12 +++
 include/hw/boards.h            |   1 +
 include/migration/cpr.h        |  35 ++++++
 include/qemu/osdep.h           |   9 ++
 include/sysemu/runstate.h      |   3 +
 migration/cpr-exec.c           | 180 +++++++++++++++++++++++++++++++
 migration/cpr.c                | 238 +++++++++++++++++++++++++++++++++++++++++
 migration/meson.build          |   2 +
 migration/migration-hmp-cmds.c |  25 +++++
 migration/migration.c          |  43 ++++++--
 migration/options.c            |  23 +++-
 migration/ram.c                |  17 +--
 migration/trace-events         |   5 +
 qapi/machine.json              |  14 +++
 qapi/migration.json            |  45 +++++++-
 qemu-options.hx                |  13 +++
 system/memory.c                |  22 +++-
 system/physmem.c               |  61 ++++++++++-
 system/runstate.c              |  29 +++++
 system/trace-events            |   3 +
 system/vl.c                    |   3 +
 util/oslib-posix.c             |   9 ++
 util/oslib-win32.c             |   4 +
 25 files changed, 792 insertions(+), 30 deletions(-)
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr-exec.c
 create mode 100644 migration/cpr.c

--
1.8.3.1
Steve,

On Sun, Jun 30, 2024 at 12:40:23PM -0700, Steve Sistare wrote:
> What?

Thanks for trying out with the cpr-transfer series.  I saw that that
series missed most of the cc list here, so I'm attaching the link here:

https://lore.kernel.org/r/1719776648-435073-1-git-send-email-steven.sistare@oracle.com

I think most of my previous questions for the exec() solution are still
there; I'll try to summarize them all in this reply as much as I can.

> This patch series adds the live migration cpr-exec mode, which allows
> the user to update QEMU with minimal guest pause time, by preserving
> guest RAM in place, albeit with new virtual addresses in new QEMU, and
> by preserving device file descriptors.
>
> The new user-visible interfaces are:
> * cpr-exec (MigMode migration parameter)
> * cpr-exec-command (migration parameter)

I really, really hope we can avoid this..

It's super cumbersome to pass in a qemu cmdline in a qemu migration
parameter.. if we can do that with generic live migration ways, I hope we
stick with the clean approach.

> * anon-alloc (command-line option for -machine)

Igor questioned this, and I second his opinion.. We can leave the
discussion there for this one.

> The user sets the mode parameter before invoking the migrate command.
> In this mode, the user issues the migrate command to old QEMU, which
> stops the VM and saves state to the migration channels.  Old QEMU then
> exec's new QEMU, replacing the original process while retaining its PID.
> The user specifies the command to exec new QEMU in the migration
> parameter cpr-exec-command.  The command must pass all old QEMU
> arguments to new QEMU, plus the -incoming option.  Execution resumes in
> new QEMU.
>
> Memory-backend objects must have the share=on attribute, but
> memory-backend-epc is not supported.  The VM must be started with the
> '-machine anon-alloc=memfd' option, which allows anonymous memory to be
> transferred in place to the new process.
>
> Why?
>
> This mode has less impact on the guest than any other method of
> updating in place.

So I wonder whether there's a comparison between exec() and the transfer
mode that you recently proposed.

I'm asking because exec() (besides all the rest of the things that I
dislike in this approach..) should be simply slower, logically, due to
the serialized operations of (1) tearing down the old mm, (2) reloading
the new ELF, then (3) running through the QEMU init process.

With a generic migration solution, the dest QEMU can start running (2+3)
concurrently, without even needing to run (1).

In this whole process, I doubt (2) could be relatively fast; (3) I donno,
maybe it could be slow, but I never measured; Paolo may have a good idea,
as I know he used to work on qboot.

For (1), I also doubt in your test cases it's fast, but it may not always
be fast.  Consider a guest with huge TBs of shared mem: even if the
memory will be completely shared between src/dst QEMUs, the pgtable
won't!  It means if the TBs are mapped in PAGE_SIZE, tearing down the src
QEMU pgtable alone can take time, and that will be accounted in step (1)
and further in the exec() request.

All this fuss will be avoided if you use a generic live migration model
like the cpr-transfer you proposed.  That's also cleaner.

> The pause time is much lower, because devices need not be torn down and
> recreated, DMA does not need to be drained and quiesced, and minimal
> state is copied to new QEMU.  Further, there are no constraints on the
> guest.
> By contrast, cpr-reboot mode requires the guest to support S3
> suspend-to-ram, and suspending plus resuming vfio devices adds multiple
> seconds to the guest pause time.  Lastly, there is no loss of
> connectivity to the guest, because chardev descriptors remain open and
> connected.

Again, I raised the question on why this would matter: after all, the
mgmt app will need to cope with reconnections, due to the fact that it
will need to support a generic live migration, in which case reconnection
is a must.

So far it doesn't sound like a performance-critical path, for example, to
do the mgmt reconnects on the ports.  So this might be an optimization
that most mgmt apps may not care much about?

> These benefits all derive from the core design principle of this mode,
> which is preserving open descriptors.  This approach is very general and
> can be used to support a wide variety of devices that do not have
> hardware support for live migration, including but not limited to: vfio,
> chardev, vhost, vdpa, and iommufd.  Some devices need new kernel
> software interfaces to allow a descriptor to be used in a process that
> did not originally open it.

Yes, I still think this is a great idea.  It just can also be built on
top of something else than exec().

> In a containerized QEMU environment, cpr-exec reuses an existing QEMU
> container and its assigned resources.  By contrast, consider a design in
> which a new container is created on the same host as the target of the
> CPR operation.  Resources must be reserved for the new container, while
> the old container still reserves resources until the operation
> completes.

Note that if we need to share RAM anyway, the resource consumption should
be minimal, as mem should IMHO be the major concern (except CPU, but CPU
isn't a concern in this scenario) in the container world, and here the
shared guest mem shouldn't be accounted to the dest container.  So IMHO
it's about the metadata QEMU/KVM needs to do the hypervisor work, it
seems to me, and that should be relatively small.

In that case I don't yet see it as a huge improvement, if the dest
container is cheap to initiate.

> Avoiding overcommitment requires extra work in the management layer.

So it would be nice to know what needs to be overcommitted here.  I
confess I don't know much about containerized VMs, so maybe the page
cache can be a problem even if shared.  But I hope we can spell that out.
Logically IIUC memcg shouldn't account those page cache pages if
preallocated, because memcg accounting should be done at folio
allocation, at least, where the page cache should miss first (so not this
case..).

> This is one reason why a cloud provider may prefer cpr-exec.  A second
> reason is that the container may include agents with their own
> connections to the outside world, and such connections remain intact if
> the container is reused.
>
> How?
>
> All memory that is mapped by the guest is preserved in place.  Indeed,
> it must be, because it may be the target of DMA requests, which are not
> quiesced during cpr-exec.  All such memory must be mmap'able in new
> QEMU.  This is easy for named memory-backend objects, as long as they
> are mapped shared, because they are visible in the file system in both
> old and new QEMU.  Anonymous memory must be allocated using memfd_create
> rather than MAP_ANON, so the memfd's can be sent to new QEMU.  Pages
> that were locked in memory for DMA in old QEMU remain locked in new
> QEMU, because the descriptor of the device that locked them remains
> open.
> cpr-exec preserves descriptors across exec by clearing the CLOEXEC
> flag, and by sending the unique name and value of each descriptor to new
> QEMU via CPR state.
>
> For device descriptors, new QEMU reuses the descriptor when creating the
> device, rather than opening it again.  The same holds for chardevs.  For
> memfd descriptors, new QEMU mmap's the preserved memfd when a ramblock
> is created.
>
> CPR state cannot be sent over the normal migration channel, because
> devices and backends are created prior to reading the channel, so this
> mode sends CPR state over a second migration channel that is not visible
> to the user.  New QEMU reads the second channel prior to creating
> devices or backends.

Oh, maybe this is the reason that cpr-transfer will need a separate uri..

Thanks,

--
Peter Xu
On Thu, Jul 18, 2024 at 11:56:33AM -0400, Peter Xu wrote:
> Steve,
>
> On Sun, Jun 30, 2024 at 12:40:23PM -0700, Steve Sistare wrote:
> > What?
>
> Thanks for trying out with the cpr-transfer series.  I saw that that
> series missed most of the cc list here, so I'm attaching the link here:
>
> https://lore.kernel.org/r/1719776648-435073-1-git-send-email-steven.sistare@oracle.com
>
> I think most of my previous questions for the exec() solution are still
> there; I'll try to summarize them all in this reply as much as I can.
>
> > This patch series adds the live migration cpr-exec mode, which allows
> > the user to update QEMU with minimal guest pause time, by preserving
> > guest RAM in place, albeit with new virtual addresses in new QEMU, and
> > by preserving device file descriptors.
> >
> > The new user-visible interfaces are:
> > * cpr-exec (MigMode migration parameter)
> > * cpr-exec-command (migration parameter)
>
> I really, really hope we can avoid this..
>
> It's super cumbersome to pass in a qemu cmdline in a qemu migration
> parameter.. if we can do that with generic live migration ways, I hope
> we stick with the clean approach.

A further issue I have is that it presumes the QEMU configuration is
fully captured by the command line.  We have a long term design goal
in QEMU to get away from specifying configuration on the command
line, and move entirely to configuring QEMU via a series of QMP
commands.

This proposed command is introducing the concept of command line argv
as a formal part of the QEMU API, and IMHO that is undesirable.

Even today we have backend configuration steps only done via QMP, and
I'm wondering how it would fit in with how mgmt apps currently do live
migration.

The flipside, however, is that localhost migration via 2 separate QEMU
processes has issues where both QEMUs want to be opening the very same
file, and only 1 of them can ever have them open.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
On 8/5/2024 6:01 AM, Daniel P. Berrangé wrote:
> On Thu, Jul 18, 2024 at 11:56:33AM -0400, Peter Xu wrote:
>> [...]
>>> The new user-visible interfaces are:
>>> * cpr-exec (MigMode migration parameter)
>>> * cpr-exec-command (migration parameter)
>>
>> I really, really hope we can avoid this..
>>
>> It's super cumbersome to pass in a qemu cmdline in a qemu migration
>> parameter.. if we can do that with generic live migration ways, I hope
>> we stick with the clean approach.
>
> A further issue I have is that it presumes the QEMU configuration is
> fully captured by the command line.  We have a long term design goal
> in QEMU to get away from specifying configuration on the command
> line, and move entirely to configuring QEMU via a series of QMP
> commands.
>
> This proposed command is introducing the concept of command line argv
> as a formal part of the QEMU API, and IMHO that is undesirable.

Actually cpr-exec-command does not presume anything; it is an arbitrary
command with arbitrary arguments.  If in the future QEMU takes no
command-line arguments, then mgmt will pass a simple launcher command as
cpr-exec-command, and the launcher will issue QMP commands.  Or the
launcher will send a message to another mgmt agent to do so.  It is very
flexible.  Regardless, the API definition of cpr-exec-command will not
change.

As another example, in our cloud environment, when the mgmt agent starts
QEMU, it saves the QEMU args in a file.  My cpr-exec-command is just
"/bin/qemu-exec" with a few simple arguments.  That command reads QEMU
args from the file and exec's new QEMU.

> Even today we have backend configuration steps only done via QMP, and
> I'm wondering how it would fit in with how mgmt apps currently do live
> migration.

Sure, and that still works.  For live migration, mgmt starts new QEMU
with its static arguments plus -S plus -incoming, then mgmt detects QEMU
has reached the prelaunch state, then it issues QMP commands.  For live
update, cpr-exec-command has the static arguments plus -S, then mgmt
detects QEMU has reached the prelaunch state, then it issues QMP
commands.

> The flipside, however, is that localhost migration via 2 separate QEMU
> processes has issues where both QEMUs want to be opening the very same
> file, and only 1 of them can ever have them open.

Indeed, and "files" includes unix domain sockets.  Network ports also
conflict.  cpr-exec avoids such problems, and is one of the advantages of
the method that I forgot to promote.

- Steve
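To make the launcher pattern described above concrete, a /bin/qemu-exec
style helper could be as small as the following sketch.  The args-file
path and its one-argument-per-line format are invented here for
illustration; they are not part of the series.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Hypothetical launcher: reads one QEMU argument per line from a
     * file the mgmt agent wrote at first launch, then execs the new
     * QEMU binary with those arguments. */
    int main(void)
    {
        FILE *f = fopen("/var/run/qemu-args", "r");  /* invented path */
        if (!f) {
            perror("fopen");
            return 1;
        }
        static char *args[256];
        char line[4096];
        int n = 0;
        while (n < 255 && fgets(line, sizeof(line), f)) {
            line[strcspn(line, "\n")] = '\0';
            args[n++] = strdup(line);    /* args[0]: new QEMU binary */
        }
        fclose(f);
        if (n == 0) {
            return 1;
        }
        args[n] = NULL;
        execv(args[0], args);            /* only returns on failure */
        perror("execv");
        return 1;
    }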
On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
>> The flipside, however, is that localhost migration via 2 separate QEMU
>> processes has issues where both QEMUs want to be opening the very same
>> file, and only 1 of them can ever have them open.

I thought we used to have a similar issue with block devices, but I
assume it's been solved for years (whoever owns it will take a proper
file lock, IIRC, and QEMU migration should properly serialize the time
window on who's going to take the file lock).

Maybe this is about something else?

> Indeed, and "files" includes unix domain sockets.  Network ports also
> conflict.  cpr-exec avoids such problems, and is one of the advantages
> of the method that I forgot to promote.

I was thinking that's fine, as the host ports should be the backend of
the VM ports only anyway, so they don't need to be identical on both
sides?

IOW, my understanding is that it's the guest IP/ports/... which should
still be stable across migrations, where the host ports can be different
as long as the host ports can forward guest port messages correctly?

Thanks,

--
Peter Xu
On 8/13/2024 3:46 PM, Peter Xu wrote:
> On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
>>> The flipside, however, is that localhost migration via 2 separate QEMU
>>> processes has issues where both QEMUs want to be opening the very same
>>> file, and only 1 of them can ever have them open.
>
> I thought we used to have similar issue with block devices, but I assume
> it's solved for years (and whoever owns it will take proper file lock,
> IIRC, and QEMU migration should properly serialize the time window on who's
> going to take the file lock).
>
> Maybe this is about something else?
I don't have an example where this fails.
I can cause "Failed to get "write" lock" errors if two qemu instances open
the same block device, but the error is suppressed if you add the -incoming
argument, due to this code:
    blk_attach_dev()
        if (runstate_check(RUN_STATE_INMIGRATE))
            blk->disable_perm = true;
>> Indeed, and "files" includes unix domain sockets.
More on this -- the second qemu to bind a unix domain socket for
listening wins, and the first qemu loses it (because the second qemu
unlinks and recreates the socket path before binding, on the assumption
that it is stale).

One must use a different name for the socket for the second qemu, and
clients that wish to connect must be aware of the new name.
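For context, the "assumed stale" behavior comes from the usual
unix-socket listener pattern, sketched below; this is illustrative, not
QEMU's exact code.

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Typical listener setup: unlink any existing path before bind, on
     * the assumption that it is stale.  A second QEMU doing this
     * silently steals the path from the first. */
    static int listen_unix(const char *path)
    {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) {
            return -1;
        }
        struct sockaddr_un addr;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
        unlink(path);   /* removes the first listener's socket file */
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, 5) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }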
>> Network ports also conflict.
>> cpr-exec avoids such problems, and is one of the advantages of the method that
>> I forgot to promote.
>
> I was thinking that's fine, as the host ports should be the backend of the
> VM ports only anyway so they don't need to be identical on both sides?
>
> IOW, my understanding is it's the guest IP/ports/... which should still be
> stable across migrations, where the host ports can be different as long as
> the host ports can forward guest port messages correctly?
Yes, one must use a different host port number for the second qemu, and clients
that wish to connect must be aware of the new port.
That is my point -- cpr-transfer requires fiddling with such things.
cpr-exec does not.
- Steve
On Thu, Aug 15, 2024 at 04:55:20PM -0400, Steven Sistare wrote:
> On 8/13/2024 3:46 PM, Peter Xu wrote:
> > I thought we used to have a similar issue with block devices, but I
> > assume it's been solved for years (whoever owns it will take a proper
> > file lock, IIRC, and QEMU migration should properly serialize the time
> > window on who's going to take the file lock).
> >
> > Maybe this is about something else?
>
> I don't have an example where this fails.
>
> I can cause "Failed to get "write" lock" errors if two qemu instances
> open the same block device, but the error is suppressed if you add the
> -incoming argument, due to this code:
>
>     blk_attach_dev()
>         if (runstate_check(RUN_STATE_INMIGRATE))
>             blk->disable_perm = true;

Yep, this one is pretty much expected.

> > > Indeed, and "files" includes unix domain sockets.
>
> More on this -- the second qemu to bind a unix domain socket for
> listening wins, and the first qemu loses it (because the second qemu
> unlinks and recreates the socket path before binding, on the assumption
> that it is stale).
>
> One must use a different name for the socket for the second qemu, and
> clients that wish to connect must be aware of the new name.
>
> > > Network ports also conflict.  cpr-exec avoids such problems, and is
> > > one of the advantages of the method that I forgot to promote.
> >
> > I was thinking that's fine, as the host ports should be the backend of
> > the VM ports only anyway, so they don't need to be identical on both
> > sides?
> >
> > IOW, my understanding is that it's the guest IP/ports/... which should
> > still be stable across migrations, where the host ports can be
> > different as long as the host ports can forward guest port messages
> > correctly?
>
> Yes, one must use a different host port number for the second qemu, and
> clients that wish to connect must be aware of the new port.
>
> That is my point -- cpr-transfer requires fiddling with such things.
> cpr-exec does not.

Right, and my understanding is all these facilities are already there, so
no new code should be needed on reconnect issues if we support
cpr-transfer in Libvirt or similar management layers that support
migrations.

I suppose that's also why I'm slightly confused about how cpr-exec can
provide benefit for mgmt layers, so far, with these open projects.

It might affect Oracle's mgmt layers, but again I'm curious why Oracle
does not support these, because if it supports normal live migration, I'd
have thought it would need to support changed ports on the host, etc..

--
Peter Xu
On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
> [...]
>
> Right, and my understanding is all these facilities are already there,
> so no new code should be needed on reconnect issues if we support
> cpr-transfer in Libvirt or similar management layers that support
> migrations.

Note Libvirt explicitly blocks localhost migration today because
solving all these clashing resource problems is a huge can of worms
and it can't be made invisible to the user of libvirt in any practical
way.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
On Fri, Aug 16, 2024 at 04:16:50PM +0100, Daniel P. Berrangé wrote:
> On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
> > [...]
>
> Note Libvirt explicitly blocks localhost migration today because
> solving all these clashing resource problems is a huge can of worms
> and it can't be made invisible to the user of libvirt in any practical
> way.

Ahhh, OK.  I'm pretty surprised by this, as I thought at least kubevirt
supported local migration somehow on top of libvirt.

Does it mean that cpr-transfer is a no-go in this case, at least for
Libvirt to consume it (as cpr-* is only for local host migrations so
far)?  Even with all the rest of the issues we're discussing with
cpr-exec, is that the only way to go for Libvirt, then?

--
Peter Xu
On Fri, Aug 16, 2024 at 11:34:10AM -0400, Peter Xu wrote:
> On Fri, Aug 16, 2024 at 04:16:50PM +0100, Daniel P. Berrangé wrote:
> > [...]
> >
> > Note Libvirt explicitly blocks localhost migration today because
> > solving all these clashing resource problems is a huge can of worms
> > and it can't be made invisible to the user of libvirt in any practical
> > way.
>
> Ahhh, OK.  I'm pretty surprised by this, as I thought at least kubevirt
> supported local migration somehow on top of libvirt.

Since kubevirt runs inside a container, "localhost" migration is
effectively migrating between 2 completely separate OS installs
(containers) that happen to be on the same physical host.  IOW, it is a
cross-host migration from Libvirt & QEMU's POV, and there are no clashing
resources to worry about.

> Does it mean that cpr-transfer is a no-go in this case, at least for
> Libvirt to consume it (as cpr-* is only for local host migrations so
> far)?  Even with all the rest of the issues we're discussing with
> cpr-exec, is that the only way to go for Libvirt, then?

cpr-exec is certainly appealing from the POV of avoiding the clashing
resources problem in libvirt.

It has its own issues though, because libvirt runs all QEMU processes
with seccomp filters that block 'execve', as we consider QEMU to be
untrustworthy and thus don't want to allow it to exec anything!

I don't know which is the lesser evil from libvirt's POV.

Personally I see security controls as an overriding requirement for
everything.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
On Fri, Aug 16, 2024 at 05:00:32PM +0100, Daniel P. Berrangé wrote:
> On Fri, Aug 16, 2024 at 11:34:10AM -0400, Peter Xu wrote:
> > [...]
>
> Since kubevirt runs inside a container, "localhost" migration is
> effectively migrating between 2 completely separate OS installs
> (containers) that happen to be on the same physical host.  IOW, it is a
> cross-host migration from Libvirt & QEMU's POV, and there are no
> clashing resources to worry about.

OK, makes sense.

Then do you think it's possible to support cpr-transfer in that scenario
from Libvirt's POV?

> > Does it mean that cpr-transfer is a no-go in this case, at least for
> > Libvirt to consume it (as cpr-* is only for local host migrations so
> > far)?  Even with all the rest of the issues we're discussing with
> > cpr-exec, is that the only way to go for Libvirt, then?
>
> cpr-exec is certainly appealing from the POV of avoiding the clashing
> resources problem in libvirt.
>
> It has its own issues though, because libvirt runs all QEMU processes
> with seccomp filters that block 'execve', as we consider QEMU to be
> untrustworthy and thus don't want to allow it to exec anything!
>
> I don't know which is the lesser evil from libvirt's POV.
>
> Personally I see security controls as an overriding requirement for
> everything.

One thing I am aware of is that cpr-exec is not the only one who might
start to use exec() in QEMU.  TDX fundamentally will need to create
another key VM to deliver the keys, and the plan seems to be using exec()
too.  However, in that case, per my understanding, the exec() is optional
- the key VM can also be created by Libvirt.

IOW, it looks like we can still stick with execve() being blocked so far,
except for cpr-exec().

Hmm, this makes the decision harder to make.  We need to figure out a way
of knowing how to consume this feature for at least the open source virt
stack..  So far it looks like it's only possible (if we give seccomp high
priority) to use cpr-transfer, but only in a container.

Thanks,

--
Peter Xu
On 8/16/2024 12:17 PM, Peter Xu wrote:
> [...]
>
> One thing I am aware of is that cpr-exec is not the only one who might
> start to use exec() in QEMU.  TDX fundamentally will need to create
> another key VM to deliver the keys, and the plan seems to be using
> exec() too.  However, in that case, per my understanding, the exec() is
> optional - the key VM can also be created by Libvirt.
>
> IOW, it looks like we can still stick with execve() being blocked so
> far, except for cpr-exec().
>
> Hmm, this makes the decision harder to make.  We need to figure out a
> way of knowing how to consume this feature for at least the open source
> virt stack..  So far it looks like it's only possible (if we give
> seccomp high priority) to use cpr-transfer, but only in a container.

libvirt starts qemu with the -sandbox spawn=deny option, which blocks
fork, exec, and change-namespace operations.  I have a patch in my
workspace, to be submitted later, called "seccomp: fine-grained control
of fork, exec, and namespace" that allows libvirt to block fork and
namespace but allow exec.

- Steve
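For context, a spawn=deny style filter can be sketched with libseccomp
roughly as follows.  This is a simplified illustration, not QEMU's actual
-sandbox implementation (which also handles clone flags and more).

    #include <errno.h>
    #include <seccomp.h>   /* link with -lseccomp */

    /* Simplified spawn=deny: allow everything by default, fail the
     * process-spawning syscalls with EPERM. */
    static int deny_spawn(void)
    {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
        if (!ctx) {
            return -1;
        }
        int rc = 0;
        rc |= seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
                               SCMP_SYS(fork), 0);
        rc |= seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
                               SCMP_SYS(vfork), 0);
        rc |= seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
                               SCMP_SYS(execve), 0);
        rc |= seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM),
                               SCMP_SYS(execveat), 0);
        if (rc == 0) {
            rc = seccomp_load(ctx);   /* install into the kernel */
        }
        seccomp_release(ctx);
        return rc;
    }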
On Fri, Aug 16, 2024 at 01:09:23PM -0400, Steven Sistare wrote:
> [...]
>
> libvirt starts qemu with the -sandbox spawn=deny option, which blocks
> fork, exec, and change-namespace operations.  I have a patch in my
> workspace, to be submitted later, called "seccomp: fine-grained control
> of fork, exec, and namespace" that allows libvirt to block fork and
> namespace but allow exec.

IMHO this significantly undermines the protection offered.  fork(),
without execve(), is relatively benign from a security POV, mostly a
slightly greater resource consumption issue, compared to spawning
threads, which is always allowed.

Blocking execve() is the key security benefit, as that is a way to pick
up new privileges (through setuid), or bring new binary code into memory
(via the new ELF images loaded), or pick up new MAC policy through
transition rules, etc.

IOW, if you're going to allow 'exec', there's little point in blocking
fork IMHO, and as such this doesn't sound very appealing as something to
add to libvirt.

With regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
On Fri, Aug 16, 2024 at 01:09:23PM -0400, Steven Sistare wrote:
> On 8/16/2024 12:17 PM, Peter Xu wrote:
> > On Fri, Aug 16, 2024 at 05:00:32PM +0100, Daniel P. Berrangé wrote:
> > > On Fri, Aug 16, 2024 at 11:34:10AM -0400, Peter Xu wrote:
> > > > On Fri, Aug 16, 2024 at 04:16:50PM +0100, Daniel P. Berrangé wrote:
> > > > > On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
> > > > > > On Thu, Aug 15, 2024 at 04:55:20PM -0400, Steven Sistare wrote:
> > > > > > > On 8/13/2024 3:46 PM, Peter Xu wrote:
> > > > > > > > On Tue, Aug 06, 2024 at 04:56:18PM -0400, Steven Sistare wrote:
> > > > > > > > > > The flipside, however, is that localhost migration via 2 separate QEMU
> > > > > > > > > > processes has issues where both QEMUs want to be opening the very same
> > > > > > > > > > file, and only 1 of them can ever have them open.
> > > > > > > >
> > > > > > > > I thought we used to have a similar issue with block devices, but I assume
> > > > > > > > it's been solved for years (whoever owns the device will take a proper file
> > > > > > > > lock, IIRC, and QEMU migration should properly serialize the time window on
> > > > > > > > who's going to take the file lock).
> > > > > > > >
> > > > > > > > Maybe this is about something else?
> > > > > > >
> > > > > > > I don't have an example where this fails.
> > > > > > >
> > > > > > > I can cause "Failed to get "write" lock" errors if two qemu instances open
> > > > > > > the same block device, but the error is suppressed if you add the -incoming
> > > > > > > argument, due to this code:
> > > > > > >
> > > > > > >   blk_attach_dev()
> > > > > > >     if (runstate_check(RUN_STATE_INMIGRATE))
> > > > > > >       blk->disable_perm = true;
> > > > > >
> > > > > > Yep, this one is pretty much expected.
> > > > > >
> > > > > > > > > Indeed, and "files" includes unix domain sockets.
> > > > > > >
> > > > > > > More on this -- the second qemu to bind a unix domain socket for listening
> > > > > > > wins, and the first qemu loses it (because the second qemu unlinks and
> > > > > > > recreates the socket path before binding, on the assumption that it is
> > > > > > > stale).
> > > > > > >
> > > > > > > One must use a different name for the socket for the second qemu, and
> > > > > > > clients that wish to connect must be aware of the new path.
> > > > > > >
> > > > > > > > > Network ports also conflict.
> > > > > > > > > cpr-exec avoids such problems, and this is one of the advantages of
> > > > > > > > > the method that I forgot to promote.
> > > > > > > >
> > > > > > > > I was thinking that's fine, as the host ports should be the backend of the
> > > > > > > > VM ports only anyway, so they don't need to be identical on both sides?
> > > > > > > >
> > > > > > > > IOW, my understanding is that it's the guest IP/ports/... which should
> > > > > > > > still be stable across migrations, where the host ports can be different
> > > > > > > > as long as the host ports can forward guest port messages correctly?
> > > > > > >
> > > > > > > Yes, one must use a different host port number for the second qemu, and
> > > > > > > clients that wish to connect must be aware of the new port.
> > > > > > >
> > > > > > > That is my point -- cpr-transfer requires fiddling with such things.
> > > > > > > cpr-exec does not.
> > > > > >
> > > > > > Right, and my understanding is that all these facilities are already
> > > > > > there, so no new code should be needed on reconnect issues to support
> > > > > > cpr-transfer in Libvirt or similar management layers that support
> > > > > > migration.
> > > > >
> > > > > Note Libvirt explicitly blocks localhost migration today because
> > > > > solving all these clashing resource problems is a huge can of worms
> > > > > and it can't be made invisible to the user of libvirt in any practical
> > > > > way.
> > > >
> > > > Ahhh, OK. I'm pretty surprised by this, as I thought at least kubevirt
> > > > supported local migration somehow on top of libvirt.
> > >
> > > Since kubevirt runs inside a container, "localhost" migration
> > > is effectively migrating between 2 completely separate OS installs
> > > (containers) that happen to be on the same physical host. IOW, it
> > > is a cross-host migration from Libvirt & QEMU's POV, and there are
> > > no clashing resources to worry about.
> >
> > OK, makes sense.
> >
> > Then do you think it's possible to support cpr-transfer in that scenario
> > from Libvirt's POV?
> >
> > > > Does it mean that cpr-transfer is a no-go in this case, at least for
> > > > Libvirt to consume (as cpr-* is only for local host migrations so far)?
> > > > Even with all the rest of the issues we're discussing with cpr-exec, is
> > > > that the only way to go for Libvirt, then?
> > >
> > > cpr-exec is certainly appealing from the POV of avoiding the clashing
> > > resources problem in libvirt.
> > >
> > > It has its own issues though, because libvirt runs all QEMU processes with
> > > seccomp filters that block 'execve', as we consider QEMU to be
> > > untrustworthy and thus don't want to allow it to exec anything!
> > >
> > > I don't know which is the lesser evil from libvirt's POV.
> > >
> > > Personally I see security controls as an overriding requirement for
> > > everything.
> >
> > One thing I am aware of is that cpr-exec is not the only one that might
> > start to use exec() in QEMU. TDX fundamentally will need to create another
> > key VM to deliver the keys, and the plan seems to be to use exec() too.
> > However, in that case, per my understanding, the exec() is optional - the
> > key VM can also be created by Libvirt.
> >
> > IOW, it looks like we can still stick with execve() being blocked so far,
> > except for cpr-exec().
> >
> > Hmm, this makes the decision harder to make. We need to figure out a way
> > of knowing how to consume this feature for at least the open source virt
> > stack. So far it looks like it's only possible (if we give seccomp high
> > priority) to use cpr-transfer, but only in a container.
>
> libvirt starts qemu with the -sandbox spawn=deny option, which blocks fork,
> exec, and change-namespace operations. I have a patch in my workspace, to be
> submitted later, called "seccomp: fine-grained control of fork, exec, and
> namespace", that allows libvirt to block fork and namespace but allow exec.

The question is whether that would be accepted, and it also gives me the
feeling that even if it's accepted, it might limit the use cases that cpr
can apply to.

What I read so far from Dan is that cpr-transfer seems to also be preferred
from Libvirt's POV:

https://lore.kernel.org/r/Zr9-IvoRkGjre4CI@redhat.com

Did I read it right?

Thanks,

--
Peter Xu
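For context on the socket clash described above: listeners commonly unlink a
possibly stale socket path before bind(), so a second process binding the
same path silently steals it from a live first listener. A minimal sketch of
that pattern follows; it is an illustration of the general idiom, not a copy
of QEMU's actual implementation.

  /*
   * Sketch of the listener pattern that makes "the second qemu win":
   * the server unlinks the socket path before bind(), assuming any
   * existing file is stale. If a live listener already owns the path,
   * it silently loses it -- the clash described in the thread above.
   */
  #include <stdio.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  int listen_unix(const char *path)
  {
      int fd = socket(AF_UNIX, SOCK_STREAM, 0);
      if (fd < 0) {
          return -1;
      }

      struct sockaddr_un addr = { .sun_family = AF_UNIX };
      snprintf(addr.sun_path, sizeof(addr.sun_path), "%s", path);

      unlink(path);   /* assumes a leftover path is stale: the problem step */

      if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
          listen(fd, 5) < 0) {
          close(fd);
          return -1;
      }
      return fd;
  }

Because unlink() succeeds on a path that is still in use, neither process
gets an error; the first listener simply stops receiving new connections.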
On 8/21/2024 2:34 PM, Peter Xu wrote:
> On Fri, Aug 16, 2024 at 01:09:23PM -0400, Steven Sistare wrote:
> [... full thread context trimmed; see Peter's message above ...]
>
> > libvirt starts qemu with the -sandbox spawn=deny option, which blocks
> > fork, exec, and change-namespace operations. I have a patch in my
> > workspace, to be submitted later, called "seccomp: fine-grained control
> > of fork, exec, and namespace", that allows libvirt to block fork and
> > namespace but allow exec.
>
> The question is whether that would be accepted, and it also gives me the
> feeling that even if it's accepted, it might limit the use cases that cpr
> can apply to.

This is more acceptable for libvirt running in a container (such as under
kubevirt) with a limited set of binaries in /bin that could be exec'd. In
that case allowing exec is more reasonable.

> What I read so far from Dan is that cpr-transfer seems to also be preferred
> from Libvirt's POV:
>
> https://lore.kernel.org/r/Zr9-IvoRkGjre4CI@redhat.com
>
> Did I read it right?

I read that as: cpr-transfer is a viable option for libvirt. I don't hear him
excluding the possibility of cpr-exec.

I agree that "Dan the libvirt expert prefers cpr-transfer" is a good reason
to provide cpr-transfer. Which I will do.

So does "Steve the OCI expert prefers cpr-exec" carry equal weight, for also
providing cpr-exec?

We are at an impasse on this series. To make forward progress, I am willing
to reorder the patches and re-submit cpr-transfer as the first mode, so we
can review and pull that. I will submit cpr-exec as a follow-on and we can
resume our arguments then.

- Steve
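For context on why cpr-exec can skip the re-plumbing that cpr-transfer
needs: file descriptors opened without the close-on-exec flag survive
execve(), so memfd-backed RAM and device descriptors remain usable in the
new process image. A self-contained sketch of that property alone follows;
the handoff of the fd number via argv is purely illustrative and is not the
series' actual mechanism, which sends state over migration channels.

  /*
   * Sketch: an fd created without close-on-exec survives execve(), so
   * memfd-backed guest RAM can be re-mapped by the new process (at a
   * new virtual address) with its contents intact.
   */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
      if (argc == 2) {              /* "new QEMU": re-map the inherited fd */
          int fd = atoi(argv[1]);
          char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
          if (p == MAP_FAILED) {
              return 1;
          }
          printf("after exec, fd %d still maps: %s\n", fd, p);
          return 0;
      }

      /* "old QEMU": no MFD_CLOEXEC, so the fd is inherited across exec */
      int fd = memfd_create("guest-ram", 0);
      if (fd < 0 || ftruncate(fd, 4096) < 0) {
          return 1;
      }
      char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (p == MAP_FAILED) {
          return 1;
      }
      strcpy(p, "guest RAM preserved in place");

      char fdstr[16];
      snprintf(fdstr, sizeof(fdstr), "%d", fd);
      execl("/proc/self/exe", argv[0], fdstr, (char *)NULL);
      return 1;                     /* only reached if exec failed */
  }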
On Wed, Sep 04, 2024 at 04:58:14PM -0400, Steven Sistare wrote:
> On 8/21/2024 2:34 PM, Peter Xu wrote:
> > On Fri, Aug 16, 2024 at 01:09:23PM -0400, Steven Sistare wrote:
> > >
> > > libvirt starts qemu with the -sandbox spawn=deny option, which blocks
> > > fork, exec, and change-namespace operations. I have a patch in my
> > > workspace, to be submitted later, called "seccomp: fine-grained control
> > > of fork, exec, and namespace", that allows libvirt to block fork and
> > > namespace but allow exec.
> >
> > The question is whether that would be accepted, and it also gives me the
> > feeling that even if it's accepted, it might limit the use cases that cpr
> > can apply to.
>
> This is more acceptable for libvirt running in a container (such as under
> kubevirt) with a limited set of binaries in /bin that could be exec'd. In
> that case allowing exec is more reasonable.

Running inside a container does protect the host to a significant degree.
I'd say it is still important, however, to protect the control plane
(libvirt's daemons & kubevirt's agent) from the QEMU process being managed,
and in that case it still looks pretty compelling to deny exec.

> > What I read so far from Dan is that cpr-transfer seems to also be
> > preferred from Libvirt's POV:
> >
> > https://lore.kernel.org/r/Zr9-IvoRkGjre4CI@redhat.com
> >
> > Did I read it right?
>
> I read that as: cpr-transfer is a viable option for libvirt. I don't hear
> him excluding the possibility of cpr-exec.
>
> I agree that "Dan the libvirt expert prefers cpr-transfer" is a good reason
> to provide cpr-transfer. Which I will do.

Both approaches have significant challenges for integration, but my general
preference is towards a solution that doesn't require undermining our
security protections.

When starting a VM we have no knowledge of whether a user may want to use
CPR at a later date. We're not going to disable the seccomp sandbox by
default, so that means cpr-exec would not be viable in a default VM
deployment. Admins could choose to modify /etc/libvirt/qemu.conf to turn off
seccomp, but I'm very much not in favour of introducing a feature that
requires them to do this. It would be a first in libvirt, as everything else
we support is possible to use with seccomp enabled. The seccomp opt-out is
essentially just there as an emergency escape hatch, not as something we
want used in production.

> We are at an impasse on this series. To make forward progress, I am willing
> to reorder the patches and re-submit cpr-transfer as the first mode, so we
> can review and pull that. I will submit cpr-exec as a follow-on and we can
> resume our arguments then.

Considering the end result, are there CPR usage scenarios that are possible
with cpr-exec that can't be achieved with cpr-transfer?

Supporting two ways of doing the same thing increases the maintenance burden
for QEMU maintainers, as well as for downstream testing engineers who have
to validate this functionality. So unless there's a compelling need to
support both cpr-transfer and cpr-exec, it'd be nice to standardize on just
one of them. cpr-transfer does look like it's probably more viable, even
with its own challenges wrt resources being opened twice.

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
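For context on the sandbox being discussed: QEMU's -sandbox option is built
on seccomp, and spawn=deny makes process-spawning syscalls fail. A rough
standalone sketch of that idea using libseccomp follows; it is an
illustration only, not QEMU's actual filter, and it covers just the exec
syscalls (a real spawn=deny also has to handle the fork/clone family).

  /*
   * Sketch: deny exec the way "-sandbox on,spawn=deny" does conceptually.
   * Build with: gcc demo.c -lseccomp
   */
  #include <errno.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <seccomp.h>

  int main(void)
  {
      /* Default action: allow every syscall not matched by a rule. */
      scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);
      if (!ctx) {
          return 1;
      }

      /* Fail exec with EPERM instead of killing the process. */
      seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(execve), 0);
      seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(execveat), 0);

      if (seccomp_load(ctx) < 0) {  /* install the BPF filter in the kernel */
          seccomp_release(ctx);
          return 1;
      }
      seccomp_release(ctx);

      /* With the filter installed, exec fails -- this is exactly what
       * would break cpr-exec under libvirt's default sandbox. */
      execlp("true", "true", (char *)NULL);
      perror("execlp");             /* expected: Operation not permitted */
      return 0;
  }

Steve's proposed fine-grained control would split this "spawn" group so that
exec can be allowed while fork and namespace changes stay blocked.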
On Wed, Sep 04, 2024 at 04:58:14PM -0400, Steven Sistare wrote:
> On 8/21/2024 2:34 PM, Peter Xu wrote:
> [... thread context trimmed; see the messages above ...]
>
> > What I read so far from Dan is that cpr-transfer seems to also be
> > preferred from Libvirt's POV:
> >
> > https://lore.kernel.org/r/Zr9-IvoRkGjre4CI@redhat.com
> >
> > Did I read it right?
>
> I read that as: cpr-transfer is a viable option for libvirt. I don't hear
> him excluding the possibility of cpr-exec.

I would prefer not to have two solutions, because if they solve the same
problem, one of them may end up as a leftover at some point, unless they
suit different needs. But I don't feel strongly, especially if cpr-exec is a
light addition once cpr-transfer is there.

> I agree that "Dan the libvirt expert prefers cpr-transfer" is a good reason
> to provide cpr-transfer. Which I will do.
>
> So does "Steve the OCI expert prefers cpr-exec" carry equal weight, for
> also providing cpr-exec?

As an open source project, Libvirt using it means the feature can be
actively used and tested. When, e.g., there's a new feature replacing CPR,
we know when we can obsolete the old CPR, no matter whether -exec or
-transfer.

Closed-source projects can be great in themselves, but they are naturally
less important in open source communities, IMHO, because they are not
accessible to anyone in the community. E.g., we never know when a
closed-source project has abandoned a feature, so QEMU can carry that
feature forever without knowing who's using it. To me it's the same as Linux
not maintaining a kABI for out-of-tree drivers. It's just that here the open
source virt stack is a huge project and QEMU plays its role within it.

> We are at an impasse on this series. To make forward progress, I am willing
> to reorder the patches and re-submit cpr-transfer as the first mode, so we
> can review and pull that. I will submit cpr-exec as a follow-on and we can
> resume our arguments then.

Yes, this could better justify how small a change cpr-exec needs on top of
cpr-transfer, but I'd still wait for some comments from Dan or others in
case they chime in, just to avoid sinking your time into rebases.

Thanks,

--
Peter Xu
On Wed, Sep 04, 2024 at 06:23:50PM -0400, Peter Xu wrote:
> On Wed, Sep 04, 2024 at 04:58:14PM -0400, Steven Sistare wrote:
> > On 8/21/2024 2:34 PM, Peter Xu wrote:
> > > What I read so far from Dan is that cpr-transfer seems to also be
> > > preferred from Libvirt's POV:
> > >
> > > https://lore.kernel.org/r/Zr9-IvoRkGjre4CI@redhat.com
> > >
> > > Did I read it right?
> >
> > I read that as: cpr-transfer is a viable option for libvirt. I don't hear
> > him excluding the possibility of cpr-exec.
>
> I would prefer not to have two solutions, because if they solve the same
> problem, one of them may end up as a leftover at some point, unless they
> suit different needs. But I don't feel strongly, especially if cpr-exec is
> a light addition once cpr-transfer is there.
>
> > I agree that "Dan the libvirt expert prefers cpr-transfer" is a good
> > reason to provide cpr-transfer. Which I will do.
> >
> > So does "Steve the OCI expert prefers cpr-exec" carry equal weight, for
> > also providing cpr-exec?
>
> As an open source project, Libvirt using it means the feature can be
> actively used and tested. When, e.g., there's a new feature replacing CPR,
> we know when we can obsolete the old CPR, no matter whether -exec or
> -transfer.
>
> Closed-source projects can be great in themselves, but they are naturally
> less important in open source communities, IMHO, because they are not
> accessible to anyone in the community. E.g., we never know when a
> closed-source project has abandoned a feature, so QEMU can carry that
> feature forever without knowing who's using it.

In terms of closed-source projects, effectively they don't exist from a QEMU
maintainer's POV. Our deprecation & removal policy is designed so that we
don't need to think about who is using stuff. When QEMU deprecates
something, any users (whether open source or closed source) have 2 releases
in which to notice this and either make a request that we cancel the
deprecation, or change their code.

Libvirt is special in the sense that we'll CC the libvirt mailing list on
changes to the deprecated.rst file, and we'll often not propose deprecations
in the first place if we know libvirt is using something, since we can ask
libvirt quite easily & libvirt people pay attention to QEMU.

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
On Fri, Aug 16, 2024 at 12:17:30PM -0400, Peter Xu wrote:
> On Fri, Aug 16, 2024 at 05:00:32PM +0100, Daniel P. Berrangé wrote:
> [... earlier thread context trimmed; see Peter Xu's message above ...]
>
> One thing I am aware of is that cpr-exec is not the only one that might
> start to use exec() in QEMU. TDX fundamentally will need to create another
> key VM to deliver the keys, and the plan seems to be to use exec() too.
> However, in that case, per my understanding, the exec() is optional - the
> key VM can also be created by Libvirt.

Since nothing is merged, I'd consider whatever might have been presented wrt
TDX migration to all be open for potential re-design.

With SNP there is the SVSM paravisor, which runs inside the guest to provide
services, and IIUC it is intended to address migration service needs.
There's a push to support SVSM with TDX too, in order to enable vTPM
support. With that, it might make sense to explore whether SVSM can service
migration for TDX too, instead of having a separate parallel VM on the host.
IMHO it is highly desirable to have a common architecture for CVM migration
from QEMU's POV, and from a libvirt POV I'd like to avoid having extra host
VMs too.

> IOW, it looks like we can still stick with execve() being blocked so far,
> except for cpr-exec().
>
> Hmm, this makes the decision harder to make. We need to figure out a way
> of knowing how to consume this feature for at least the open source virt
> stack. So far it looks like it's only possible (if we give seccomp high
> priority) to use cpr-transfer, but only in a container.

Or we have cpr-transfer, but libvirt has to play more games to solve the
clashing resources problem, by making much more use of FD passing, and/or by
changing path conventions, or a mix of both.

What might make this viable is that, IIUC, CPR only permits a subset of
backends to be used, so libvirt doesn't have to solve clashing resources for
/everything/, just the parts that are supported by CPR.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
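For context on the FD passing Daniel mentions: an already-open descriptor
can be handed to another process over a UNIX domain socket with SCM_RIGHTS,
so the receiver never has to open the contested path itself. A minimal
sketch of the sending side follows, assuming sock is a connected AF_UNIX
stream socket; this shows the generic primitive, not libvirt's or QEMU's
actual code.

  /*
   * Sketch: pass an open fd to another process via SCM_RIGHTS. The
   * kernel installs a duplicate of the fd in the receiving process.
   */
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  int send_fd(int sock, int fd)
  {
      char payload = 'F';              /* must send at least 1 data byte */
      struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
      union {
          struct cmsghdr hdr;
          char buf[CMSG_SPACE(sizeof(int))];
      } u;
      struct msghdr msg = {
          .msg_iov = &iov, .msg_iovlen = 1,
          .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
      };
      struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;    /* carry the fd in ancillary data */
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

      return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
  }

The receiver uses recvmsg() and reads the new fd out of the matching
control message; the two processes then share the same open file
description, which is also how resources avoid being "opened twice".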
On 8/16/2024 11:16 AM, Daniel P. Berrangé wrote:
> On Fri, Aug 16, 2024 at 11:06:10AM -0400, Peter Xu wrote:
> [... earlier thread context trimmed; see the messages above ...]
>
> Note Libvirt explicitly blocks localhost migration today because
> solving all these clashing resource problems is a huge can of worms
> and it can't be made invisible to the user of libvirt in any practical
> way.

Thank you! This is what I suspected but could not prove due to my lack of
experience with libvirt.

- Steve
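For context on the "write" lock mentioned earlier in the thread: QEMU's
image locking is built on fcntl locks (open file description locks on
Linux). The sketch below shows only the underlying primitive; QEMU's real
scheme in the block layer encodes permissions as distinct byte ranges, so
treat this as a simplified, assumption-laden illustration.

  /*
   * Sketch: take an exclusive OFD lock on an image file. A second
   * process attempting the same lock gets EAGAIN, which is the kind of
   * conflict QEMU surfaces as "Failed to get "write" lock".
   */
  #define _GNU_SOURCE             /* for F_OFD_SETLK on Linux */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int lock_image_for_write(const char *path)
  {
      int fd = open(path, O_RDWR);
      if (fd < 0) {
          return -1;
      }

      struct flock fl = {
          .l_type = F_WRLCK,      /* exclusive lock */
          .l_whence = SEEK_SET,
          .l_start = 0,
          .l_len = 1,             /* one byte, like a permission bit */
          /* l_pid stays 0, as required for OFD locks */
      };

      if (fcntl(fd, F_OFD_SETLK, &fl) < 0) {
          perror("image is locked by another process");
          close(fd);
          return -1;
      }
      return fd;                  /* lock held while this fd stays open */
  }

Because the lock belongs to the open file description rather than the
process, it survives fork and exec as long as the descriptor stays open,
which is consistent with cpr-exec keeping image locks intact across the
exec.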