This patch series adds the live migration cpr-exec mode.

The new user-visible interfaces are:
  * cpr-exec (MigMode migration parameter)
  * cpr-exec-command (migration parameter)

cpr-exec mode is similar in most respects to cpr-transfer mode, with the
primary difference being that old QEMU directly exec's new QEMU. The user
specifies the command to exec new QEMU in the migration parameter
cpr-exec-command.

Why?

In a containerized QEMU environment, cpr-exec reuses an existing QEMU
container and its assigned resources. By contrast, cpr-transfer mode
requires a new container to be created on the same host as the target of
the CPR operation. Resources must be reserved for the new container, while
the old container still reserves resources until the operation completes.
Avoiding over commitment requires extra work in the management layer.
This is one reason why a cloud provider may prefer cpr-exec. A second reason
is that the container may include agents with their own connections to the
outside world, and such connections remain intact if the container is reused.

How?

cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
and by sending the unique name and value of each descriptor to new QEMU
via CPR state.

CPR state cannot be sent over the normal migration channel, because devices
and backends are created prior to reading the channel, so this mode sends
CPR state over a second migration channel that is not visible to the user.
New QEMU reads the second channel prior to creating devices or backends.

The exec itself is trivial. After writing to the migration channels, the
migration code calls a new main-loop hook to perform the exec.

Example:

In this example, we simply restart the same version of QEMU, but in
a real scenario one would use a new QEMU binary path in cpr-exec-command.

  # qemu-kvm -monitor stdio
    -object memory-backend-memfd,id=ram0,size=1G
    -machine memory-backend=ram0 -machine aux-ram-share=on ...

  QEMU 10.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running
  (qemu) migrate_set_parameter mode cpr-exec
  (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
  (qemu) migrate -d file:vm.state
  (qemu) QEMU 10.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running

Steve Sistare (9):
  migration: multi-mode notifier
  migration: add cpr_walk_fd
  oslib: qemu_clear_cloexec
  vl: helper to request exec
  migration: cpr-exec-command parameter
  migration: cpr-exec save and load
  migration: cpr-exec mode
  migration: cpr-exec docs
  vfio: cpr-exec mode

 docs/devel/migration/CPR.rst   | 103 ++++++++++++++++++++++++-
 qapi/migration.json            |  46 ++++++++++-
 include/migration/cpr.h        |   9 +++
 include/migration/misc.h       |  12 +++
 include/qemu/osdep.h           |   9 +++
 include/system/runstate.h      |   3 +
 hw/vfio/container.c            |   3 +-
 hw/vfio/cpr-iommufd.c          |   3 +-
 hw/vfio/cpr-legacy.c           |   9 ++-
 hw/vfio/cpr.c                  |  13 ++--
 migration/cpr-exec.c           | 168 +++++++++++++++++++++++++++++++++++++++++
 migration/cpr.c                |  39 +++++++++-
 migration/migration-hmp-cmds.c |  25 ++++++
 migration/migration.c          |  70 +++++++++++++----
 migration/options.c            |  14 ++++
 migration/ram.c                |   1 +
 migration/vmstate-types.c      |   8 ++
 system/runstate.c              |  29 +++++++
 util/oslib-posix.c             |   9 +++
 util/oslib-win32.c             |   4 +
 hmp-commands.hx                |   2 +-
 migration/meson.build          |   1 +
 migration/trace-events         |   1 +
 23 files changed, 548 insertions(+), 33 deletions(-)
 create mode 100644 migration/cpr-exec.c

-- 
1.8.3.1
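For reference, the descriptor-preservation step described under "How?" boils down
to a plain fcntl(2) call per descriptor. The series adds a helper for this (per the
"oslib: qemu_clear_cloexec" patch title); the sketch below is illustrative only and
does not claim to match the helper's actual name or signature.

  #include <fcntl.h>

  /* Minimal sketch: let fd survive exec() by clearing its close-on-exec flag. */
  static int clear_cloexec(int fd)
  {
      int flags = fcntl(fd, F_GETFD);

      if (flags < 0) {
          return -1;
      }
      return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
  }

A descriptor treated this way keeps its integer value across the exec, which is why
recording each descriptor's name and value in CPR state is enough for new QEMU to
reattach it.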
On 14.08.25 20:17, Steve Sistare wrote:
> This patch series adds the live migration cpr-exec mode.
>
> The new user-visible interfaces are:
>    * cpr-exec (MigMode migration parameter)
>    * cpr-exec-command (migration parameter)
>
> cpr-exec mode is similar in most respects to cpr-transfer mode, with the
> primary difference being that old QEMU directly exec's new QEMU. The user
> specifies the command to exec new QEMU in the migration parameter
> cpr-exec-command.
>
> Why?
>
> In a containerized QEMU environment, cpr-exec reuses an existing QEMU
> container and its assigned resources. By contrast, cpr-transfer mode
> requires a new container to be created on the same host as the target of
> the CPR operation. Resources must be reserved for the new container, while
> the old container still reserves resources until the operation completes.
> Avoiding over commitment requires extra work in the management layer.
> This is one reason why a cloud provider may prefer cpr-exec. A second reason
> is that the container may include agents with their own connections to the
> outside world, and such connections remain intact if the container is reused.
>

My two cents:

We considered switching to cpr-exec, and went even further: we thought about
loading a new version of the QEMU binary into the running QEMU process (like a
library) and switching to it. But in the end we decided to keep our current
approach (starting new QEMU in a separate process) and use CPR transfer (which
finally led to my current in-list proposals of simply migrating all fds over the
main migration channel).

First, we don't run QEMU in docker, so we probably don't hit some of the problems
around that. The real problem for us is the migration downtime for switching the
network and disk.

Still, why don't we want cpr-exec? Two reasons:

1. The current approach seems safer against errors during migration: we have a
better chance of simply saying "cont" on the source process if something goes
wrong.

2. With a second process we have more opportunities to minimize downtime, as we
can do some initialization in the new QEMU process _before_ migration (when the
second process starts, the first is still running).

I also wondered: could we do a kind of "exec" but still avoid losing [2]? That
leads to the idea of loading the new QEMU binary into the running process (like a
library) and... starting to execute it in parallel with the old one? But that
looks like trying to reinvent processes, which is obviously a bad idea.

-- 
Best regards,
Vladimir
Add Vladimir and Dan. On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote: > This patch series adds the live migration cpr-exec mode. > > The new user-visible interfaces are: > * cpr-exec (MigMode migration parameter) > * cpr-exec-command (migration parameter) > > cpr-exec mode is similar in most respects to cpr-transfer mode, with the > primary difference being that old QEMU directly exec's new QEMU. The user > specifies the command to exec new QEMU in the migration parameter > cpr-exec-command. > > Why? > > In a containerized QEMU environment, cpr-exec reuses an existing QEMU > container and its assigned resources. By contrast, cpr-transfer mode > requires a new container to be created on the same host as the target of > the CPR operation. Resources must be reserved for the new container, while > the old container still reserves resources until the operation completes. > Avoiding over commitment requires extra work in the management layer. Can we spell out what are these resources? CPR definitely relies on completely shared memory. That's already not a concern. CPR resolves resources that are bound to devices like VFIO by passing over FDs, these are not over commited either. Is it accounting QEMU/KVM process overhead? That would really be trivial, IMHO, but maybe something else? > This is one reason why a cloud provider may prefer cpr-exec. A second reason > is that the container may include agents with their own connections to the > outside world, and such connections remain intact if the container is reused. We discussed about this one. Personally I still cannot understand why this is a concern if the agents can be trivially started as a new instance. But I admit I may not know the whole picture. To me, the above point is more persuasive, but I'll need to understand which part that is over-commited that can be a problem. After all, cloud hosts should preserve some extra memory anyway to make sure dynamic resources allocations all the time (e.g., when live migration starts, KVM pgtables can drastically increase if huge pages are enabled, for PAGE_SIZE trackings), I assumed the over-commit portion should be less that those.. and when it's also temporary (src QEMU will release all resources after live upgrade) then it looks manageable. > > How? > > cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag, > and by sending the unique name and value of each descriptor to new QEMU > via CPR state. > > CPR state cannot be sent over the normal migration channel, because devices > and backends are created prior to reading the channel, so this mode sends > CPR state over a second migration channel that is not visible to the user. > New QEMU reads the second channel prior to creating devices or backends. > > The exec itself is trivial. After writing to the migration channels, the > migration code calls a new main-loop hook to perform the exec. > > Example: > > In this example, we simply restart the same version of QEMU, but in > a real scenario one would use a new QEMU binary path in cpr-exec-command. > > # qemu-kvm -monitor stdio > -object memory-backend-memfd,id=ram0,size=1G > -machine memory-backend=ram0 -machine aux-ram-share=on ... > > QEMU 10.1.50 monitor - type 'help' for more information > (qemu) info status > VM status: running > (qemu) migrate_set_parameter mode cpr-exec > (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... 
-incoming file:vm.state > (qemu) migrate -d file:vm.state > (qemu) QEMU 10.1.50 monitor - type 'help' for more information > (qemu) info status > VM status: running > > Steve Sistare (9): > migration: multi-mode notifier > migration: add cpr_walk_fd > oslib: qemu_clear_cloexec > vl: helper to request exec > migration: cpr-exec-command parameter > migration: cpr-exec save and load > migration: cpr-exec mode > migration: cpr-exec docs > vfio: cpr-exec mode The other thing is, as Vladimir is working on (looks like) a cleaner way of passing FDs fully relying on unix sockets, I want to understand better on the relationships of his work and the exec model. I still personally think we should always stick with unix sockets, but I'm open to be convinced on above limitations. If exec is better than cpr-transfer in any way, the hope is more people can and should adopt it. We also have no answer yet on how cpr-exec can resolve container world with seccomp forbidding exec. I guess that's a no-go. It's definitely a downside instead. Better mention that in the cover letter. Thanks, -- Peter Xu
On 9/5/2025 12:48 PM, Peter Xu wrote: > Add Vladimir and Dan. > > On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote: >> This patch series adds the live migration cpr-exec mode. >> >> The new user-visible interfaces are: >> * cpr-exec (MigMode migration parameter) >> * cpr-exec-command (migration parameter) >> >> cpr-exec mode is similar in most respects to cpr-transfer mode, with the >> primary difference being that old QEMU directly exec's new QEMU. The user >> specifies the command to exec new QEMU in the migration parameter >> cpr-exec-command. >> >> Why? >> >> In a containerized QEMU environment, cpr-exec reuses an existing QEMU >> container and its assigned resources. By contrast, cpr-transfer mode >> requires a new container to be created on the same host as the target of >> the CPR operation. Resources must be reserved for the new container, while >> the old container still reserves resources until the operation completes. >> Avoiding over commitment requires extra work in the management layer. > > Can we spell out what are these resources? > > CPR definitely relies on completely shared memory. That's already not a > concern. > > CPR resolves resources that are bound to devices like VFIO by passing over > FDs, these are not over commited either. > > Is it accounting QEMU/KVM process overhead? That would really be trivial, > IMHO, but maybe something else? Accounting is one issue, and it is not trivial. Another is arranging exclusive use of a set of CPUs, the same set for the old and new container, concurrently. Another is avoiding namespace conflicts, the kind that make localhost migration difficult. >> This is one reason why a cloud provider may prefer cpr-exec. A second reason >> is that the container may include agents with their own connections to the >> outside world, and such connections remain intact if the container is reused. > > We discussed about this one. Personally I still cannot understand why this > is a concern if the agents can be trivially started as a new instance. But > I admit I may not know the whole picture. To me, the above point is more > persuasive, but I'll need to understand which part that is over-commited > that can be a problem. Agents can be restarted, but that would sever the connection to the outside world. With cpr-transfer or any local migration, you would need agents outside of old and new containers that persist. With cpr-exec, connections can be preserved without requiring the end user to reconnect, and can be done trivially, by preserving chardevs. With that support in qemu, the management layer does nothing extra to preserve them. chardev support is not part of this series but is part of my vision, and makes exec mode even more compelling. Management layers have a lot of code and complexity to manage live migration, resources, and connections. It requires modification to support cpr-transfer. All that can be bypassed with exec mode. Less complexity, less maintainance, and fewer points of failure. I know this because I implemented exec mode in OCI at Oracle, and we use it in production. > After all, cloud hosts should preserve some extra memory anyway to make > sure dynamic resources allocations all the time (e.g., when live migration > starts, KVM pgtables can drastically increase if huge pages are enabled, > for PAGE_SIZE trackings), I assumed the over-commit portion should be less > that those.. and when it's also temporary (src QEMU will release all > resources after live upgrade) then it looks manageable. >> >> How? 
>> >> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag, >> and by sending the unique name and value of each descriptor to new QEMU >> via CPR state. >> >> CPR state cannot be sent over the normal migration channel, because devices >> and backends are created prior to reading the channel, so this mode sends >> CPR state over a second migration channel that is not visible to the user. >> New QEMU reads the second channel prior to creating devices or backends. >> >> The exec itself is trivial. After writing to the migration channels, the >> migration code calls a new main-loop hook to perform the exec. >> >> Example: >> >> In this example, we simply restart the same version of QEMU, but in >> a real scenario one would use a new QEMU binary path in cpr-exec-command. >> >> # qemu-kvm -monitor stdio >> -object memory-backend-memfd,id=ram0,size=1G >> -machine memory-backend=ram0 -machine aux-ram-share=on ... >> >> QEMU 10.1.50 monitor - type 'help' for more information >> (qemu) info status >> VM status: running >> (qemu) migrate_set_parameter mode cpr-exec >> (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state >> (qemu) migrate -d file:vm.state >> (qemu) QEMU 10.1.50 monitor - type 'help' for more information >> (qemu) info status >> VM status: running >> >> Steve Sistare (9): >> migration: multi-mode notifier >> migration: add cpr_walk_fd >> oslib: qemu_clear_cloexec >> vl: helper to request exec >> migration: cpr-exec-command parameter >> migration: cpr-exec save and load >> migration: cpr-exec mode >> migration: cpr-exec docs >> vfio: cpr-exec mode > > The other thing is, as Vladimir is working on (looks like) a cleaner way of > passing FDs fully relying on unix sockets, I want to understand better on > the relationships of his work and the exec model. His work is based on my work -- the ability to embed a file descriptor in a migration stream with a VMSTATE_FD declaration -- so it is compatible. The cpr-exec series preserves VMSTATE_FD across exec by remembering the fd integer and embedding that in the data stream. See the changes in vmstate-types.c in [PATCH V3 7/9] migration: cpr-exec mode. Thus cpr-exec will still preserve tap devices via Vladimir's code. > I still personally think we should always stick with unix sockets, but I'm > open to be convinced on above limitations. If exec is better than > cpr-transfer in any way, the hope is more people can and should adopt it. Various people and companies have expressed interest in CPR and want to explore cpr-exec. Vladimir was one, he chose transfer instead, and that is fine, but give people the option. And Oracle continues to use cpr-exec mode. There is no downside to supporting cpr-exec mode. It is astonishing how much code is shared by the cpr-transfer and cpr-exec modes. Most of the code in this series is factored into specific cpr-exec files and functions, code that will never run for any other reason. There are very few conditionals in common code that do something different for exec mode. > We also have no answer yet on how cpr-exec can resolve container world with > seccomp forbidding exec. I guess that's a no-go. It's definitely a > downside instead. Better mention that in the cover letter. The key is limiting the contents of the container, so exec only has a limited and known safe set of things to target. I'll add that to the cover letter. - Steve
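For readers who have not seen the VMSTATE_FD declaration Steve refers to, the sketch
below shows roughly how a device could mark a descriptor for preservation. The struct,
field, and vmstate names are invented for illustration, and the macro is assumed to
follow the usual (field, state type) shape of the other scalar VMSTATE macros.

  typedef struct DemoDevState {
      int fd;   /* descriptor to carry across CPR */
  } DemoDevState;

  static const VMStateDescription vmstate_demo_dev = {
      .name = "demo-dev",
      .version_id = 1,
      .minimum_version_id = 1,
      .fields = (const VMStateField[]) {
          /* Carried over the migration channel: as an fd passed on the unix
           * socket, or, per this series, as the raw integer value across exec. */
          VMSTATE_FD(fd, DemoDevState),
          VMSTATE_END_OF_LIST()
      }
  };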
On Tue, Sep 09, 2025 at 10:36:16AM -0400, Steven Sistare wrote: > On 9/5/2025 12:48 PM, Peter Xu wrote: > > Add Vladimir and Dan. > > > > On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote: > > > This patch series adds the live migration cpr-exec mode. > > > > > > The new user-visible interfaces are: > > > * cpr-exec (MigMode migration parameter) > > > * cpr-exec-command (migration parameter) > > > > > > cpr-exec mode is similar in most respects to cpr-transfer mode, with the > > > primary difference being that old QEMU directly exec's new QEMU. The user > > > specifies the command to exec new QEMU in the migration parameter > > > cpr-exec-command. > > > > > > Why? > > > > > > In a containerized QEMU environment, cpr-exec reuses an existing QEMU > > > container and its assigned resources. By contrast, cpr-transfer mode > > > requires a new container to be created on the same host as the target of > > > the CPR operation. Resources must be reserved for the new container, while > > > the old container still reserves resources until the operation completes. > > > Avoiding over commitment requires extra work in the management layer. > > > > Can we spell out what are these resources? > > > > CPR definitely relies on completely shared memory. That's already not a > > concern. > > > > CPR resolves resources that are bound to devices like VFIO by passing over > > FDs, these are not over commited either. > > > > Is it accounting QEMU/KVM process overhead? That would really be trivial, > > IMHO, but maybe something else? > > Accounting is one issue, and it is not trivial. Another is arranging exclusive > use of a set of CPUs, the same set for the old and new container, concurrently. > Another is avoiding namespace conflicts, the kind that make localhost migration > difficult. > > > > This is one reason why a cloud provider may prefer cpr-exec. A second reason > > > is that the container may include agents with their own connections to the > > > outside world, and such connections remain intact if the container is reused. > > > > We discussed about this one. Personally I still cannot understand why this > > is a concern if the agents can be trivially started as a new instance. But > > I admit I may not know the whole picture. To me, the above point is more > > persuasive, but I'll need to understand which part that is over-commited > > that can be a problem. > > Agents can be restarted, but that would sever the connection to the outside > world. With cpr-transfer or any local migration, you would need agents > outside of old and new containers that persist. > > With cpr-exec, connections can be preserved without requiring the end user > to reconnect, and can be done trivially, by preserving chardevs. With that > support in qemu, the management layer does nothing extra to preserve them. > chardev support is not part of this series but is part of my vision, > and makes exec mode even more compelling. > > Management layers have a lot of code and complexity to manage live migration, > resources, and connections. It requires modification to support cpr-transfer. > All that can be bypassed with exec mode. Less complexity, less maintainance, > and fewer points of failure. I know this because I implemented exec mode in > OCI at Oracle, and we use it in production. I wonders how this part works in Vladimir's use case. 
> > After all, cloud hosts should preserve some extra memory anyway to make > > sure dynamic resources allocations all the time (e.g., when live migration > > starts, KVM pgtables can drastically increase if huge pages are enabled, > > for PAGE_SIZE trackings), I assumed the over-commit portion should be less > > that those.. and when it's also temporary (src QEMU will release all > > resources after live upgrade) then it looks manageable. >> > > > How? > > > > > > cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag, > > > and by sending the unique name and value of each descriptor to new QEMU > > > via CPR state. > > > > > > CPR state cannot be sent over the normal migration channel, because devices > > > and backends are created prior to reading the channel, so this mode sends > > > CPR state over a second migration channel that is not visible to the user. > > > New QEMU reads the second channel prior to creating devices or backends. > > > > > > The exec itself is trivial. After writing to the migration channels, the > > > migration code calls a new main-loop hook to perform the exec. > > > > > > Example: > > > > > > In this example, we simply restart the same version of QEMU, but in > > > a real scenario one would use a new QEMU binary path in cpr-exec-command. > > > > > > # qemu-kvm -monitor stdio > > > -object memory-backend-memfd,id=ram0,size=1G > > > -machine memory-backend=ram0 -machine aux-ram-share=on ... > > > > > > QEMU 10.1.50 monitor - type 'help' for more information > > > (qemu) info status > > > VM status: running > > > (qemu) migrate_set_parameter mode cpr-exec > > > (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state > > > (qemu) migrate -d file:vm.state > > > (qemu) QEMU 10.1.50 monitor - type 'help' for more information > > > (qemu) info status > > > VM status: running > > > > > > Steve Sistare (9): > > > migration: multi-mode notifier > > > migration: add cpr_walk_fd > > > oslib: qemu_clear_cloexec > > > vl: helper to request exec > > > migration: cpr-exec-command parameter > > > migration: cpr-exec save and load > > > migration: cpr-exec mode > > > migration: cpr-exec docs > > > vfio: cpr-exec mode > > > > The other thing is, as Vladimir is working on (looks like) a cleaner way of > > passing FDs fully relying on unix sockets, I want to understand better on > > the relationships of his work and the exec model. > > His work is based on my work -- the ability to embed a file descriptor in a > migration stream with a VMSTATE_FD declaration -- so it is compatible. > > The cpr-exec series preserves VMSTATE_FD across exec by remembering the fd > integer and embedding that in the data stream. See the changes in vmstate-types.c > in [PATCH V3 7/9] migration: cpr-exec mode. > > Thus cpr-exec will still preserve tap devices via Vladimir's code. > > I still personally think we should always stick with unix sockets, but I'm > > open to be convinced on above limitations. If exec is better than > > cpr-transfer in any way, the hope is more people can and should adopt it. > > Various people and companies have expressed interest in CPR and want to explore > cpr-exec. Vladimir was one, he chose transfer instead, and that is fine, but > give people the option. And Oracle continues to use cpr-exec mode. How does cpr-exec guarantees everything will go smoothly with no failure after the exec? Essentially, this is Vladimir's question 1. 
Feel free to answer there, because there's also question 2 (which we used to cover some but maybe not as much). The other thing I don't remember if we discussed, on how cpr-exec manages device hotplugs. Say, what happens if there are devices hot plugged (via QMP) then cpr-exec migration happens? Does cpr-exec cmdline needs to convert all QMP hot-plugged devices into cmdlines and append them? How to guarantee src/dst device topology match exactly the same with the new cmdline? > > There is no downside to supporting cpr-exec mode. It is astonishing how much > code is shared by the cpr-transfer and cpr-exec modes. Most of the code in > this series is factored into specific cpr-exec files and functions, code that > will never run for any other reason. There are very few conditionals in common > code that do something different for exec mode. > > We also have no answer yet on how cpr-exec can resolve container world with > > seccomp forbidding exec. I guess that's a no-go. It's definitely a > > downside instead. Better mention that in the cover letter. > The key is limiting the contents of the container, so exec only has a limited > and known safe set of things to target. I'll add that to the cover letter. Thanks. -- Peter Xu
On 09.09.25 18:24, Peter Xu wrote: > On Tue, Sep 09, 2025 at 10:36:16AM -0400, Steven Sistare wrote: >> On 9/5/2025 12:48 PM, Peter Xu wrote: >>> Add Vladimir and Dan. >>> >>> On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote: >>>> This patch series adds the live migration cpr-exec mode. >>>> >>>> The new user-visible interfaces are: >>>> * cpr-exec (MigMode migration parameter) >>>> * cpr-exec-command (migration parameter) >>>> >>>> cpr-exec mode is similar in most respects to cpr-transfer mode, with the >>>> primary difference being that old QEMU directly exec's new QEMU. The user >>>> specifies the command to exec new QEMU in the migration parameter >>>> cpr-exec-command. >>>> >>>> Why? >>>> >>>> In a containerized QEMU environment, cpr-exec reuses an existing QEMU >>>> container and its assigned resources. By contrast, cpr-transfer mode >>>> requires a new container to be created on the same host as the target of >>>> the CPR operation. Resources must be reserved for the new container, while >>>> the old container still reserves resources until the operation completes. >>>> Avoiding over commitment requires extra work in the management layer. >>> >>> Can we spell out what are these resources? >>> >>> CPR definitely relies on completely shared memory. That's already not a >>> concern. >>> >>> CPR resolves resources that are bound to devices like VFIO by passing over >>> FDs, these are not over commited either. >>> >>> Is it accounting QEMU/KVM process overhead? That would really be trivial, >>> IMHO, but maybe something else? >> >> Accounting is one issue, and it is not trivial. Another is arranging exclusive >> use of a set of CPUs, the same set for the old and new container, concurrently. >> Another is avoiding namespace conflicts, the kind that make localhost migration >> difficult. >> >>>> This is one reason why a cloud provider may prefer cpr-exec. A second reason >>>> is that the container may include agents with their own connections to the >>>> outside world, and such connections remain intact if the container is reused. >>> >>> We discussed about this one. Personally I still cannot understand why this >>> is a concern if the agents can be trivially started as a new instance. But >>> I admit I may not know the whole picture. To me, the above point is more >>> persuasive, but I'll need to understand which part that is over-commited >>> that can be a problem. >> >> Agents can be restarted, but that would sever the connection to the outside >> world. With cpr-transfer or any local migration, you would need agents >> outside of old and new containers that persist. >> >> With cpr-exec, connections can be preserved without requiring the end user >> to reconnect, and can be done trivially, by preserving chardevs. With that >> support in qemu, the management layer does nothing extra to preserve them. >> chardev support is not part of this series but is part of my vision, >> and makes exec mode even more compelling. >> >> Management layers have a lot of code and complexity to manage live migration, >> resources, and connections. It requires modification to support cpr-transfer. >> All that can be bypassed with exec mode. Less complexity, less maintainance, >> and fewer points of failure. I know this because I implemented exec mode in >> OCI at Oracle, and we use it in production. > > I wonders how this part works in Vladimir's use case. For now, we don't have live-update with fd-passing, I'm working on it. But we do have working live-update with starting second QEMU process. 
I hope, that finally support for fd-passing in management layer will only need three steps: - use unix-socket as migration channel - enable new migration capability (and probably some options to enable feature per device) - opt-out the code [1], which implements logic of switching TAP and disk for new QEMU instance And I don't think we want to remove this logic [1] completely, as we may want to do normal migration without fds at some moment, for example to change the backend. Or to jump-over some theoretical future problems with fds passing (that's a new experimental feature, there may be bugs, or even future incompatible changes (until it become stable). cpr-transfer needs additional steps: - more complex interface to setup two migration channels - tricky logic about unavailable QMP for target process at start Still, that's possible. > >>> After all, cloud hosts should preserve some extra memory anyway to make >>> sure dynamic resources allocations all the time (e.g., when live migration >>> starts, KVM pgtables can drastically increase if huge pages are enabled, >>> for PAGE_SIZE trackings), I assumed the over-commit portion should be less >>> that those.. and when it's also temporary (src QEMU will release all >>> resources after live upgrade) then it looks manageable. >> >>>> How? >>>> >>>> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag, >>>> and by sending the unique name and value of each descriptor to new QEMU >>>> via CPR state. >>>> >>>> CPR state cannot be sent over the normal migration channel, because devices >>>> and backends are created prior to reading the channel, so this mode sends >>>> CPR state over a second migration channel that is not visible to the user. >>>> New QEMU reads the second channel prior to creating devices or backends. >>>> >>>> The exec itself is trivial. After writing to the migration channels, the >>>> migration code calls a new main-loop hook to perform the exec. >>>> >>>> Example: >>>> >>>> In this example, we simply restart the same version of QEMU, but in >>>> a real scenario one would use a new QEMU binary path in cpr-exec-command. >>>> >>>> # qemu-kvm -monitor stdio >>>> -object memory-backend-memfd,id=ram0,size=1G >>>> -machine memory-backend=ram0 -machine aux-ram-share=on ... >>>> >>>> QEMU 10.1.50 monitor - type 'help' for more information >>>> (qemu) info status >>>> VM status: running >>>> (qemu) migrate_set_parameter mode cpr-exec >>>> (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state >>>> (qemu) migrate -d file:vm.state >>>> (qemu) QEMU 10.1.50 monitor - type 'help' for more information >>>> (qemu) info status >>>> VM status: running >>>> >>>> Steve Sistare (9): >>>> migration: multi-mode notifier >>>> migration: add cpr_walk_fd >>>> oslib: qemu_clear_cloexec >>>> vl: helper to request exec >>>> migration: cpr-exec-command parameter >>>> migration: cpr-exec save and load >>>> migration: cpr-exec mode >>>> migration: cpr-exec docs >>>> vfio: cpr-exec mode >>> >>> The other thing is, as Vladimir is working on (looks like) a cleaner way of >>> passing FDs fully relying on unix sockets, I want to understand better on >>> the relationships of his work and the exec model. >> >> His work is based on my work -- the ability to embed a file descriptor in a >> migration stream with a VMSTATE_FD declaration -- so it is compatible. >> >> The cpr-exec series preserves VMSTATE_FD across exec by remembering the fd >> integer and embedding that in the data stream. 
See the changes in vmstate-types.c >> in [PATCH V3 7/9] migration: cpr-exec mode.
>>
>> Thus cpr-exec will still preserve tap devices via Vladimir's code.
>>> I still personally think we should always stick with unix sockets, but I'm
>>> open to be convinced on above limitations. If exec is better than
>>> cpr-transfer in any way, the hope is more people can and should adopt it.
>>
>> Various people and companies have expressed interest in CPR and want to explore
>> cpr-exec. Vladimir was one, he chose transfer instead, and that is fine, but
>> give people the option. And Oracle continues to use cpr-exec mode.
>
> How does cpr-exec guarantees everything will go smoothly with no failure
> after the exec? Essentially, this is Vladimir's question 1. Feel free to
> answer there, because there's also question 2 (which we used to cover some
> but maybe not as much).
>
> The other thing I don't remember if we discussed, on how cpr-exec manages
> device hotplugs. Say, what happens if there are devices hot plugged (via
> QMP) then cpr-exec migration happens?
>
> Does cpr-exec cmdline needs to convert all QMP hot-plugged devices into
> cmdlines and append them? How to guarantee src/dst device topology match
> exactly the same with the new cmdline?
>

Seems we discussed this. As I understand, it should work the same way as for normal
migration: we add -incoming defer to cpr-exec-command, and after the exec we can add
our infrastructure through the QMP interface, and then run "migrate-incoming".

Still, that would be done during downtime, unlike cpr-transfer, where the source is
still running during target QMP setup. So exec mode works more like migrating to a
file and then restoring from it.

Maybe we could have a mediator program which receives the migration stream and fds
from source QEMU; then we start the new QEMU process in the same container, and it
gets the incoming migration stream together with the fds from the mediator. Probably
target QEMU itself could act as this mediator, but we would need an option to somehow
buffer the incoming migration state (together with fds) and only start applying it
once source QEMU has closed. How much would exec mode differ from such a setup?

>>
>> There is no downside to supporting cpr-exec mode. It is astonishing how much
>> code is shared by the cpr-transfer and cpr-exec modes. Most of the code in
>> this series is factored into specific cpr-exec files and functions, code that
>> will never run for any other reason. There are very few conditionals in common
>> code that do something different for exec mode.
>>> We also have no answer yet on how cpr-exec can resolve container world with
>>> seccomp forbidding exec. I guess that's a no-go. It's definitely a
>>> downside instead. Better mention that in the cover letter.
>> The key is limiting the contents of the container, so exec only has a limited
>> and known safe set of things to target. I'll add that to the cover letter.
>
> Thanks.
>

-- 
Best regards,
Vladimir
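As a concrete illustration of the deferred flow Vladimir describes, the monitor
sequence might look like the following. The hot-plugged netdev/device shown are
hypothetical examples of objects added via QMP after the original launch; the exact
replay is up to the management layer.

  Old QEMU, before the exec:
    (qemu) migrate_set_parameter mode cpr-exec
    (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -S -incoming defer
    (qemu) migrate -d file:vm.state

  New QEMU monitor, after the exec:
    (qemu) netdev_add user,id=hot0
    (qemu) device_add virtio-net-pci,netdev=hot0,id=nic1
    (qemu) migrate_incoming file:vm.state
    (qemu) cont

As Vladimir notes, everything after the exec happens during downtime, which is the
main difference from cpr-transfer, where the target's QMP setup can overlap with the
source still running.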
On 9/9/2025 11:24 AM, Peter Xu wrote: > On Tue, Sep 09, 2025 at 10:36:16AM -0400, Steven Sistare wrote: >> On 9/5/2025 12:48 PM, Peter Xu wrote: >>> Add Vladimir and Dan. >>> >>> On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote: >>>> This patch series adds the live migration cpr-exec mode. >>>> >>>> The new user-visible interfaces are: >>>> * cpr-exec (MigMode migration parameter) >>>> * cpr-exec-command (migration parameter) >>>> >>>> cpr-exec mode is similar in most respects to cpr-transfer mode, with the >>>> primary difference being that old QEMU directly exec's new QEMU. The user >>>> specifies the command to exec new QEMU in the migration parameter >>>> cpr-exec-command. >>>> >>>> Why? >>>> >>>> In a containerized QEMU environment, cpr-exec reuses an existing QEMU >>>> container and its assigned resources. By contrast, cpr-transfer mode >>>> requires a new container to be created on the same host as the target of >>>> the CPR operation. Resources must be reserved for the new container, while >>>> the old container still reserves resources until the operation completes. >>>> Avoiding over commitment requires extra work in the management layer. >>> >>> Can we spell out what are these resources? >>> >>> CPR definitely relies on completely shared memory. That's already not a >>> concern. >>> >>> CPR resolves resources that are bound to devices like VFIO by passing over >>> FDs, these are not over commited either. >>> >>> Is it accounting QEMU/KVM process overhead? That would really be trivial, >>> IMHO, but maybe something else? >> >> Accounting is one issue, and it is not trivial. Another is arranging exclusive >> use of a set of CPUs, the same set for the old and new container, concurrently. >> Another is avoiding namespace conflicts, the kind that make localhost migration >> difficult. >> >>>> This is one reason why a cloud provider may prefer cpr-exec. A second reason >>>> is that the container may include agents with their own connections to the >>>> outside world, and such connections remain intact if the container is reused. >>> >>> We discussed about this one. Personally I still cannot understand why this >>> is a concern if the agents can be trivially started as a new instance. But >>> I admit I may not know the whole picture. To me, the above point is more >>> persuasive, but I'll need to understand which part that is over-commited >>> that can be a problem. >> >> Agents can be restarted, but that would sever the connection to the outside >> world. With cpr-transfer or any local migration, you would need agents >> outside of old and new containers that persist. >> >> With cpr-exec, connections can be preserved without requiring the end user >> to reconnect, and can be done trivially, by preserving chardevs. With that >> support in qemu, the management layer does nothing extra to preserve them. >> chardev support is not part of this series but is part of my vision, >> and makes exec mode even more compelling. >> >> Management layers have a lot of code and complexity to manage live migration, >> resources, and connections. It requires modification to support cpr-transfer. >> All that can be bypassed with exec mode. Less complexity, less maintainance, >> and fewer points of failure. I know this because I implemented exec mode in >> OCI at Oracle, and we use it in production. > > I wonders how this part works in Vladimir's use case. 
> >>> After all, cloud hosts should preserve some extra memory anyway to make >>> sure dynamic resources allocations all the time (e.g., when live migration >>> starts, KVM pgtables can drastically increase if huge pages are enabled, >>> for PAGE_SIZE trackings), I assumed the over-commit portion should be less >>> that those.. and when it's also temporary (src QEMU will release all >>> resources after live upgrade) then it looks manageable. >> >>>> How? >>>> >>>> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag, >>>> and by sending the unique name and value of each descriptor to new QEMU >>>> via CPR state. >>>> >>>> CPR state cannot be sent over the normal migration channel, because devices >>>> and backends are created prior to reading the channel, so this mode sends >>>> CPR state over a second migration channel that is not visible to the user. >>>> New QEMU reads the second channel prior to creating devices or backends. >>>> >>>> The exec itself is trivial. After writing to the migration channels, the >>>> migration code calls a new main-loop hook to perform the exec. >>>> >>>> Example: >>>> >>>> In this example, we simply restart the same version of QEMU, but in >>>> a real scenario one would use a new QEMU binary path in cpr-exec-command. >>>> >>>> # qemu-kvm -monitor stdio >>>> -object memory-backend-memfd,id=ram0,size=1G >>>> -machine memory-backend=ram0 -machine aux-ram-share=on ... >>>> >>>> QEMU 10.1.50 monitor - type 'help' for more information >>>> (qemu) info status >>>> VM status: running >>>> (qemu) migrate_set_parameter mode cpr-exec >>>> (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state >>>> (qemu) migrate -d file:vm.state >>>> (qemu) QEMU 10.1.50 monitor - type 'help' for more information >>>> (qemu) info status >>>> VM status: running >>>> >>>> Steve Sistare (9): >>>> migration: multi-mode notifier >>>> migration: add cpr_walk_fd >>>> oslib: qemu_clear_cloexec >>>> vl: helper to request exec >>>> migration: cpr-exec-command parameter >>>> migration: cpr-exec save and load >>>> migration: cpr-exec mode >>>> migration: cpr-exec docs >>>> vfio: cpr-exec mode >>> >>> The other thing is, as Vladimir is working on (looks like) a cleaner way of >>> passing FDs fully relying on unix sockets, I want to understand better on >>> the relationships of his work and the exec model. >> >> His work is based on my work -- the ability to embed a file descriptor in a >> migration stream with a VMSTATE_FD declaration -- so it is compatible. >> >> The cpr-exec series preserves VMSTATE_FD across exec by remembering the fd >> integer and embedding that in the data stream. See the changes in vmstate-types.c >> in [PATCH V3 7/9] migration: cpr-exec mode. >> >> Thus cpr-exec will still preserve tap devices via Vladimir's code. >>> I still personally think we should always stick with unix sockets, but I'm >>> open to be convinced on above limitations. If exec is better than >>> cpr-transfer in any way, the hope is more people can and should adopt it. >> >> Various people and companies have expressed interest in CPR and want to explore >> cpr-exec. Vladimir was one, he chose transfer instead, and that is fine, but >> give people the option. And Oracle continues to use cpr-exec mode. > > How does cpr-exec guarantees everything will go smoothly with no failure > after the exec? Essentially, this is Vladimir's question 1. Live migration can fail if dirty memory copy does not converge. CPR does not. 
cpr-transfer can fail if it fails to create a new container. cpr-exec cannot.
cpr-transfer can fail to allocate resources. cpr-exec needs less.

cpr-exec failure is almost always due to a QEMU bug. For example, a new feature
has been added to new QEMU, and is *not* forced to false in a compatibility entry
for the old machine model. We do our best to find and fix those before going into
production. In production, the success rate is high. That is one reason I like the
mode so much.

> Feel free to
> answer there, because there's also question 2 (which we used to cover some
> but maybe not as much).

Question 2 is about minimizing downtime by starting new QEMU while old QEMU
is still running. That is true, but the savings are small.

> The other thing I don't remember if we discussed, on how cpr-exec manages
> device hotplugs. Say, what happens if there are devices hot plugged (via
> QMP) then cpr-exec migration happens?

One method: start new qemu with the original command-line arguments plus -S, then
mgmt re-sends the hot plug commands to the qemu monitor. Same as for live
migration.

> Does cpr-exec cmdline needs to convert all QMP hot-plugged devices into
> cmdlines and append them?

That also works, and is a technique I have used to reduce guest pause time.

> How to guarantee src/dst device topology match
> exactly the same with the new cmdline?

That is up to the mgmt layer, to know how QEMU was originally started, and
what has been hot plugged afterwards. The fast qom-list-get command that
I recently added can help here.

- Steve

>> There is no downside to supporting cpr-exec mode. It is astonishing how much
>> code is shared by the cpr-transfer and cpr-exec modes. Most of the code in
>> this series is factored into specific cpr-exec files and functions, code that
>> will never run for any other reason. There are very few conditionals in common
>> code that do something different for exec mode.
>>> We also have no answer yet on how cpr-exec can resolve container world with
>>> seccomp forbidding exec. I guess that's a no-go. It's definitely a
>>> downside instead. Better mention that in the cover letter.
>> The key is limiting the contents of the container, so exec only has a limited
>> and known safe set of things to target. I'll add that to the cover letter.
> Thanks.
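The second method Steve mentions, folding hot-plugged devices back into the command
line, might look like this; the netdev/device arguments are hypothetical stand-ins
for whatever was plugged via QMP after the original launch, and <original args> is a
placeholder for the unmodified launch arguments.

  (qemu) migrate_set_parameter cpr-exec-command qemu-kvm <original args> -netdev user,id=hot0 -device virtio-net-pci,netdev=hot0,id=nic1 -incoming file:vm.state
  (qemu) migrate -d file:vm.state

Keeping that appended list in sync with what was actually hot plugged is the
management layer's job, as Steve says; qom-list-get can help verify the resulting
topology.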
On Tue, Sep 09, 2025 at 12:03:11PM -0400, Steven Sistare wrote: > On 9/9/2025 11:24 AM, Peter Xu wrote: > > On Tue, Sep 09, 2025 at 10:36:16AM -0400, Steven Sistare wrote: > > > On 9/5/2025 12:48 PM, Peter Xu wrote: > > > > Add Vladimir and Dan. > > > > > > > > On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote: > > > > > This patch series adds the live migration cpr-exec mode. > > > > > > > > > > The new user-visible interfaces are: > > > > > * cpr-exec (MigMode migration parameter) > > > > > * cpr-exec-command (migration parameter) > > > > > > > > > > cpr-exec mode is similar in most respects to cpr-transfer mode, with the > > > > > primary difference being that old QEMU directly exec's new QEMU. The user > > > > > specifies the command to exec new QEMU in the migration parameter > > > > > cpr-exec-command. > > > > > > > > > > Why? > > > > > > > > > > In a containerized QEMU environment, cpr-exec reuses an existing QEMU > > > > > container and its assigned resources. By contrast, cpr-transfer mode > > > > > requires a new container to be created on the same host as the target of > > > > > the CPR operation. Resources must be reserved for the new container, while > > > > > the old container still reserves resources until the operation completes. > > > > > Avoiding over commitment requires extra work in the management layer. > > > > > > > > Can we spell out what are these resources? > > > > > > > > CPR definitely relies on completely shared memory. That's already not a > > > > concern. > > > > > > > > CPR resolves resources that are bound to devices like VFIO by passing over > > > > FDs, these are not over commited either. > > > > > > > > Is it accounting QEMU/KVM process overhead? That would really be trivial, > > > > IMHO, but maybe something else? > > > > > > Accounting is one issue, and it is not trivial. Another is arranging exclusive > > > use of a set of CPUs, the same set for the old and new container, concurrently. > > > Another is avoiding namespace conflicts, the kind that make localhost migration > > > difficult. > > > > > > > > This is one reason why a cloud provider may prefer cpr-exec. A second reason > > > > > is that the container may include agents with their own connections to the > > > > > outside world, and such connections remain intact if the container is reused. > > > > > > > > We discussed about this one. Personally I still cannot understand why this > > > > is a concern if the agents can be trivially started as a new instance. But > > > > I admit I may not know the whole picture. To me, the above point is more > > > > persuasive, but I'll need to understand which part that is over-commited > > > > that can be a problem. > > > > > > Agents can be restarted, but that would sever the connection to the outside > > > world. With cpr-transfer or any local migration, you would need agents > > > outside of old and new containers that persist. > > > > > > With cpr-exec, connections can be preserved without requiring the end user > > > to reconnect, and can be done trivially, by preserving chardevs. With that > > > support in qemu, the management layer does nothing extra to preserve them. > > > chardev support is not part of this series but is part of my vision, > > > and makes exec mode even more compelling. > > > > > > Management layers have a lot of code and complexity to manage live migration, > > > resources, and connections. It requires modification to support cpr-transfer. > > > All that can be bypassed with exec mode. 
Less complexity, less maintainance, > > > and fewer points of failure. I know this because I implemented exec mode in > > > OCI at Oracle, and we use it in production. > > > > I wonders how this part works in Vladimir's use case. > > > > > > After all, cloud hosts should preserve some extra memory anyway to make > > > > sure dynamic resources allocations all the time (e.g., when live migration > > > > starts, KVM pgtables can drastically increase if huge pages are enabled, > > > > for PAGE_SIZE trackings), I assumed the over-commit portion should be less > > > > that those.. and when it's also temporary (src QEMU will release all > > > > resources after live upgrade) then it looks manageable. >> > > > > > How? > > > > > > > > > > cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag, > > > > > and by sending the unique name and value of each descriptor to new QEMU > > > > > via CPR state. > > > > > > > > > > CPR state cannot be sent over the normal migration channel, because devices > > > > > and backends are created prior to reading the channel, so this mode sends > > > > > CPR state over a second migration channel that is not visible to the user. > > > > > New QEMU reads the second channel prior to creating devices or backends. > > > > > > > > > > The exec itself is trivial. After writing to the migration channels, the > > > > > migration code calls a new main-loop hook to perform the exec. > > > > > > > > > > Example: > > > > > > > > > > In this example, we simply restart the same version of QEMU, but in > > > > > a real scenario one would use a new QEMU binary path in cpr-exec-command. > > > > > > > > > > # qemu-kvm -monitor stdio > > > > > -object memory-backend-memfd,id=ram0,size=1G > > > > > -machine memory-backend=ram0 -machine aux-ram-share=on ... > > > > > > > > > > QEMU 10.1.50 monitor - type 'help' for more information > > > > > (qemu) info status > > > > > VM status: running > > > > > (qemu) migrate_set_parameter mode cpr-exec > > > > > (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state > > > > > (qemu) migrate -d file:vm.state > > > > > (qemu) QEMU 10.1.50 monitor - type 'help' for more information > > > > > (qemu) info status > > > > > VM status: running > > > > > > > > > > Steve Sistare (9): > > > > > migration: multi-mode notifier > > > > > migration: add cpr_walk_fd > > > > > oslib: qemu_clear_cloexec > > > > > vl: helper to request exec > > > > > migration: cpr-exec-command parameter > > > > > migration: cpr-exec save and load > > > > > migration: cpr-exec mode > > > > > migration: cpr-exec docs > > > > > vfio: cpr-exec mode > > > > > > > > The other thing is, as Vladimir is working on (looks like) a cleaner way of > > > > passing FDs fully relying on unix sockets, I want to understand better on > > > > the relationships of his work and the exec model. > > > > > > His work is based on my work -- the ability to embed a file descriptor in a > > > migration stream with a VMSTATE_FD declaration -- so it is compatible. > > > > > > The cpr-exec series preserves VMSTATE_FD across exec by remembering the fd > > > integer and embedding that in the data stream. See the changes in vmstate-types.c > > > in [PATCH V3 7/9] migration: cpr-exec mode. > > > > > > Thus cpr-exec will still preserve tap devices via Vladimir's code. > > > > I still personally think we should always stick with unix sockets, but I'm > > > > open to be convinced on above limitations. 
If exec is better than > > > > cpr-transfer in any way, the hope is more people can and should adopt it.

> > > Various people and companies have expressed interest in CPR and want to explore
> > > cpr-exec. Vladimir was one, he chose transfer instead, and that is fine, but
> > > give people the option. And Oracle continues to use cpr-exec mode.
> >
> > How does cpr-exec guarantees everything will go smoothly with no failure
> > after the exec? Essentially, this is Vladimir's question 1.
>
> Live migration can fail if dirty memory copy does not converge. CPR does not.

As we're comparing cpr-transfer and cpr-exec, this one doesn't really count, AFAIU.

> cpr-transfer can fail if it fails to create a new container. cpr-exec cannot.
> cpr-transfer can fail to allocate resources. cpr-exec needs less.

These two could indeed happen on heavily occupied hosts, but is it really that
common an issue once the guest memory itself is taken out of the picture?

> cpr-exec failure is almost always due to a QEMU bug. For example, a new feature
> has been added to new QEMU, and is *not* forced to false in a compatibility entry
> for the old machine model. We do our best to find and fix those before going into
> production. In production, the success rate is high. That is one reason I like the
> mode so much.

Yes, but this is still a major issue. The problem is I don't think we have a good
way to provide 100% coverage of the code base for all kinds of migrations. After
all, we have tons of needed() fields in VMSDs, we need to always be prepared for
the migration stream to change from time to time with exactly the same device
setup, and some of them may be prone to put() failures on the other side.

After all, live migration was designed to cope with that, so at least the VM won't
crash on src if anything happens. Precopy always does that, and we're trying to
make postcopy do the same, which Juraj is working on, so that postcopy can FAIL
and roll back to src too if the device state doesn't apply cleanly. It's still not
uncommon for a guest OS / driver behavior change to cause corner-case migration
failures that only show up when applying the states. That's IMHO a high risk even
if a low probability.

> > Feel free to
> > answer there, because there's also question 2 (which we used to cover some
> > but maybe not as much).
>
> Question 2 is about minimizing downtime by starting new QEMU while old QEMU
> is still running. That is true, but the savings are small.

I thought we discussed this, and there are at least the two major things below
that will increase downtime (either directly accounted into downtime, or by
slowing down vcpus later):

- Process pgtable, aka, QEMU's view of guest mem
- EPT pgtable, aka, vCPU's view of guest mem

Populating these normally takes time when the VM is huge, while cpr-transfer can
still benefit from pre-population before the switchover. IIUC that's a known
issue, but please correct me if I remembered it wrong.

I think this issue is more severe with larger VMs, which is a trade-off. It's just
that I don't know what else might be relevant. Personally I don't think this is a
blocker for cpr-exec, but we should IMHO record the differences. It would be best,
IMHO, to have a section in cpr.rst to discuss this, helping users decide which to
choose when both benefit from CPR in general.

Meanwhile, just to mention: a unit test for cpr-exec is still missing.

> > The other thing I don't remember if we discussed, on how cpr-exec manages
> > device hotplugs. Say, what happens if there are devices hot plugged (via
> > QMP) then cpr-exec migration happens?
>
> One method: start new qemu with the original command-line arguments plus -S, then
> mgmt re-sends the hot plug commands to the qemu monitor. Same as for live
> migration.
>
> > Does cpr-exec cmdline needs to convert all QMP hot-plugged devices into
> > cmdlines and append them?
>
> That also works, and is a technique I have used to reduce guest pause time.
>
> > How to guarantee src/dst device topology match
> > exactly the same with the new cmdline?
>
> That is up to the mgmt layer, to know how QEMU was originally started, and
> what has been hot plugged afterwards. The fast qom-list-get command that
> I recently added can help here.

I see. If you think that is the best way to consume cpr-exec, would you add a
small section into the doc patch for it as well?

-- 
Peter Xu
On 9/9/2025 2:37 PM, Peter Xu wrote: > On Tue, Sep 09, 2025 at 12:03:11PM -0400, Steven Sistare wrote: >> On 9/9/2025 11:24 AM, Peter Xu wrote: >>> On Tue, Sep 09, 2025 at 10:36:16AM -0400, Steven Sistare wrote: >>>> On 9/5/2025 12:48 PM, Peter Xu wrote: >>>>> Add Vladimir and Dan. >>>>> >>>>> On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote: >>>>>> This patch series adds the live migration cpr-exec mode. >>>>>> >>>>>> The new user-visible interfaces are: >>>>>> * cpr-exec (MigMode migration parameter) >>>>>> * cpr-exec-command (migration parameter) >>>>>> >>>>>> cpr-exec mode is similar in most respects to cpr-transfer mode, with the >>>>>> primary difference being that old QEMU directly exec's new QEMU. The user >>>>>> specifies the command to exec new QEMU in the migration parameter >>>>>> cpr-exec-command. >>>>>> >>>>>> Why? >>>>>> >>>>>> In a containerized QEMU environment, cpr-exec reuses an existing QEMU >>>>>> container and its assigned resources. By contrast, cpr-transfer mode >>>>>> requires a new container to be created on the same host as the target of >>>>>> the CPR operation. Resources must be reserved for the new container, while >>>>>> the old container still reserves resources until the operation completes. >>>>>> Avoiding over commitment requires extra work in the management layer. >>>>> >>>>> Can we spell out what are these resources? >>>>> >>>>> CPR definitely relies on completely shared memory. That's already not a >>>>> concern. >>>>> >>>>> CPR resolves resources that are bound to devices like VFIO by passing over >>>>> FDs, these are not over commited either. >>>>> >>>>> Is it accounting QEMU/KVM process overhead? That would really be trivial, >>>>> IMHO, but maybe something else? >>>> >>>> Accounting is one issue, and it is not trivial. Another is arranging exclusive >>>> use of a set of CPUs, the same set for the old and new container, concurrently. >>>> Another is avoiding namespace conflicts, the kind that make localhost migration >>>> difficult. >>>> >>>>>> This is one reason why a cloud provider may prefer cpr-exec. A second reason >>>>>> is that the container may include agents with their own connections to the >>>>>> outside world, and such connections remain intact if the container is reused. >>>>> >>>>> We discussed about this one. Personally I still cannot understand why this >>>>> is a concern if the agents can be trivially started as a new instance. But >>>>> I admit I may not know the whole picture. To me, the above point is more >>>>> persuasive, but I'll need to understand which part that is over-commited >>>>> that can be a problem. >>>> >>>> Agents can be restarted, but that would sever the connection to the outside >>>> world. With cpr-transfer or any local migration, you would need agents >>>> outside of old and new containers that persist. >>>> >>>> With cpr-exec, connections can be preserved without requiring the end user >>>> to reconnect, and can be done trivially, by preserving chardevs. With that >>>> support in qemu, the management layer does nothing extra to preserve them. >>>> chardev support is not part of this series but is part of my vision, >>>> and makes exec mode even more compelling. >>>> >>>> Management layers have a lot of code and complexity to manage live migration, >>>> resources, and connections. It requires modification to support cpr-transfer. >>>> All that can be bypassed with exec mode. Less complexity, less maintainance, >>>> and fewer points of failure. 
I know this because I implemented exec mode in >>>> OCI at Oracle, and we use it in production. >>> >>> I wonders how this part works in Vladimir's use case. >>> >>>>> After all, cloud hosts should preserve some extra memory anyway to make >>>>> sure dynamic resources allocations all the time (e.g., when live migration >>>>> starts, KVM pgtables can drastically increase if huge pages are enabled, >>>>> for PAGE_SIZE trackings), I assumed the over-commit portion should be less >>>>> that those.. and when it's also temporary (src QEMU will release all >>>>> resources after live upgrade) then it looks manageable. >> >>>>>> How? >>>>>> >>>>>> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag, >>>>>> and by sending the unique name and value of each descriptor to new QEMU >>>>>> via CPR state. >>>>>> >>>>>> CPR state cannot be sent over the normal migration channel, because devices >>>>>> and backends are created prior to reading the channel, so this mode sends >>>>>> CPR state over a second migration channel that is not visible to the user. >>>>>> New QEMU reads the second channel prior to creating devices or backends. >>>>>> >>>>>> The exec itself is trivial. After writing to the migration channels, the >>>>>> migration code calls a new main-loop hook to perform the exec. >>>>>> >>>>>> Example: >>>>>> >>>>>> In this example, we simply restart the same version of QEMU, but in >>>>>> a real scenario one would use a new QEMU binary path in cpr-exec-command. >>>>>> >>>>>> # qemu-kvm -monitor stdio >>>>>> -object memory-backend-memfd,id=ram0,size=1G >>>>>> -machine memory-backend=ram0 -machine aux-ram-share=on ... >>>>>> >>>>>> QEMU 10.1.50 monitor - type 'help' for more information >>>>>> (qemu) info status >>>>>> VM status: running >>>>>> (qemu) migrate_set_parameter mode cpr-exec >>>>>> (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state >>>>>> (qemu) migrate -d file:vm.state >>>>>> (qemu) QEMU 10.1.50 monitor - type 'help' for more information >>>>>> (qemu) info status >>>>>> VM status: running >>>>>> >>>>>> Steve Sistare (9): >>>>>> migration: multi-mode notifier >>>>>> migration: add cpr_walk_fd >>>>>> oslib: qemu_clear_cloexec >>>>>> vl: helper to request exec >>>>>> migration: cpr-exec-command parameter >>>>>> migration: cpr-exec save and load >>>>>> migration: cpr-exec mode >>>>>> migration: cpr-exec docs >>>>>> vfio: cpr-exec mode >>>>> >>>>> The other thing is, as Vladimir is working on (looks like) a cleaner way of >>>>> passing FDs fully relying on unix sockets, I want to understand better on >>>>> the relationships of his work and the exec model. >>>> >>>> His work is based on my work -- the ability to embed a file descriptor in a >>>> migration stream with a VMSTATE_FD declaration -- so it is compatible. >>>> >>>> The cpr-exec series preserves VMSTATE_FD across exec by remembering the fd >>>> integer and embedding that in the data stream. See the changes in vmstate-types.c >>>> in [PATCH V3 7/9] migration: cpr-exec mode. >>>> >>>> Thus cpr-exec will still preserve tap devices via Vladimir's code. >>>>> I still personally think we should always stick with unix sockets, but I'm >>>>> open to be convinced on above limitations. If exec is better than >>>>> cpr-transfer in any way, the hope is more people can and should adopt it. >>>> >>>> Various people and companies have expressed interest in CPR and want to explore >>>> cpr-exec. Vladimir was one, he chose transfer instead, and that is fine, but >>>> give people the option. 
And Oracle continues to use cpr-exec mode. >>> >>> How does cpr-exec guarantees everything will go smoothly with no failure >>> after the exec? Essentially, this is Vladimir's question 1. >> >> Live migration can fail if dirty memory copy does not converge. CPR does not. > > As we're comparing cpr-transfer and cpr-exec, this one doesn't really count, AFAIU. > >> cpr-transfer can fail if it fails to create a new container. cpr-exec cannot. >> cpr-transfer can fail to allocate resources. cpr-exec needs less. > > These two could happen in very occpied hosts indeed, but is it really that > common an issue when ignoring the whole guest memory section after all? Conventional wisdom holds that in migration scenarios, we must have the option to fall back to the source if the target fails. In all the above, I point out the reasons behind this wisdom, and that many of those reasons do not apply for cpr-exec. >> cpr-exec failure is almost always due to a QEMU bug. For example, a new feature >> has been added to new QEMU, and is *not* forced to false in a compatibility entry >> for the old machine model. We do our best to find and fix those before going into >> production. In production, the success rate is high. That is one reason I like the >> mode so much. > > Yes, but this is still a major issue. The problem is I don't think we have > good way to provide 100% coverage on the code base covering all kinds of > migrations. > > After all, we have tons of needed() fields in VMSD, we need to always be > prepared that the migration stream can change from time to time with > exactly the same device setup, and some of them may prone to put() failures > on the other side. > > After all, live migration was designed to be fine with such, so at least VM > won't crash on src if anything happens. > > Precopy always does that, we're trying to make postcopy do the same, which > Juraj is working on, so that postcopy can FAIL and rollback to src too if > device state doesn't apply all fine. > > It's still not uncommon to have guest OS / driver behavior change causing > some corner case migration failures but only when applying the states. > > That's IMHO a high risk even if low possibility. No question, bugs are a risk and will occur. >>> Feel free to >>> answer there, because there's also question 2 (which we used to cover some >>> but maybe not as much). >> >> Question 2 is about minimizing downtime by starting new QEMU while old QEMU >> is still running. That is true, but the savings are small. > > I thought we discussed about this, and it should be known to have at least > below two major part of things that will increase downtime (either directly > accounted into downtime, or slow down vcpus later)? > > - Process pgtable, aka, QEMU's view of guest mem > - EPT pgtable, aka, vCPU's view of guest mem > > Populating these should normally take time when VM becomes huge, while > cpr-transfer can still benefit on pre-populations before switchover. > > IIUC that's a known issue, but please correct me if I remembered it wrong. > I think it means this issue is more severe with larger VMs, which is a > trade-off. It's just that I don't know what else might be relevant. > > Personally I don't think this is a blocker for cpr-exec, but we should IMHO > record the differences. It would be best, IMHO, to have a section in > cpr.rst to discuss this, helping user decide which to choose when both > benefits from CPR in general. Will do. > Meanwhile, just to mention unit test for cpr-exec is still missing. 
I will rebase and post it after receiving all comments for V3. I think we are almost there. >>> The other thing I don't remember if we discussed, on how cpr-exec manages >>> device hotplugs. Say, what happens if there are devices hot plugged (via >>> QMP) then cpr-exec migration happens? >> One method: start new qemu with the original command-line arguments plus -S, then >> mgmt re-sends the hot plug commands to the qemu monitor. Same as for live >> migration. >>> Does cpr-exec cmdline need to convert all QMP hot-plugged devices into >>> cmdlines and append them? >> That also works, and is a technique I have used to reduce guest pause time. >> >>> How to guarantee src/dst device topology match >>> exactly the same with the new cmdline? >> >> That is up to the mgmt layer, to know how QEMU was originally started, and >> what has been hot plugged afterwards. The fast qom-list-get command that >> I recently added can help here. > > I see. If you think that is the best way to consume cpr-exec, would you > add a small section into the doc patch for it as well? It is not related to cpr-exec. It is related to hot plug, for any migration type scenario, so it does not fit in the cpr-exec docs. - Steve
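As a hedged illustration of the snapshot-and-diff idea above (this is not code from the series): user-created devices appear as children of /machine/peripheral, so a management layer can record that set when QEMU starts and again just before the CPR operation; the difference is the set of hot-plugged devices it must recreate in new QEMU. The sketch uses the long-standing qom-list command and a made-up QMP socket path; qom-list-get can return the same information in bulk, but its exact arguments are not shown here.

    import json
    import socket

    def qmp_connect(path):
        """Connect to a QMP unix socket and negotiate capabilities."""
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(path)
        f = sock.makefile("rw", buffering=1, encoding="utf-8", newline="\n")
        json.loads(f.readline())                       # server greeting
        f.write(json.dumps({"execute": "qmp_capabilities"}) + "\n")
        json.loads(f.readline())                       # {"return": {}} ack
        return f

    def qmp_cmd(f, name, **args):
        f.write(json.dumps({"execute": name, "arguments": args}) + "\n")
        while True:
            reply = json.loads(f.readline())           # skip async events
            if "return" in reply:
                return reply["return"]
            if "error" in reply:
                raise RuntimeError(reply["error"])

    def peripherals(f):
        """ids of devices created with -device or device_add."""
        props = qmp_cmd(f, "qom-list", path="/machine/peripheral")
        return {p["name"] for p in props if p["name"] != "type"}

    qmp = qmp_connect("/var/run/vm0.qmp")   # hypothetical socket path
    baseline = peripherals(qmp)             # taken once, when old QEMU was started
    # ... guest runs, devices are hot plugged via QMP ...
    before_cpr = peripherals(qmp)           # taken again, just before cpr-exec
    print("devices to recreate in new QEMU:", sorted(before_cpr - baseline))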
On Fri, Sep 12, 2025 at 10:50:34AM -0400, Steven Sistare wrote: > > > > How to guarantee src/dst device topology match > > > > exactly the same with the new cmdline? > > > > > > That is up to the mgmt layer, to know how QEMU was originally started, and > > > what has been hot plugged afterwards. The fast qom-list-get command that > > > I recently added can help here. > > > > I see. If you think that is the best way to consume cpr-exec, would you > > add a small section into the doc patch for it as well? > > It is not related to cpr-exec. It is related to hot plug, for any migration > type scenario, so it does not fit in the cpr-exec docs. IMHO it matters. With cpr-transfer, QMP hot plugs work and will not contribute to downtime. cpr-exec also works, but will contribute to downtime. We could, in the comparison section between cpr-exec and cpr-transfer, mention the potential difference in device hot plugs (out of many other differences), then also mention that there's an option to reduce downtime for cpr-exec due to hot-plug by converting QMP hot plugs into cmdlines leveraging qom-list-get and other facilities. From there we could further link to a special small section describing the usage of qom-list-get, or stop there. Thanks, -- Peter Xu
On 9/12/2025 11:44 AM, Peter Xu wrote: > On Fri, Sep 12, 2025 at 10:50:34AM -0400, Steven Sistare wrote: >>>>> How to guarantee src/dst device topology match >>>>> exactly the same with the new cmdline? >>>> >>>> That is up to the mgmt layer, to know how QEMU was originally started, and >>>> what has been hot plugged afterwards. The fast qom-list-get command that >>>> I recently added can help here. >>> >>> I see. If you think that is the best way to consume cpr-exec, would you >>> add a small section into the doc patch for it as well? >> >> It is not related to cpr-exec. It is related to hot plug, for any migration >> type scenario, so it does not fit in the cpr-exec docs. > > IMHO it matters.. With cpr-transfer, QMP hot plugs works and will not > contribute to downtime. I don't follow. The guest is not resumed until after all devices that were present in old QEMU are hot plugged in new QEMU, regardless of mode. > cpr-exec also works, but will contribute to > downtime. > > We could, in the comparison section between cpr-exec v.s. cpr-transfer, > mention the potential difference on device hot plugs (out of many other > differences), then also mention that there's an option to reduce downtime > for cpr-exec due to hot-plug by converting QMP hot plugs into cmdlines > leveraging qom-list-get and other facilities. From there we could further > link to a special small section describing the usage of qom-list-get, or > stop there. To hot plug a device, *or* to add it to the new QEMU command line, the manager must know that the device was added sometime after old QEMU started, and qom-list-get can help with that, by examining old QEMU initially and again immediately before the update, then performing a diff. But again, this is independent of mode. - Steve
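For the alternative Steve mentions, folding hot plugs back into the command line, the management layer needs to have recorded the arguments of each device_add it issued. Below is a hedged sketch of that bookkeeping; the original argv, the recorded device properties, and the state file path are all assumptions, and how the resulting argv is handed to the cpr-exec-command parameter (HMP migrate_set_parameter, or QMP migrate-set-parameters) is left out.

    import shlex

    # Assumed bookkeeping: for every successful device_add, mgmt stored the
    # driver and the properties it passed, keyed by device id.
    hotplugged = {
        "net1":  {"driver": "virtio-net-pci", "netdev": "hostnet1", "addr": "0x5"},
        "disk2": {"driver": "virtio-blk-pci", "drive": "drive2", "addr": "0x6"},
    }

    # The command line old QEMU was originally started with (assumption).
    original_argv = [
        "qemu-kvm",
        "-object", "memory-backend-memfd,id=ram0,size=1G",
        "-machine", "memory-backend=ram0",
        "-machine", "aux-ram-share=on",
    ]

    def device_arg(dev_id, props):
        """Turn one recorded device_add into an equivalent -device option."""
        fields = [props["driver"], f"id={dev_id}"]
        fields += [f"{k}={v}" for k, v in props.items() if k != "driver"]
        return ",".join(fields)

    exec_argv = list(original_argv)
    for dev_id, props in sorted(hotplugged.items()):
        exec_argv += ["-device", device_arg(dev_id, props)]
    exec_argv += ["-incoming", "file:vm.state"]

    # This argv is what would be supplied as cpr-exec-command.
    print(shlex.join(exec_argv))

Note that the recorded properties must pin down bus addresses (bus=, addr=) so each device lands at the same slot it occupied in old QEMU; otherwise the incoming device state will not match the topology, which is exactly the concern raised earlier in the thread.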
On 19.09.25 20:16, Steven Sistare wrote: > On 9/12/2025 11:44 AM, Peter Xu wrote: >> On Fri, Sep 12, 2025 at 10:50:34AM -0400, Steven Sistare wrote: >>>>>> How to guarantee src/dst device topology match >>>>>> exactly the same with the new cmdline? >>>>> >>>>> That is up to the mgmt layer, to know how QEMU was originally started, and >>>>> what has been hot plugged afterwards. The fast qom-list-get command that >>>>> I recently added can help here. >>>> >>>> I see. If you think that is the best way to consume cpr-exec, would you >>>> add a small section into the doc patch for it as well? >>> >>> It is not related to cpr-exec. It is related to hot plug, for any migration >>> type scenario, so it does not fit in the cpr-exec docs. >> >> IMHO it matters.. With cpr-transfer, QMP hot plugs works and will not >> contribute to downtime. > > I don't follow. The guest is not resumed until after all devices that were > present in old QEMU are hot plugged in new QEMU, regardless of mode. Yes, but in the case of cpr-transfer, the source is still running at the time when we add devices to the target through QMP. So downtime does not start until we say "migrate-incoming". > >> cpr-exec also works, but will contribute to >> downtime. >> >> We could, in the comparison section between cpr-exec v.s. cpr-transfer, >> mention the potential difference on device hot plugs (out of many other >> differences), then also mention that there's an option to reduce downtime >> for cpr-exec due to hot-plug by converting QMP hot plugs into cmdlines >> leveraging qom-list-get and other facilities. From there we could further >> link to a special small section describing the usage of qom-list-get, or >> stop there. > > To hot plug a device, *or* to add it to the new QEMU command line, the manager > must know that the device was added sometime after old QEMU started, and > qom-list-get can help with that, by examining old QEMU initially and again > immediately before the update, then performing a diff. But again, this > is independent of mode. > > - Steve -- Best regards, Vladimir
* Peter Xu (peterx@redhat.com) wrote: > Add Vladimir and Dan. > > On Thu, Aug 14, 2025 at 10:17:14AM -0700, Steve Sistare wrote: > > This patch series adds the live migration cpr-exec mode. > > > > The new user-visible interfaces are: > > * cpr-exec (MigMode migration parameter) > > * cpr-exec-command (migration parameter) > > > > cpr-exec mode is similar in most respects to cpr-transfer mode, with the > > primary difference being that old QEMU directly exec's new QEMU. The user > > specifies the command to exec new QEMU in the migration parameter > > cpr-exec-command. > > > > Why? > > > > In a containerized QEMU environment, cpr-exec reuses an existing QEMU > > container and its assigned resources. By contrast, cpr-transfer mode > > requires a new container to be created on the same host as the target of > > the CPR operation. Resources must be reserved for the new container, while > > the old container still reserves resources until the operation completes. > > Avoiding over commitment requires extra work in the management layer. > > Can we spell out what are these resources? > > CPR definitely relies on completely shared memory. That's already not a > concern. > > CPR resolves resources that are bound to devices like VFIO by passing over > FDs, these are not over commited either. > > Is it accounting QEMU/KVM process overhead? That would really be trivial, > IMHO, but maybe something else? > > > This is one reason why a cloud provider may prefer cpr-exec. A second reason > > is that the container may include agents with their own connections to the > > outside world, and such connections remain intact if the container is reused. > > We discussed about this one. Personally I still cannot understand why this > is a concern if the agents can be trivially started as a new instance. But > I admit I may not know the whole picture. To me, the above point is more > persuasive, but I'll need to understand which part that is over-commited > that can be a problem. > After all, cloud hosts should preserve some extra memory anyway to make > sure dynamic resources allocations all the time (e.g., when live migration > starts, KVM pgtables can drastically increase if huge pages are enabled, > for PAGE_SIZE trackings), I assumed the over-commit portion should be less > that those.. and when it's also temporary (src QEMU will release all > resources after live upgrade) then it looks manageable. k8s used to find it very hard to change the amount of memory allocated to a container after launch (although I heard that's getting fixed); so you'd need more excess at the start even if your peek during hand over is only very short. Dave > > > > > How? > > > > cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag, > > and by sending the unique name and value of each descriptor to new QEMU > > via CPR state. > > > > CPR state cannot be sent over the normal migration channel, because devices > > and backends are created prior to reading the channel, so this mode sends > > CPR state over a second migration channel that is not visible to the user. > > New QEMU reads the second channel prior to creating devices or backends. > > > > The exec itself is trivial. After writing to the migration channels, the > > migration code calls a new main-loop hook to perform the exec. > > > > Example: > > > > In this example, we simply restart the same version of QEMU, but in > > a real scenario one would use a new QEMU binary path in cpr-exec-command. 
> > > > # qemu-kvm -monitor stdio > > -object memory-backend-memfd,id=ram0,size=1G > > -machine memory-backend=ram0 -machine aux-ram-share=on ... > > > > QEMU 10.1.50 monitor - type 'help' for more information > > (qemu) info status > > VM status: running > > (qemu) migrate_set_parameter mode cpr-exec > > (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state > > (qemu) migrate -d file:vm.state > > (qemu) QEMU 10.1.50 monitor - type 'help' for more information > > (qemu) info status > > VM status: running > > > > Steve Sistare (9): > > migration: multi-mode notifier > > migration: add cpr_walk_fd > > oslib: qemu_clear_cloexec > > vl: helper to request exec > > migration: cpr-exec-command parameter > > migration: cpr-exec save and load > > migration: cpr-exec mode > > migration: cpr-exec docs > > vfio: cpr-exec mode > > The other thing is, as Vladimir is working on (looks like) a cleaner way of > passing FDs fully relying on unix sockets, I want to understand better on > the relationships of his work and the exec model. > > I still personally think we should always stick with unix sockets, but I'm > open to be convinced on above limitations. If exec is better than > cpr-transfer in any way, the hope is more people can and should adopt it. > > We also have no answer yet on how cpr-exec can resolve container world with > seccomp forbidding exec. I guess that's a no-go. It's definitely a > downside instead. Better mention that in the cover letter. > > Thanks, > > -- > Peter Xu > -- -----Open up your eyes, open up your mind, open up your code ------- / Dr. David Alan Gilbert | Running GNU/Linux | Happy \ \ dave @ treblig.org | | In Hex / \ _________________________|_____ http://www.treblig.org |_______/
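The descriptor-preservation mechanism quoted above (clear CLOEXEC, then tell new QEMU which fd number means what via CPR state) can be demonstrated outside QEMU with a minimal Python sketch. It is illustrative only: an environment variable stands in for CPR state, /etc/hostname stands in for a descriptor worth preserving (vfio, memfd, or chardev fds in real life), and the script simply re-execs itself.

    import fcntl
    import os
    import sys

    FD_ENV = "DEMO_PRESERVED_FD"        # stand-in for CPR state

    if FD_ENV not in os.environ:
        # "Old QEMU": open a descriptor and make sure it survives exec.
        fd = os.open("/etc/hostname", os.O_RDONLY)
        flags = fcntl.fcntl(fd, fcntl.F_GETFD)
        fcntl.fcntl(fd, fcntl.F_SETFD, flags & ~fcntl.FD_CLOEXEC)   # qemu_clear_cloexec analogue
        os.environ[FD_ENV] = str(fd)    # tell the new image which fd to look for
        # "cpr-exec": replace this process with the new image.
        os.execv(sys.executable, [sys.executable, os.path.abspath(__file__)])
    else:
        # "New QEMU": the fd number arrived via the stand-in for CPR state; the
        # descriptor itself survived the exec because CLOEXEC was cleared.
        fd = int(os.environ[FD_ENV])
        print("inherited fd", fd, "->", os.read(fd, 64))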
On Fri, Sep 05, 2025 at 05:09:05PM +0000, Dr. David Alan Gilbert wrote: > k8s used to find it very hard to change the amount of memory allocated to a > container after launch (although I heard that's getting fixed); so you'd > need more excess at the start even if your peek during hand over is only > very short. When kubevirt needs to support CPR, it will have to do live migration as usual, normally by creating a separate container to hold the destination QEMU. So the hope is there's no need to change the memory setup. I think it's not yet possible to start two QEMUs in one container after all, because QEMU, in the case of kubevirt, is always paired with a libvirt instance, and AFAICT libvirt still doesn't support two instances appearing in the same container. So another container should be required to trigger a live migration, CPR or not. PS: I never fully understood why that's a challenge, btw, especially for growing memory rather than shrinking it. For CPU resources we have the same issue: a container cannot easily hot plug CPU resources, which made multifd almost useless for kubevirt when people use a dedicated CPU topology. It means all multifd threads will run either on one physical core (together with all the rest of QEMU's mgmt threads, like the main thread), or directly on vCPU threads, which is even worse. -- Peter Xu