[RFC V2 0/8] Live update: tap and vhost

Steve Sistare posted 8 patches 4 months ago
Failed in applying to current master
[RFC V2 0/8] Live update: tap and vhost
Posted by Steve Sistare 4 months ago
Tap and vhost devices can be preserved during cpr-transfer using
traditional live migration methods, wherein the management layer
creates new interfaces for the target and fiddles with 'ip link'
to deactivate the old interface and activate the new.

However, CPR can simply send the file descriptors to new QEMU,
with no special management actions required.  The user enables
this behavior by specifying '-netdev tap,cpr=on'.  The default
is cpr=off.
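
As an illustration of the mechanism the cover letter relies on, here is a
minimal sketch of handing an open descriptor to another process over an
AF_UNIX socket with SCM_RIGHTS.  It is not code from this series: the helper
names send_fd() and recv_fd() are hypothetical, and error handling is reduced
to a -1 return.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctrl {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send one open descriptor over a connected AF_UNIX socket; 0 or -1. */
static int send_fd(int sock, int fd)
{
    char byte = 'F';                /* one data byte carries the message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new fd number, or -1 on failure. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open file,
so a tap device opened by the old process stays usable in the new one.  In
QEMU the equivalent work is done by the socket channel code in
io/channel-socket.c, not by helpers like these.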

Steve Sistare (8):
  migration: stop vm earlier for cpr
  migration: cpr setup notifier
  vhost: reset vhost devices for cpr
  cpr: delete all fds
  Revert "vhost-backend: remove vhost_kernel_reset_device()"
  tap: common return label
  tap: cpr support
  tap: postload fix for cpr

 qapi/net.json             |   5 +-
 include/hw/virtio/vhost.h |   1 +
 include/migration/cpr.h   |   3 +-
 include/net/tap.h         |   1 +
 hw/net/virtio-net.c       |  20 +++++++
 hw/vfio/device.c          |   2 +-
 hw/virtio/vhost-backend.c |   6 ++
 hw/virtio/vhost.c         |  32 +++++++++++
 migration/cpr.c           |  24 ++++++--
 migration/migration.c     |  38 ++++++++-----
 net/tap-win32.c           |   5 ++
 net/tap.c                 | 141 +++++++++++++++++++++++++++++++++++-----------
 12 files changed, 223 insertions(+), 55 deletions(-)

-- 
1.8.3.1
Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Ben Chaney 2 weeks, 1 day ago
> Tap and vhost devices can be preserved during cpr-transfer

Hi Steve,

    I tested this patch set. With the two fixes, it is working
as expected. Are there plans for a v3 of this patch set?

Thanks,
     Ben
Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Peter Xu 2 weeks, 1 day ago
On Thu, Oct 30, 2025 at 12:52:23PM -0400, Ben Chaney wrote:
> > Tap and vhost devices can be preserved during cpr-transfer
> 
> Hi Steve,
> 
>     I tested this patch set. With the two fixes, it is working
> as expected. Are there plans for a v3 of this patch set?

Steve has retired.

Copy Mark Kanda <mark.kanda@oracle.com>.

-- 
Peter Xu
Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Chaney, Ben 2 weeks, 1 day ago
> > I tested this patch set. With the two fixes, it is working
> > as expected. Are there plans for a v3 of this patch set?
>
> Steve has retired.
>
> Copy Mark Kanda <mark.kanda@oracle.com>.

Hi Mark,

        Nice to meet you! Do you have any plans to continue work on this patch set?

        We have had success with it and hope to see it merged.

Thanks,
        Ben

Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Vladimir Sementsov-Ogievskiy 2 months, 3 weeks ago
On 17.07.25 21:39, Steve Sistare wrote:
> Tap and vhost devices can be preserved during cpr-transfer using
> traditional live migration methods, wherein the management layer
> creates new interfaces for the target and fiddles with 'ip link'
> to deactivate the old interface and activate the new.
> 
> However, CPR can simply send the file descriptors to new QEMU,
> with no special management actions required.  The user enables
> this behavior by specifying '-netdev tap,cpr=on'.  The default
> is cpr=off.
> 


Hi Steve!

First, here is my attempt to test the series:

SOURCE:

sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :0 -nodefaults -vga std -qmp stdio -msg timestamp -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on

{"execute": "qmp_capabilities"}
{"return": {}}
{"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
{"return": {}}
{"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
{"return": {}}
{"execute": "cont"}
{"timestamp": {"seconds": 1755977653, "microseconds": 248749}, "event": "RESUME"}
{"return": {}}
{"timestamp": {"seconds": 1755977657, "microseconds": 366274}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "vnet.1", "path": "/machine/peripheral/vnet.1/virtio-backend"}}
{"execute": "migrate-set-parameters", "arguments": {"mode": "cpr-transfer"}}
{"return": {}}
{"execute": "migrate", "arguments": {"channels": [{"channel-type": "main", "addr": {"path": "/tmp/migr.sock", "transport": "socket", "type": "unix"}}, {"channel-type": "cpr", "addr": {"path": "/tmp/cpr.sock", "transport": "socket", "type": "unix"}}]}}
{"timestamp": {"seconds": 1755977767, "microseconds": 835571}, "event": "STOP"}
{"return": {}}


TARGET:

sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :1 -nodefaults -vga std -qmp stdio -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on -incoming defer -incoming '{"channel-type": "cpr","addr": { "transport": "socket","type": "unix", "path": "/tmp/cpr.sock"}}'

<need to wait until "migrate" on source>

{"execute": "qmp_capabilities"}
{"return": {}}
{"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
{"return": {}}
{"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
could not disable queue
qemu-system-x86_64: ../hw/net/virtio-net.c:771: virtio_net_set_queue_pairs: Assertion `!r' failed.
fish: Job 1, 'sudo build/qemu-system-x86_64 -…' terminated by signal SIGABRT (Abort)


So, it crashes on device_add.


Second, I've come a long way: I backported your TAP v1 series, together with the needed parts of CPR and migration channels, to QEMU 7.2, and fixed several issues (for example: avoid reinitializing the vnet_hdr length on the target, avoid simultaneous use of the tap on source and target, avoid making the fd blocking again on the target), and it finally started to work.
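
The blocking-mode issue mentioned above comes from O_NONBLOCK being a property
of the open file description, not of the descriptor number, so every
descriptor that refers to the same open file sees the same mode.  The sketch
below uses dup() as a stand-in for a descriptor received over SCM_RIGHTS
(both share the open file description); the helper names are hypothetical and
are not part of QEMU.

```c
#include <fcntl.h>
#include <unistd.h>

/* Return 1 if fd is non-blocking, 0 if blocking, -1 on error. */
static int fd_is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1) {
        return -1;
    }
    return (flags & O_NONBLOCK) ? 1 : 0;
}

/* Set (on != 0) or clear (on == 0) O_NONBLOCK; returns 0 or -1. */
static int fd_set_nonblocking(int fd, int on)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1) {
        return -1;
    }
    flags = on ? (flags | O_NONBLOCK) : (flags & ~O_NONBLOCK);
    return fcntl(fd, F_SETFL, flags);
}
```

If the side that receives the descriptor clears O_NONBLOCK, the side that
still holds the original descriptor silently becomes blocking as well, which
is presumably why the receiving side should leave the mode alone while both
sides hold the descriptor.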

But next, I went on to support similar migration for vhost-user-blk, and that was a lot more complex. There is no reason to pass an fd in a preliminary stage, while the source is still running (as in CPR), because:

1. we can't use the fd on the target at all until we stop using it on the source; otherwise we break the vhost-user-blk protocol on the wire (unlike TAP, where some ioctls called on the target don't break the source)
2. we have to pass a fair number of additional variables, which are simpler to pass through the normal migration channel (how to pass anything except fds through the cpr channel?)

So, I decided to go another way and just migrate everything backend-related, including fds, through the main migration channel. Of course, this requires a deep rework of device initialization for incoming migration (but for vhost-user-blk we need it anyway). The feature is in my series "[PATCH 00/33] vhost-user-blk: live-backend local migration" (you are in CC).

The success with vhost-user-blk of course made me rethink TAP migration too: try to avoid using the additional cpr channel and the unusual wait for the QMP interface on the target. And I've just sent an RFC: "[RFC 0/7] virtio-net: live-TAP local migration"

What do you think?

-- 
Best regards,
Vladimir

Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Steven Sistare 2 months, 2 weeks ago
On 8/23/2025 5:53 PM, Vladimir Sementsov-Ogievskiy wrote:
> On 17.07.25 21:39, Steve Sistare wrote:
>> Tap and vhost devices can be preserved during cpr-transfer using
>> traditional live migration methods, wherein the management layer
>> creates new interfaces for the target and fiddles with 'ip link'
>> to deactivate the old interface and activate the new.
>>
>> However, CPR can simply send the file descriptors to new QEMU,
>> with no special management actions required.  The user enables
>> this behavior by specifying '-netdev tap,cpr=on'.  The default
>> is cpr=off.
> 
> Hi Steve!
> 
> First, me trying to test the series:

Thank you, Vladimir, for all the work you are doing in this area.  I have
reproduced the "virtio_net_set_queue_pairs: Assertion `!r' failed." bug.
Let me dig into that before I study the larger questions you pose
about preserving tap/vhost-user-blk in local migration versus cpr.

- Steve

> SOURCE:
> 
> sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :0 -nodefaults -vga std -qmp stdio -msg timestamp -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on
> 
> {"execute": "qmp_capabilities"}
> {"return": {}}
> {"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
> {"return": {}}
> {"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
> {"return": {}}
> {"execute": "cont"}
> {"timestamp": {"seconds": 1755977653, "microseconds": 248749}, "event": "RESUME"}
> {"return": {}}
> {"timestamp": {"seconds": 1755977657, "microseconds": 366274}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "vnet.1", "path": "/machine/peripheral/vnet.1/virtio-backend"}}
> {"execute": "migrate-set-parameters", "arguments": {"mode": "cpr-transfer"}}
> {"return": {}}
> {"execute": "migrate", "arguments": {"channels": [{"channel-type": "main", "addr": {"path": "/tmp/migr.sock", "transport": "socket", "type": "unix"}}, {"channel-type": "cpr", "addr": {"path": "/tmp/cpr.sock", "transport": "socket", "type": "unix"}}]}}
> {"timestamp": {"seconds": 1755977767, "microseconds": 835571}, "event": "STOP"}
> {"return": {}}
> 
> 
> TARGET:
> 
> sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :1 -nodefaults -vga std -qmp stdio -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on -incoming defer -incoming '{"channel-type": "cpr","addr": { "transport": "socket","type": "unix", "path": "/tmp/cpr.sock"}}'
> 
> <need to wait until "migrate" on source>
> 
> {"execute": "qmp_capabilities"}
> {"return": {}}
> {"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
> {"return": {}}
> {"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
> could not disable queue
> qemu-system-x86_64: ../hw/net/virtio-net.c:771: virtio_net_set_queue_pairs: Assertion `!r' failed.
> fish: Job 1, 'sudo build/qemu-system-x86_64 -…' terminated by signal SIGABRT (Abort)
> 
> 
> So, it crashes on device_add..
> 
> 
> Second, I've come a long way, backporting you TAP v1 series together with needed parts of CPR and migration channels to QEMU 7.2, fixing different issues (like, avoid reinitialization of vnet_hdr length on target, avoid simultaneous use of tap on source an target, avoid making the fd blocking again on target), and it finally started to work.
> 
> But next, I went to support similar migration for vhost-user-blk, and that was a lot more complex. No reason to pass an fd in preliminary stage, when source is running (like in CPR), because:
> 
> 1. we just can't use the fd on target at all, until we stop use it on source, otherwise we just break vhost-user-blk protocol on the wire (unlike TAP, where some ioctls called on target doesn't break source)
> 2. we have to pass enough additional variables, which are simpler to pass through normal migration channel (how to pass anything except fds through cpr channel?)
> 
> So, I decided to go another way, and just migrate everything backend-related including fds through main migration channel. Of course, this requires deep reworking of device initialization in case of incoming migration (but for vhost-user-blk we need it anyway). The feature is in my series "[PATCH 00/33] vhost-user-blk: live-backend local migration" (you are in CC).
> 
> The success with vhost-user-blk (of-course) make me rethink TAP migration too: try to avoid using additional cpr channel and unusual waiting for QMP interface on target. And, I've just sent an RFC: "[RFC 0/7] virtio-net: live-TAP local migration"
> 
> What do you think?
> 



Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Steven Sistare 2 months, 2 weeks ago
On 8/28/2025 11:48 AM, Steven Sistare wrote:
> On 8/23/2025 5:53 PM, Vladimir Sementsov-Ogievskiy wrote:
>> On 17.07.25 21:39, Steve Sistare wrote:
>>> Tap and vhost devices can be preserved during cpr-transfer using
>>> traditional live migration methods, wherein the management layer
>>> creates new interfaces for the target and fiddles with 'ip link'
>>> to deactivate the old interface and activate the new.
>>>
>>> However, CPR can simply send the file descriptors to new QEMU,
>>> with no special management actions required.  The user enables
>>> this behavior by specifying '-netdev tap,cpr=on'.  The default
>>> is cpr=off.
>>
>> Hi Steve!
>>
>> First, me trying to test the series:
> 
> Thank-you Vladimir for all the work you are doing in this area.  I have
> reproduced the "virtio_net_set_queue_pairs: Assertion `!r' failed." bug.
> Let me dig into that before I study the larger questions you pose
> about preserving tap/vhost-user-blk in local migration versus cpr.

I have reproduced your journey!  I fixed the assertion, the vnet_hdr, and
the blocking fd problems which you allude to.  The attached patch fixes
them, and will be squashed into the series.

Ben, you also reported the !r assertion failure, so this fix should help
you also.

>> SOURCE:
>>
>> sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :0 -nodefaults -vga std -qmp stdio -msg timestamp -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on
>>
>> {"execute": "qmp_capabilities"}
>> {"return": {}}
>> {"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
>> {"return": {}}
>> {"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
>> {"return": {}}
>> {"execute": "cont"}
>> {"timestamp": {"seconds": 1755977653, "microseconds": 248749}, "event": "RESUME"}
>> {"return": {}}
>> {"timestamp": {"seconds": 1755977657, "microseconds": 366274}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "vnet.1", "path": "/machine/peripheral/vnet.1/virtio-backend"}}
>> {"execute": "migrate-set-parameters", "arguments": {"mode": "cpr-transfer"}}
>> {"return": {}}
>> {"execute": "migrate", "arguments": {"channels": [{"channel-type": "main", "addr": {"path": "/tmp/migr.sock", "transport": "socket", "type": "unix"}}, {"channel-type": "cpr", "addr": {"path": "/tmp/cpr.sock", "transport": "socket", "type": "unix"}}]}}
>> {"timestamp": {"seconds": 1755977767, "microseconds": 835571}, "event": "STOP"}
>> {"return": {}}
>>
>> TARGET:
>>
>> sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :1 -nodefaults -vga std -qmp stdio -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on -incoming defer -incoming '{"channel-type": "cpr","addr": { "transport": "socket","type": "unix", "path": "/tmp/cpr.sock"}}'
>>
>> <need to wait until "migrate" on source>
>>
>> {"execute": "qmp_capabilities"}
>> {"return": {}}
>> {"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
>> {"return": {}}
>> {"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
>> could not disable queue
>> qemu-system-x86_64: ../hw/net/virtio-net.c:771: virtio_net_set_queue_pairs: Assertion `!r' failed.
>> fish: Job 1, 'sudo build/qemu-system-x86_64 -…' terminated by signal SIGABRT (Abort)
>>
>> So, it crashes on device_add..
>>
>> Second, I've come a long way, backporting you TAP v1 series together with needed parts of CPR and migration channels to QEMU 7.2, fixing different issues (like, avoid reinitialization of vnet_hdr length on target, avoid simultaneous use of tap on source an target, avoid making the fd blocking again on target), and it finally started to work.
>>
>> But next, I went to support similar migration for vhost-user-blk, and that was a lot more complex. No reason to pass an fd in preliminary stage, when source is running (like in CPR), because:
>>
>> 1. we just can't use the fd on target at all, until we stop use it on source, otherwise we just break vhost-user-blk protocol on the wire (unlike TAP, where some ioctls called on target doesn't break source)
>> 2. we have to pass enough additional variables, which are simpler to pass through normal migration channel (how to pass anything except fds through cpr channel?)

You can pass extra state through the cpr channel.  See for example vmstate_cpr_vfio_device,
and how vmstate_cpr_vfio_devices is defined as a sub-section of vmstate_cpr_state.

>> So, I decided to go another way, and just migrate everything backend-related including fds through main migration channel. Of course, this requires deep reworking of device initialization in case of incoming migration (but for vhost-user-blk we need it anyway). The feature is in my series "[PATCH 00/33] vhost-user-blk: live-backend local migration" (you are in CC).

You did a lot of work in those series!
I suspect much less rework of initialization is required if you pass variables in cpr state.

>> The success with vhost-user-blk (of-course) make me rethink TAP migration too: try to avoid using additional cpr channel and unusual waiting for QMP interface on target. And, I've just sent an RFC: "[RFC 0/7] virtio-net: live-TAP local migration"

Is there a use case for this outside of CPR?
CPR is intended to be the "local migration" solution that does it all :)
But if you do proceed with your local migration tap solution, I would want
to see that CPR could also use your code paths.

- Steve
From 191521210222638940c17d4daefc313d4ad61aa3 Mon Sep 17 00:00:00 2001
From: Steve Sistare <steven.sistare@oracle.com>
Date: Thu, 28 Aug 2025 13:47:15 -0700
Subject: [PATCH] tap: cpr fixes

Fix "virtio_net_set_queue_pairs: Assertion `!r' failed."
Fix "virtio-net: saved image requires vnet_hdr=on"
Do not change blocking mode of incoming cpr fd's.

Reported-by: Ben Chaney <bchaney@akamai.com>
Reported-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
---
 hw/net/virtio-net.c | 6 ++++++
 io/channel-socket.c | 5 ++++-
 net/tap.c           | 2 ++
 stubs/cpr.c         | 5 +++++
 4 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 7c40cca..7dd8a80 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -37,6 +37,7 @@
 #include "qapi/qapi-types-migration.h"
 #include "qapi/qapi-events-migration.h"
 #include "hw/virtio/virtio-access.h"
+#include "migration/cpr.h"
 #include "migration/misc.h"
 #include "standard-headers/linux/ethtool.h"
 #include "system/system.h"
@@ -758,6 +759,11 @@ static void virtio_net_set_queue_pairs(VirtIONet *n)
     int i;
     int r;
 
+    if (cpr_is_incoming()) {
+        /* peers are already attached, do nothing */
+        return;
+    }
+
     if (n->nic->peer_deleted) {
         return;
     }
diff --git a/io/channel-socket.c b/io/channel-socket.c
index 3b7ca92..736d39d 100644
--- a/io/channel-socket.c
+++ b/io/channel-socket.c
@@ -24,6 +24,7 @@
 #include "io/channel-socket.h"
 #include "io/channel-util.h"
 #include "io/channel-watch.h"
+#include "migration/cpr.h"
 #include "trace.h"
 #include "qapi/clone-visitor.h"
 #ifdef CONFIG_LINUX
@@ -498,7 +499,9 @@ static void qio_channel_socket_copy_fds(struct msghdr *msg,
             }
 
             /* O_NONBLOCK is preserved across SCM_RIGHTS so reset it */
-            qemu_socket_set_block(fd);
+            if (!cpr_is_incoming()) {
+                qemu_socket_set_block(fd);
+            }
 
 #ifndef MSG_CMSG_CLOEXEC
             qemu_set_cloexec(fd);
diff --git a/net/tap.c b/net/tap.c
index 25ff96f..f95ffe9 100644
--- a/net/tap.c
+++ b/net/tap.c
@@ -1042,6 +1042,8 @@ free_fail:
                 if (cpr && fd >= 0) {
                     cpr_save_fd(name, TAP_FD_INDEX(i), fd);
                 }
+            } else {
+                vnet_hdr = tap->has_vnet_hdr ? tap->vnet_hdr : 1;
             }
             if (fd == -1) {
                 ret = -1;
diff --git a/stubs/cpr.c b/stubs/cpr.c
index 6a5c320..86a507c 100644
--- a/stubs/cpr.c
+++ b/stubs/cpr.c
@@ -19,3 +19,8 @@ int cpr_find_fd(const char *name, int id)
 {
     return -1;
 }
+
+bool cpr_is_incoming(void)
+{
+    return false;
+}
-- 
1.8.3.1
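
For readers unfamiliar with the cpr_save_fd() and cpr_find_fd() calls that
appear in the patch above, the following toy model sketches the idea of a
table that maps a (name, id) pair to a descriptor and returns -1 when nothing
was saved.  It is an assumption-laden illustration, not QEMU's
implementation: the toy_* names, the fixed-size array, and the limits are all
made up for the example.

```c
#include <string.h>

#define TOY_MAX_FDS  16   /* arbitrary capacity for the sketch */
#define TOY_NAME_LEN 32   /* arbitrary name limit for the sketch */

struct toy_fd_entry {
    char name[TOY_NAME_LEN];  /* owner, e.g. a netdev id */
    int id;                   /* index within the owner, e.g. a queue */
    int fd;                   /* the preserved descriptor */
};

static struct toy_fd_entry toy_fds[TOY_MAX_FDS];
static int toy_nfds;

/* Record fd under (name, id); returns 0, or -1 if it does not fit. */
static int toy_save_fd(const char *name, int id, int fd)
{
    struct toy_fd_entry *e;

    if (toy_nfds == TOY_MAX_FDS || strlen(name) >= TOY_NAME_LEN) {
        return -1;
    }
    e = &toy_fds[toy_nfds++];
    strcpy(e->name, name);
    e->id = id;
    e->fd = fd;
    return 0;
}

/* Look up the fd saved under (name, id); returns -1 if there is none. */
static int toy_find_fd(const char *name, int id)
{
    int i;

    for (i = 0; i < toy_nfds; i++) {
        if (toy_fds[i].id == id && strcmp(toy_fds[i].name, name) == 0) {
            return toy_fds[i].fd;
        }
    }
    return -1;
}
```

With a table like this, the new process can ask for the descriptor of a given
netdev and queue index and fall back to opening the device itself when the
lookup returns -1.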

Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Vladimir Sementsov-Ogievskiy 2 months, 2 weeks ago
On 29.08.25 22:37, Steven Sistare wrote:
> On 8/28/2025 11:48 AM, Steven Sistare wrote:
>> On 8/23/2025 5:53 PM, Vladimir Sementsov-Ogievskiy wrote:
>>> On 17.07.25 21:39, Steve Sistare wrote:
>>>> Tap and vhost devices can be preserved during cpr-transfer using
>>>> traditional live migration methods, wherein the management layer
>>>> creates new interfaces for the target and fiddles with 'ip link'
>>>> to deactivate the old interface and activate the new.
>>>>
>>>> However, CPR can simply send the file descriptors to new QEMU,
>>>> with no special management actions required.  The user enables
>>>> this behavior by specifying '-netdev tap,cpr=on'.  The default
>>>> is cpr=off.
>>>
>>> Hi Steve!
>>>
>>> First, me trying to test the series:
>>
>> Thank-you Vladimir for all the work you are doing in this area.  I have
>> reproduced the "virtio_net_set_queue_pairs: Assertion `!r' failed." bug.
>> Let me dig into that before I study the larger questions you pose
>> about preserving tap/vhost-user-blk in local migration versus cpr.
> 
> I have reproduced your journey!  I fixed the assertion, the vnet_hdr, and
> the blocking fd problems which you allude to.  The attached patch fixes
> them, and will be squashed into the series.
> 
> Ben, you also reported the !r assertion failure, so this fix should help
> you also.
> 
>>> SOURCE:
>>>
>>> sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :0 -nodefaults -vga std -qmp stdio -msg timestamp -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on
>>>
>>> {"execute": "qmp_capabilities"}
>>> {"return": {}}
>>> {"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
>>> {"return": {}}
>>> {"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
>>> {"return": {}}
>>> {"execute": "cont"}
>>> {"timestamp": {"seconds": 1755977653, "microseconds": 248749}, "event": "RESUME"}
>>> {"return": {}}
>>> {"timestamp": {"seconds": 1755977657, "microseconds": 366274}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "vnet.1", "path": "/machine/peripheral/vnet.1/virtio-backend"}}
>>> {"execute": "migrate-set-parameters", "arguments": {"mode": "cpr-transfer"}}
>>> {"return": {}}
>>> {"execute": "migrate", "arguments": {"channels": [{"channel-type": "main", "addr": {"path": "/tmp/migr.sock", "transport": "socket", "type": "unix"}}, {"channel-type": "cpr", "addr": {"path": "/tmp/cpr.sock", "transport": "socket", "type": "unix"}}]}}
>>> {"timestamp": {"seconds": 1755977767, "microseconds": 835571}, "event": "STOP"}
>>> {"return": {}}
>>>
>>> TARGET:
>>>
>>> sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :1 -nodefaults -vga std -qmp stdio -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on -incoming defer -incoming '{"channel-type": "cpr","addr": { "transport": "socket","type": "unix", "path": "/tmp/cpr.sock"}}'
>>>
>>> <need to wait until "migrate" on source>
>>>
>>> {"execute": "qmp_capabilities"}
>>> {"return": {}}
>>> {"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
>>> {"return": {}}
>>> {"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
>>> could not disable queue
>>> qemu-system-x86_64: ../hw/net/virtio-net.c:771: virtio_net_set_queue_pairs: Assertion `!r' failed.
>>> fish: Job 1, 'sudo build/qemu-system-x86_64 -…' terminated by signal SIGABRT (Abort)
>>>
>>> So, it crashes on device_add..
>>>
>>> Second, I've come a long way, backporting you TAP v1 series together with needed parts of CPR and migration channels to QEMU 7.2, fixing different issues (like, avoid reinitialization of vnet_hdr length on target, avoid simultaneous use of tap on source an target, avoid making the fd blocking again on target), and it finally started to work.
>>>
>>> But next, I went to support similar migration for vhost-user-blk, and that was a lot more complex. No reason to pass an fd in preliminary stage, when source is running (like in CPR), because:
>>>
>>> 1. we just can't use the fd on target at all, until we stop use it on source, otherwise we just break vhost-user-blk protocol on the wire (unlike TAP, where some ioctls called on target doesn't break source)
>>> 2. we have to pass enough additional variables, which are simpler to pass through normal migration channel (how to pass anything except fds through cpr channel?)
> 
> You can pass extra state through the cpr channel.  See for example vmstate_cpr_vfio_device,
> and how vmstate_cpr_vfio_devices is defined as a sub-section of vmstate_cpr_state.

Oh, I missed this.

Hmm. Still, in the end CPR becomes just an additional stage of migration, one that is done prior to device initialization on the target.

Have you considered integrating it into the common scheme, so that devices may have .vmsd_cpr in addition to .vmsd? This way we wouldn't need a global CPR state, and the CPR stage of migration would work the same way as normal migration.

Also, if we pass some state in CPR, it should be effectively constant. We need a guarantee that it will not change between migration start and source stop.

> 
>>> So, I decided to go another way, and just migrate everything backend-related including fds through main migration channel. Of course, this requires deep reworking of device initialization in case of incoming migration (but for vhost-user-blk we need it anyway). The feature is in my series "[PATCH 00/33] vhost-user-blk: live-backend local migration" (you are in CC).
> 
> You did a lot of work in those series!
> I suspect much less rework of initialization is required if you pass variables in cpr state.

Not sure. I had to rework initialization anyway, as initialization damaged the connection. And this led me to the idea: if we rework it anyway, why not go with one migration channel?

> 
>>> The success with vhost-user-blk (of-course) make me rethink TAP migration too: try to avoid using additional cpr channel and unusual waiting for QMP interface on target. And, I've just sent an RFC: "[RFC 0/7] virtio-net: live-TAP local migration"
> 
> Is there a use case for this outside of CPR?

It simply works without CPR. Would CPR bring any additional benefit if I enabled it in a setup that already uses my local-tap and local-vhost-user-blk capabilities (plus ignore-shared, of course)?


> CPR is intended to be the "local migration" solution that does it all :)
> But if you do proceed with your local migration tap solution, I would want
> to see that CPR could also use your code paths.
> 
CPR can transparently use my code: you may enable both CPR and the local-tap capability, and it should work. Some devices will migrate their fds through CPR, while TAP fds and state will migrate through the main migration channel. Making both channels unix sockets should not add considerable overhead, I think.

Why I like my solution more:

- no additional channel
- no additional logic in management software (to handle target start with no QMP access until "migrate" command on source)
- less code to backport (that's personal, of course, not an argument for the final upstream solution)

It seems that CPR is simpler to support, as we don't need to do a deep rework of the initialization code. But in reality, there is a lot of work anyway: the TAP and vhost-user-blk cases prove this. Your series about vfio are also huge.

What is the benefit of CPR over simple (unix-socket) migration?


-- 
Best regards,
Vladimir

Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Steven Sistare 2 months, 1 week ago
On 9/1/2025 7:44 AM, Vladimir Sementsov-Ogievskiy wrote:
> On 29.08.25 22:37, Steven Sistare wrote:
>> On 8/28/2025 11:48 AM, Steven Sistare wrote:
>>> On 8/23/2025 5:53 PM, Vladimir Sementsov-Ogievskiy wrote:
>>>> On 17.07.25 21:39, Steve Sistare wrote:
>>>>> Tap and vhost devices can be preserved during cpr-transfer using
>>>>> traditional live migration methods, wherein the management layer
>>>>> creates new interfaces for the target and fiddles with 'ip link'
>>>>> to deactivate the old interface and activate the new.
>>>>>
>>>>> However, CPR can simply send the file descriptors to new QEMU,
>>>>> with no special management actions required.  The user enables
>>>>> this behavior by specifing '-netdev tap,cpr=on'.  The default
>>>>> is cpr=off.
>>>>
>>>> Hi Steve!
>>>>
>>>> First, me trying to test the series:
>>>
>>> Thank-you Vladimir for all the work you are doing in this area.  I have
>>> reproduced the "virtio_net_set_queue_pairs: Assertion `!r' failed." bug.
>>> Let me dig into that before I study the larger questions you pose
>>> about preserving tap/vhost-user-blk in local migration versus cpr.
>>
>> I have reproduced your journey!  I fixed the assertion, the vnet_hdr, and
>> the blocking fd problems which you allude to.  The attached patch fixes
>> them, and will be squashed into the series.
>>
>> Ben, you also reported the !r assertion failure, so this fix should help
>> you also.
>>
>>>> SOURCE:
>>>>
>>>> sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :0 -nodefaults -vga std -qmp stdio -msg timestamp -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on
>>>>
>>>> {"execute": "qmp_capabilities"}
>>>> {"return": {}}
>>>> {"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
>>>> {"return": {}}
>>>> {"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
>>>> {"return": {}}
>>>> {"execute": "cont"}
>>>> {"timestamp": {"seconds": 1755977653, "microseconds": 248749}, "event": "RESUME"}
>>>> {"return": {}}
>>>> {"timestamp": {"seconds": 1755977657, "microseconds": 366274}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "vnet.1", "path": "/machine/peripheral/vnet.1/virtio-backend"}}
>>>> {"execute": "migrate-set-parameters", "arguments": {"mode": "cpr-transfer"}}
>>>> {"return": {}}
>>>> {"execute": "migrate", "arguments": {"channels": [{"channel-type": "main", "addr": {"path": "/tmp/migr.sock", "transport": "socket", "type": "unix"}}, {"channel-type": "cpr", "addr": {"path": "/tmp/cpr.sock", "transport": "socket", "type": "unix"}}]}}
>>>> {"timestamp": {"seconds": 1755977767, "microseconds": 835571}, "event": "STOP"}
>>>> {"return": {}}
>>>>
>>>> TARGET:
>>>>
>>>> sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :1 -nodefaults -vga std -qmp stdio -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on -incoming defer -incoming '{"channel-type": "cpr","addr": { "transport": "socket","type": "unix", "path": "/tmp/cpr.sock"}}'
>>>>
>>>> <need to wait until "migrate" on source>
>>>>
>>>> {"execute": "qmp_capabilities"}
>>>> {"return": {}}
>>>> {"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
>>>> {"return": {}}
>>>> {"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
>>>> could not disable queue
>>>> qemu-system-x86_64: ../hw/net/virtio-net.c:771: virtio_net_set_queue_pairs: Assertion `!r' failed.
>>>> fish: Job 1, 'sudo build/qemu-system-x86_64 -…' terminated by signal SIGABRT (Abort)
>>>>
>>>> So, it crashes on device_add..
>>>>
>>>> Second, I've come a long way, backporting you TAP v1 series together with needed parts of CPR and migration channels to QEMU 7.2, fixing different issues (like, avoid reinitialization of vnet_hdr length on target, avoid simultaneous use of tap on source an target, avoid making the fd blocking again on target), and it finally started to work.
>>>>
>>>> But next, I went to support similar migration for vhost-user-blk, and that was a lot more complex. No reason to pass an fd in preliminary stage, when source is running (like in CPR), because:
>>>>
>>>> 1. we just can't use the fd on target at all, until we stop use it on source, otherwise we just break vhost-user-blk protocol on the wire (unlike TAP, where some ioctls called on target doesn't break source)
>>>> 2. we have to pass enough additional variables, which are simpler to pass through normal migration channel (how to pass anything except fds through cpr channel?)
>>
>> You can pass extra state through the cpr channel.  See for example vmstate_cpr_vfio_device,
>> and how vmstate_cpr_vfio_devices is defined as a sub-section of vmstate_cpr_state.
> 
> O, I missed this.
> 
> Hmm. Still, finally CPR becomes just an additional stage of migration, which is done prior device initialization on target..
> 
> Didn't you think of integrating it to the common scheme: so that devices may have .vmsd_cpr in addition to .vmsd? This way we don't need a global CPR state, and CPR stage of migration will work the same way as normal migration?

I proposed a single migration stream containing pre-create state that was read early,
but that was rejected as too complex.

I also proposed refactoring initialization so the monitor and migration streams
could be opened earlier, but that was also rejected as too complex and/or not
consistent with a long-term vision for reworking initialization.

> Still2, if we pass some state in CPR it should be a kind of constant. We need a guarantee that it will not change between migration start and source stop.
> 
>>
>>>> So, I decided to go another way, and just migrate everything backend-related including fds through main migration channel. Of course, this requires deep reworking of device initialization in case of incoming migration (but for vhost-user-blk we need it anyway). The feature is in my series "[PATCH 00/33] vhost-user-blk: live-backend local migration" (you are in CC).
>>
>> You did a lot of work in those series!
>> I suspect much less rework of initialization is required if you pass variables in cpr state.
> 
> Not sure. I had to rework initialization anyway, as initialization damaged the connection. And this lead me to idea "if rework anyway, why not to go with one migration channel".
> 
>>
>>>> The success with vhost-user-blk (of-course) make me rethink TAP migration too: try to avoid using additional cpr channel and unusual waiting for QMP interface on target. And, I've just sent an RFC: "[RFC 0/7] virtio-net: live-TAP local migration"
>>
>> Is there a use case for this outside of CPR?
> 
> It just works without CPR.. Will CPR bring more benefit if I enable it in the setup with my local-tap + local-vhost-user-blk capabilities ( + ignore-shared of-course)?
> 
> 
>> CPR is intended to be the "local migration" solution that does it all :)
>> But if you do proceed with your local migration tap solution, I would want
>> to see that CPR could also use your code paths.
>>
> CPR can transparently use my code: you may enable both CPR and local-tap capability and it should work. Some devices will migrate their fds through CPR, TAP fds amd state will migrate through main migration channel. 

OK, I believe that.

I also care about cpr-exec mode.  We use it internally, and I am trying to push
it upstream:
   https://lore.kernel.org/qemu-devel/1755191843-283480-8-git-send-email-steven.sistare@oracle.com/
I believe it would work with your code.  Migrated fds in both the cpr channel and
the main migration channel would be handled differently, as shown in get_fd() and
put_fd() in vmstate-types.c.  The fd is kept open across execv(), and vmstate
represents the fd by its value (e.g., a small integer), rather than as an object
in the unix channel.
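
To make the "fd by its value" idea concrete, here is a minimal sketch under stated assumptions. It does not reproduce get_fd()/put_fd() from vmstate-types.c; the helper names fd_keep_across_exec(), fd_to_text(), and fd_from_text() are hypothetical, and the decimal-text encoding is only for illustration. The point is that once FD_CLOEXEC is cleared, the same small integer still names the same open file after execv(), so only the number needs to be saved.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Clear FD_CLOEXEC so that fd stays open in the program image
 * loaded by execv().  Returns 0 on success, -1 on error. */
int fd_keep_across_exec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC) < 0 ? -1 : 0;
}

/* Save the fd as its integer value (illustrative decimal encoding). */
int fd_to_text(int fd, char *buf, size_t len)
{
    int n;

    assert(buf && len > 0);
    n = snprintf(buf, len, "%d", fd);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Restore: parse the integer and check that it names an open fd in
 * this process.  Returns the fd, or -1 if invalid or not open. */
int fd_from_text(const char *text)
{
    char *end;
    long v;

    errno = 0;
    v = strtol(text, &end, 10);
    if (errno || end == text || *end != '\0' || v < 0 || v > INT_MAX) {
        return -1;
    }
    if (fcntl((int)v, F_GETFD) < 0) {
        return -1;      /* number is valid but no such fd is open */
    }
    return (int)v;
}
```

In the cpr-transfer case the receiver gets a new fd number from the socket, whereas in a scheme like this the number itself is the state, which is presumably why the two modes need different get/put handling.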

> Making both channels to be unix-sockets should not be a considerable overhead I think.
> 
> Why I like my solution more:
> 
> - no additional channel
> - no additional logic in management software (to handle target start with no QMP access until "migrate" command on source)
> - less code to backport (that's personal, of course not an argument for final upstream solution)
> 
> It seems that CPR is simpler to support as we don't need to do deep rework of initialization code.. But in reality, there is a lot of work anyway: TAP, vhost-user-blk cases proves this. You series about vfio are also huge.

TAP is the only case where we can compare both approaches, and the numbers tell
the story:

   TAP initialization refactoring: 277 insertions(+), 308 deletions(-)
   live-TAP local migration:       681 insertions(+), 72 deletions(-)
                         total:    958 insertions(+), 380 deletions(-)

   Live update tap and vhost:      223 insertions(+), 55 deletions(-)

For any given system, if the maintainers accept the larger amount of change,
then local migration is cool (and CPR made it possible by adding fd support
to vmstate+QEMUFile).  But the amount of change is a harder sell.

> What is the benefit of CPR against simple (unix-socket) migration?

CPR supports vfio, iommufd, and pinned memory.  Memory backend objects are
created early, before the main migration stream is read, and squashing
CPR into migration for those cases would require a major change in how
qemu creates objects during live migration.

Hence CPR is the method that works for all types of objects.  The mgmt
layer does not need to support multiple methods of live update, depending
on what devices the VM contains.

- Steve

Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Vladimir Sementsov-Ogievskiy 2 months, 1 week ago
On 02.09.25 18:33, Steven Sistare wrote:
> On 9/1/2025 7:44 AM, Vladimir Sementsov-Ogievskiy wrote:
>> On 29.08.25 22:37, Steven Sistare wrote:
>>> On 8/28/2025 11:48 AM, Steven Sistare wrote:
>>>> On 8/23/2025 5:53 PM, Vladimir Sementsov-Ogievskiy wrote:
>>>>> On 17.07.25 21:39, Steve Sistare wrote:
>>>>>> Tap and vhost devices can be preserved during cpr-transfer using
>>>>>> traditional live migration methods, wherein the management layer
>>>>>> creates new interfaces for the target and fiddles with 'ip link'
>>>>>> to deactivate the old interface and activate the new.
>>>>>>
>>>>>> However, CPR can simply send the file descriptors to new QEMU,
>>>>>> with no special management actions required.  The user enables
>>>>>> this behavior by specifing '-netdev tap,cpr=on'.  The default
>>>>>> is cpr=off.
>>>>>
>>>>> Hi Steve!
>>>>>
>>>>> First, me trying to test the series:
>>>>
>>>> Thank-you Vladimir for all the work you are doing in this area.  I have
>>>> reproduced the "virtio_net_set_queue_pairs: Assertion `!r' failed." bug.
>>>> Let me dig into that before I study the larger questions you pose
>>>> about preserving tap/vhost-user-blk in local migration versus cpr.
>>>
>>> I have reproduced your journey!  I fixed the assertion, the vnet_hdr, and
>>> the blocking fd problems which you allude to.  The attached patch fixes
>>> them, and will be squashed into the series.
>>>
>>> Ben, you also reported the !r assertion failure, so this fix should help
>>> you also.
>>>
>>>>> SOURCE:
>>>>>
>>>>> sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :0 -nodefaults -vga std -qmp stdio -msg timestamp -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on
>>>>>
>>>>> {"execute": "qmp_capabilities"}
>>>>> {"return": {}}
>>>>> {"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
>>>>> {"return": {}}
>>>>> {"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
>>>>> {"return": {}}
>>>>> {"execute": "cont"}
>>>>> {"timestamp": {"seconds": 1755977653, "microseconds": 248749}, "event": "RESUME"}
>>>>> {"return": {}}
>>>>> {"timestamp": {"seconds": 1755977657, "microseconds": 366274}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "vnet.1", "path": "/machine/peripheral/vnet.1/virtio-backend"}}
>>>>> {"execute": "migrate-set-parameters", "arguments": {"mode": "cpr-transfer"}}
>>>>> {"return": {}}
>>>>> {"execute": "migrate", "arguments": {"channels": [{"channel-type": "main", "addr": {"path": "/tmp/migr.sock", "transport": "socket", "type": "unix"}}, {"channel-type": "cpr", "addr": {"path": "/tmp/cpr.sock", "transport": "socket", "type": "unix"}}]}}
>>>>> {"timestamp": {"seconds": 1755977767, "microseconds": 835571}, "event": "STOP"}
>>>>> {"return": {}}
>>>>>
>>>>> TARGET:
>>>>>
>>>>> sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :1 -nodefaults -vga std -qmp stdio -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on -incoming defer -incoming '{"channel-type": "cpr","addr": { "transport": "socket","type": "unix", "path": "/tmp/cpr.sock"}}'
>>>>>
>>>>> <need to wait until "migrate" on source>
>>>>>
>>>>> {"execute": "qmp_capabilities"}
>>>>> {"return": {}}
>>>>> {"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
>>>>> {"return": {}}
>>>>> {"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
>>>>> could not disable queue
>>>>> qemu-system-x86_64: ../hw/net/virtio-net.c:771: virtio_net_set_queue_pairs: Assertion `!r' failed.
>>>>> fish: Job 1, 'sudo build/qemu-system-x86_64 -…' terminated by signal SIGABRT (Abort)
>>>>>
>>>>> So, it crashes on device_add..
>>>>>
>>>>> Second, I've come a long way, backporting you TAP v1 series together with needed parts of CPR and migration channels to QEMU 7.2, fixing different issues (like, avoid reinitialization of vnet_hdr length on target, avoid simultaneous use of tap on source an target, avoid making the fd blocking again on target), and it finally started to work.
>>>>>
>>>>> But next, I went to support similar migration for vhost-user-blk, and that was a lot more complex. No reason to pass an fd in preliminary stage, when source is running (like in CPR), because:
>>>>>
>>>>> 1. we just can't use the fd on target at all, until we stop use it on source, otherwise we just break vhost-user-blk protocol on the wire (unlike TAP, where some ioctls called on target doesn't break source)
>>>>> 2. we have to pass enough additional variables, which are simpler to pass through normal migration channel (how to pass anything except fds through cpr channel?)
>>>
>>> You can pass extra state through the cpr channel.  See for example vmstate_cpr_vfio_device,
>>> and how vmstate_cpr_vfio_devices is defined as a sub-section of vmstate_cpr_state.
>>
>> O, I missed this.
>>
>> Hmm. Still, finally CPR becomes just an additional stage of migration, which is done prior device initialization on target..
>>
>> Didn't you think of integrating it to the common scheme: so that devices may have .vmsd_cpr in addition to .vmsd? This way we don't need a global CPR state, and CPR stage of migration will work the same way as normal migration?
> 
> I proposed a single migration stream containing pre-create state that was read early,
> but that was rejected as too complex.
> 
> I also proposed refactoring initialization so the monitor and migration streams
> could be opened earlier, but again rejected as too complex and/or not consistent with
> a long term vision for reworking initialization.
> 
>> Still2, if we pass some state in CPR it should be a kind of constant. We need a guarantee that it will not change between migration start and source stop.
>>
>>>
>>>>> So, I decided to go another way, and just migrate everything backend-related including fds through main migration channel. Of course, this requires deep reworking of device initialization in case of incoming migration (but for vhost-user-blk we need it anyway). The feature is in my series "[PATCH 00/33] vhost-user-blk: live-backend local migration" (you are in CC).
>>>
>>> You did a lot of work in those series!
>>> I suspect much less rework of initialization is required if you pass variables in cpr state.
>>
>> Not sure. I had to rework initialization anyway, as initialization damaged the connection. And this lead me to idea "if rework anyway, why not to go with one migration channel".
>>
>>>
>>>>> The success with vhost-user-blk (of-course) make me rethink TAP migration too: try to avoid using additional cpr channel and unusual waiting for QMP interface on target. And, I've just sent an RFC: "[RFC 0/7] virtio-net: live-TAP local migration"
>>>
>>> Is there a use case for this outside of CPR?
>>
>> It just works without CPR.. Will CPR bring more benefit if I enable it in the setup with my local-tap + local-vhost-user-blk capabilities ( + ignore-shared of-course)?
>>
>>
>>> CPR is intended to be the "local migration" solution that does it all :)
>>> But if you do proceed with your local migration tap solution, I would want
>>> to see that CPR could also use your code paths.
>>>
>> CPR can transparently use my code: you may enable both CPR and local-tap capability and it should work. Some devices will migrate their fds through CPR, TAP fds amd state will migrate through main migration channel. 
> 
> OK, I believe that.
> 
> I also care about cpr-exec mode.  We use it internally, and I am trying to push
> it upstream:
>    https://lore.kernel.org/qemu-devel/1755191843-283480-8-git-send-email-steven.sistare@oracle.com/
> I believe it would work with your code.  Migrated fd's in both the cpr channel and
> the main migration channel would be handled differently as shown in vmstate-types.c
> get_fd() and put_fd().  The fd is kept open across execv(), and vmstate represents
> the fd by its value (eg a small integer), rather than as an object in the unix channel.

I'm close to publish new version, which will include

> 
>> Making both channels to be unix-sockets should not be a considerable overhead I think.
>>
>> Why I like my solution more:
>>
>> - no additional channel
>> - no additional logic in management software (to handle target start with no QMP access until "migrate" command on source)
>> - less code to backport (that's personal, of course not an argument for final upstream solution)
>>
>> It seems that CPR is simpler to support as we don't need to do deep rework of initialization code.. But in reality, there is a lot of work anyway: TAP, vhost-user-blk cases proves this. You series about vfio are also huge.
> 
> TAP is the only case where we can compare both approaches, and the numbers tell
> the story:
> 
>    TAP initialization refactoring: 277 insertions(+), 308 deletions(-)

Actually, I've done a lot more refactoring than required for TAP local migration, trying to make the whole initialization clearer and more consistent. And I think it's a good base for any modification of the TAP device.

>    live-TAP local migration:       681 insertions(+), 72 deletions(-)

But 369 of those are in the last patch, which is not for commit, and 65 are in the first patch with tracepoints (look at its tap_dump_packet() - thanks to AI, it really helps to debug network problems when you see packet dumps in the QEMU log).
So a more honest estimate is ~250 (681 - 369 - 65 = 247), which is in good accordance with Live update tap.

>                          total:    958 insertions(+), 380 deletions(-)
> 
>    Live update tap and vhost:      223 insertions(+), 55 deletions(-)
> 
> For any given system, if the maintainers accept the larger amount of change,
> then local migration is cool (and CPR made it possible by adding fd support
> to vmstate+QEMUFile)

Yes, native support for fds in the migration API opens the doors :)

>.  But the amount of change is a harder sell.

Yes, that's right. But live-TAP isn't really big, unlike live-vhost-user-blk, unfortunately.

>> What is the benefit of CPR against simple (unix-socket) migration?

> CPR supports vfio, iommufd, and pinned memory.  Memory backend objects are
> created early, before the main migration stream is read, and squashing
> CPR into migration for those cases would require a major change in how
> qemu creates objects during live migration.

Yes, understood: fewer things to change in initialization code = we can cover more things.

For my downstream I need TAP, vhost-user-blk and vfio. So vfio would be the most interesting challenge if I try to make a kind of live-vfio local migration.

- it is already supported by CPR, so it would be really hard to sell 1-2 thousand additional lines of code :) But I'll see, maybe it will not be so much.
- we already have support in downstream, which we've never tried to send. It is based on getting fds from the source and passing them to the target via management software. But of course one day we should sync with upstream.

> 
> Hence CPR is the method that works for all types of objects.  The mgmt
> layer does not need to support multiple methods of live update, depending
> on what devices the VM contains.
> 

So, I'll keep my live migrations from becoming a separate mode: they may be used as part of CPR, and I keep the option of switching to CPR if needed.

Thanks for the detailed answer!

-- 
Best regards,
Vladimir

Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Peter Xu 2 months, 1 week ago
On Tue, Sep 02, 2025 at 08:09:44PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 02.09.25 18:33, Steven Sistare wrote:
> > On 9/1/2025 7:44 AM, Vladimir Sementsov-Ogievskiy wrote:
> > > On 29.08.25 22:37, Steven Sistare wrote:
> > > > On 8/28/2025 11:48 AM, Steven Sistare wrote:
> > > > > On 8/23/2025 5:53 PM, Vladimir Sementsov-Ogievskiy wrote:
> > > > > > On 17.07.25 21:39, Steve Sistare wrote:
> > > > > > > Tap and vhost devices can be preserved during cpr-transfer using
> > > > > > > traditional live migration methods, wherein the management layer
> > > > > > > creates new interfaces for the target and fiddles with 'ip link'
> > > > > > > to deactivate the old interface and activate the new.
> > > > > > > 
> > > > > > > However, CPR can simply send the file descriptors to new QEMU,
> > > > > > > with no special management actions required.  The user enables
> > > > > > > this behavior by specifing '-netdev tap,cpr=on'.  The default
> > > > > > > is cpr=off.
> > > > > > 
> > > > > > Hi Steve!
> > > > > > 
> > > > > > First, me trying to test the series:
> > > > > 
> > > > > Thank-you Vladimir for all the work you are doing in this area.  I have
> > > > > reproduced the "virtio_net_set_queue_pairs: Assertion `!r' failed." bug.
> > > > > Let me dig into that before I study the larger questions you pose
> > > > > about preserving tap/vhost-user-blk in local migration versus cpr.
> > > > 
> > > > I have reproduced your journey!  I fixed the assertion, the vnet_hdr, and
> > > > the blocking fd problems which you allude to.  The attached patch fixes
> > > > them, and will be squashed into the series.
> > > > 
> > > > Ben, you also reported the !r assertion failure, so this fix should help
> > > > you also.
> > > > 
> > > > > > SOURCE:
> > > > > > 
> > > > > > sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :0 -nodefaults -vga std -qmp stdio -msg timestamp -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on
> > > > > > 
> > > > > > {"execute": "qmp_capabilities"}
> > > > > > {"return": {}}
> > > > > > {"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
> > > > > > {"return": {}}
> > > > > > {"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
> > > > > > {"return": {}}
> > > > > > {"execute": "cont"}
> > > > > > {"timestamp": {"seconds": 1755977653, "microseconds": 248749}, "event": "RESUME"}
> > > > > > {"return": {}}
> > > > > > {"timestamp": {"seconds": 1755977657, "microseconds": 366274}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "vnet.1", "path": "/machine/peripheral/vnet.1/virtio-backend"}}
> > > > > > {"execute": "migrate-set-parameters", "arguments": {"mode": "cpr-transfer"}}
> > > > > > {"return": {}}
> > > > > > {"execute": "migrate", "arguments": {"channels": [{"channel-type": "main", "addr": {"path": "/tmp/migr.sock", "transport": "socket", "type": "unix"}}, {"channel-type": "cpr", "addr": {"path": "/tmp/cpr.sock", "transport": "socket", "type": "unix"}}]}}
> > > > > > {"timestamp": {"seconds": 1755977767, "microseconds": 835571}, "event": "STOP"}
> > > > > > {"return": {}}
> > > > > > 
> > > > > > TARGET:
> > > > > > 
> > > > > > > sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :1 -nodefaults -vga std -qmp stdio -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on -incoming defer -incoming '{"channel-type": "cpr","addr": { "transport": "socket","type": "unix", "path": "/tmp/cpr.sock"}}'
> > > > > > 
> > > > > > <need to wait until "migrate" on source>
> > > > > > 
> > > > > > {"execute": "qmp_capabilities"}
> > > > > > {"return": {}}
> > > > > > {"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
> > > > > > {"return": {}}
> > > > > > {"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
> > > > > > could not disable queue
> > > > > > qemu-system-x86_64: ../hw/net/virtio-net.c:771: virtio_net_set_queue_pairs: Assertion `!r' failed.
> > > > > > fish: Job 1, 'sudo build/qemu-system-x86_64 -…' terminated by signal SIGABRT (Abort)
> > > > > > 
> > > > > > So, it crashes on device_add..
> > > > > > 
> > > > > > Second, I've come a long way, backporting you TAP v1 series together with needed parts of CPR and migration channels to QEMU 7.2, fixing different issues (like, avoid reinitialization of vnet_hdr length on target, avoid simultaneous use of tap on source an target, avoid making the fd blocking again on target), and it finally started to work.
> > > > > > 
> > > > > > But next, I went to support similar migration for vhost-user-blk, and that was a lot more complex. No reason to pass an fd in preliminary stage, when source is running (like in CPR), because:
> > > > > > 
> > > > > > 1. we just can't use the fd on target at all, until we stop use it on source, otherwise we just break vhost-user-blk protocol on the wire (unlike TAP, where some ioctls called on target doesn't break source)
> > > > > > 2. we have to pass enough additional variables, which are simpler to pass through normal migration channel (how to pass anything except fds through cpr channel?)
> > > > 
> > > > You can pass extra state through the cpr channel.  See for example vmstate_cpr_vfio_device,
> > > > and how vmstate_cpr_vfio_devices is defined as a sub-section of vmstate_cpr_state.
> > > 
> > > O, I missed this.
> > > 
> > > Hmm. Still, finally CPR becomes just an additional stage of migration, which is done prior device initialization on target..
> > > 
> > > Didn't you think of integrating it to the common scheme: so that devices may have .vmsd_cpr in addition to .vmsd? This way we don't need a global CPR state, and CPR stage of migration will work the same way as normal migration?
> > 
> > I proposed a single migration stream containing pre-create state that was read early,
> > but that was rejected as too complex.
> > 
> > I also proposed refactoring initialization so the monitor and migration streams
> > could be opened earlier, but again rejected as too complex and/or not consistent with
> > a long term vision for reworking initialization.
> > 
> > > Still2, if we pass some state in CPR it should be a kind of constant. We need a guarantee that it will not change between migration start and source stop.
> > > 
> > > > 
> > > > > > So, I decided to go another way, and just migrate everything backend-related including fds through main migration channel. Of course, this requires deep reworking of device initialization in case of incoming migration (but for vhost-user-blk we need it anyway). The feature is in my series "[PATCH 00/33] vhost-user-blk: live-backend local migration" (you are in CC).
> > > > 
> > > > You did a lot of work in those series!
> > > > I suspect much less rework of initialization is required if you pass variables in cpr state.
> > > 
> > > Not sure. I had to rework initialization anyway, as initialization damaged the connection. And this lead me to idea "if rework anyway, why not to go with one migration channel".
> > > 
> > > > 
> > > > > > The success with vhost-user-blk (of-course) make me rethink TAP migration too: try to avoid using additional cpr channel and unusual waiting for QMP interface on target. And, I've just sent an RFC: "[RFC 0/7] virtio-net: live-TAP local migration"
> > > > 
> > > > Is there a use case for this outside of CPR?
> > > 
> > > It just works without CPR.. Will CPR bring more benefit if I enable it in the setup with my local-tap + local-vhost-user-blk capabilities ( + ignore-shared of-course)?
> > > 
> > > 
> > > > CPR is intended to be the "local migration" solution that does it all :)
> > > > But if you do proceed with your local migration tap solution, I would want
> > > > to see that CPR could also use your code paths.
> > > > 
> > > CPR can transparently use my code: you may enable both CPR and
> > > the local-tap capability and it should work. Some devices will migrate
> > > their fds through CPR; TAP fds and state will migrate through the main
> > > migration channel.
> > 
> > OK, I believe that.
> > 
> > I also care about cpr-exec mode.  We use it internally, and I am trying to push
> > it upstream:
> >    https://lore.kernel.org/qemu-devel/1755191843-283480-8-git-send-email-steven.sistare@oracle.com/
> > I believe it would work with your code.  Migrated fd's in both the cpr channel and
> > the main migration channel would be handled differently as shown in vmstate-types.c
> > get_fd() and put_fd().  The fd is kept open across execv(), and vmstate represents
> > the fd by its value (eg a small integer), rather than as an object in the unix channel.
> 
> I'm close to publish new version, which will include
> 
> > 
> > > Making both channels unix sockets should not add considerable overhead, I think.
> > > 
> > > Why I like my solution more:
> > > 
> > > - no additional channel
> > > - no additional logic in management software (to handle target start with no QMP access until "migrate" command on source)
> > > - less code to backport (that's personal, of course not an argument for final upstream solution)
> > > 
> > > It seems that CPR is simpler to support, as we don't need to do a deep rework of the initialization code.. But in reality, there is a lot of work anyway: the TAP and vhost-user-blk cases prove this. Your vfio series are also huge.
> > 
> > TAP is the only case where we can compare both approaches, and the numbers tell
> > the story:
> > 
> >    TAP initialization refactoring: 277 insertions(+), 308 deletions(-)
> 
> Actually, I've done a lot more refactoring than is required for TAP local migration, trying to make the whole initialization clearer and more consistent. And it's a good base for any modification of the TAP device, I think.
> 
> >    live-TAP local migration:       681 insertions(+), 72 deletions(-)
> 
> But 369 of those lines are the last patch, which is not for commit, and 65 are the first patch with tracepoints (look at its tap_dump_packet() - thanks to AI, it really helps to debug network problems when you see packet dumps in the QEMU log).
> So a more honest estimate is ~250, which is in good agreement with Live update tap.
> 
> >                          total:    958 insertions(+), 380 deletions(-)
> > 
> >    Live update tap and vhost:      223 insertions(+), 55 deletions(-)
> > 
> > For any given system, if the maintainers accept the larger amount of change,
> > then local migration is cool (and CPR made it possible by adding fd support
> > to vmstate+QEMUFile)
> 
> Yes, native support for fds in migration API opens the doors:)
> 
> > .  But the amount of change is a harder sell.
> 
> Yes, that's right. But live-TAP isn't really big. Unlike live-vhost-user-blk unfortunately.
> 
> > > What is the benefit of CPR against simple (unix-socket) migration?
> 
> > CPR supports vfio, iommufd, and pinned memory.  Memory backend objects are
> > created early, before the main migration stream is read, and squashing
> > CPR into migration for those cases would require a major change in how
> > qemu creates objects during live migration.
> 
> Yes, understood: fewer things to change in the initialization code = we can cover more things..
> 
> For my downstream I need TAP, vhost-user-blk and vfio. So vfio would be the most interesting challenge, if I try to make a kind of live-vfio local migration.
> 
> - it is already supported by CPR, so it would be really hard to sell 1-2 thousand additional lines of code) But I'll see, maybe it will not be so much.
> - we already have support in downstream, which we've never tried to send. It is based on getting fds from the source and passing them to the target via management software.. But of course one day we should sync with upstream.

Sorry to jump in so late.  I just want to say that using LOC to compare
the solutions above is not fair, IMHO: we could have a hack that is a
single line, but maintaining it can be a nightmare.

PS: totally not saying that CPR is hackish! :-)

I didn't read any of the new code, so I apologize if I say something
stupid, but.. if we have a cleaner way to do all of this, and if it can
happen in one channel, that sounds ideal.

IIUC then we can avoid the is_cpr_incoming() checks all over the place -
frankly, that part is a pure hack.  It's extremely hard to maintain
long term, IMO.

I wish each device could opt in to provide its own model, so that it is
prepared to boot QEMU without the FDs being there and to pause itself at
that stage if a load is going to happen.  If all of this is possible for all
the device emulations that we care about, it'll be perfect, IMHO.  Such a
refactoring would be worth more LOC (even more so if there are benefits
besides migration; I don't know the device code, but I feel there are).

So I wish more of Vladimir's work would land, if my understanding is correct,
and if it can completely replace the early channel some day (when every
device's FDs can be migrated via the main channel - is that possible?).
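
For readers unfamiliar with how an fd can travel through a unix-socket
channel at all, here is a minimal sketch in Python (not QEMU code) using the
standard SCM_RIGHTS mechanism.  The helper names send_fd(), recv_fd() and
demo(), and the b"tap0" tag, are hypothetical; a pipe stands in for a device
fd, and a socketpair stands in for the migration channel.

```python
import os
import socket


def send_fd(sock, fd, tag):
    # Hypothetical helper: send one fd plus a small tag via SCM_RIGHTS.
    socket.send_fds(sock, [tag], [fd])


def recv_fd(sock):
    # Hypothetical helper: receive the tag and a new fd number that refers
    # to the same open file description as the sender's fd.
    tag, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return tag, fds[0]


def demo():
    source, target = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    rd, wr = os.pipe()            # the pipe stands in for a device fd
    try:
        send_fd(source, wr, b"tap0")
        os.close(wr)              # "source" drops its copy after sending
        tag, wr2 = recv_fd(target)
        os.write(wr2, b"hello")   # "target" uses the fd it received
        os.close(wr2)
        return tag, os.read(rd, 5)
    finally:
        os.close(rd)
        source.close()
        target.close()
```

The receiver gets its own fd number, but it refers to the same underlying
open file, which is why no reopening or reconfiguration is needed on the
receiving side in this toy setup.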

-- 
Peter Xu


Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Vladimir Sementsov-Ogievskiy 2 months, 1 week ago
On 05.09.25 19:16, Peter Xu wrote:
> 
> Sorry to jump in so late.  I just want to say that using LOC to compare
> the solutions above is not fair, IMHO: we could have a hack that is a
> single line, but maintaining it can be a nightmare.

Of course )

> 
> PS: totally not saying that CPR is hackish! :-)
> 
> I didn't read any of the new code, so I apologize if I say something
> stupid, but.. if we have a cleaner way to do all of this, and if it can
> happen in one channel, that sounds ideal.

I believe that's possible. At least it works for vhost-user-blk and TAP.
Still, such an approach may require more work to refactor the initialization code.

> 
> IIUC then we can avoid the is_cpr_incoming() checks all over the place -
> frankly, that part is a pure hack.  It's extremely hard to maintain
> long term, IMO.

Unfortunately, my solution comes with similar checks here and there (for enabled capabilities, etc.).

That's because we have to distinguish the usual initialization/start of a device from
incoming migration, where we should postpone part of the initialization code until the
post-load stage of migration.

> 
> I wish each device could opt in to provide its own model, so that it is
> prepared to boot QEMU without the FDs being there and to pause itself at
> that stage if a load is going to happen.

So you suggest postponing the initialization until "start" even for a "normal start"
of QEMU, to avoid these endless "if (we are in our special local-incoming/CPR mode)" checks.

Actually, that's how normal migratable devices live: we currently don't have "if (incoming)" for
every step of initialization/start.

I'll see whether I can apply the concept to the TAP local migration series.
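
A minimal sketch of that concept, in Python rather than QEMU C, and with
entirely hypothetical names (Backend, realize(), post_load(), start(),
open_fd are illustrations, not QEMU APIs): everything that needs the fd is
postponed to start(), so realize() is the same for a fresh start and for an
incoming migration, and no "are we incoming?" mode check is needed.

```python
class Backend:
    """Toy device backend; every name here is hypothetical."""

    def __init__(self):
        self.fd = None
        self.log = []

    def realize(self):
        # Nothing here touches the fd, so this step is identical for a
        # fresh start and for an incoming migration.
        self.log.append("realize")

    def post_load(self, fd):
        # Runs only when a migration stream delivered state: adopt its fd.
        self.fd = fd
        self.log.append("post_load")

    def start(self, open_fd):
        # Common path: open a new fd only if none was migrated in.
        if self.fd is None:
            self.fd = open_fd()
            self.log.append("open")
        self.log.append("start")


def run(incoming_fd=None):
    dev = Backend()
    dev.realize()
    if incoming_fd is not None:
        dev.post_load(incoming_fd)
    dev.start(lambda: 7)      # 7 is a made-up "freshly opened" fd number
    return dev.fd, dev.log
```

The only remaining condition tests device state ("do I already have an fd?")
rather than a global migration mode, which is the distinction the paragraph
above is after.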

> If all of this is possible for all the device
> emulations that we care about, it'll be perfect, IMHO.  Such a refactoring
> would be worth more LOC (even more so if there are benefits besides
> migration; I don't know the device code, but I feel there are).
> 
> So I wish more of Vladimir's work would land, if my understanding is correct,
> and if it can completely replace the early channel some day (when every
> device's FDs can be migrated via the main channel - is that possible?).
> 

Thanks!

-- 
Best regards,
Vladimir

Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Vladimir Sementsov-Ogievskiy 2 months ago
On 08.09.25 12:55, Vladimir Sementsov-Ogievskiy wrote:
> On 05.09.25 19:16, Peter Xu wrote:
>> On Tue, Sep 02, 2025 at 08:09:44PM +0300, Vladimir Sementsov-Ogievskiy wrote:
>>> On 02.09.25 18:33, Steven Sistare wrote:
>>>> On 9/1/2025 7:44 AM, Vladimir Sementsov-Ogievskiy wrote:
>>>>> On 29.08.25 22:37, Steven Sistare wrote:
>>>>>> On 8/28/2025 11:48 AM, Steven Sistare wrote:
>>>>>>> On 8/23/2025 5:53 PM, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>>> On 17.07.25 21:39, Steve Sistare wrote:
>>>>>>>>> Tap and vhost devices can be preserved during cpr-transfer using
>>>>>>>>> traditional live migration methods, wherein the management layer
>>>>>>>>> creates new interfaces for the target and fiddles with 'ip link'
>>>>>>>>> to deactivate the old interface and activate the new.
>>>>>>>>>
>>>>>>>>> However, CPR can simply send the file descriptors to new QEMU,
>>>>>>>>> with no special management actions required.  The user enables
>>>>>>>>> this behavior by specifing '-netdev tap,cpr=on'.  The default
>>>>>>>>> is cpr=off.
>>>>>>>>
>>>>>>>> Hi Steve!
>>>>>>>>
>>>>>>>> First, me trying to test the series:
>>>>>>>
>>>>>>> Thank-you Vladimir for all the work you are doing in this area.  I have
>>>>>>> reproduced the "virtio_net_set_queue_pairs: Assertion `!r' failed." bug.
>>>>>>> Let me dig into that before I study the larger questions you pose
>>>>>>> about preserving tap/vhost-user-blk in local migration versus cpr.
>>>>>>
>>>>>> I have reproduced your journey!  I fixed the assertion, the vnet_hdr, and
>>>>>> the blocking fd problems which you allude to.  The attached patch fixes
>>>>>> them, and will be squashed into the series.
>>>>>>
>>>>>> Ben, you also reported the !r assertion failure, so this fix should help
>>>>>> you also.
>>>>>>
>>>>>>>> SOURCE:
>>>>>>>>
>>>>>>>> sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :0 -nodefaults -vga std -qmp stdio -msg timestamp -S -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on
>>>>>>>>
>>>>>>>> {"execute": "qmp_capabilities"}
>>>>>>>> {"return": {}}
>>>>>>>> {"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
>>>>>>>> {"return": {}}
>>>>>>>> {"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
>>>>>>>> {"return": {}}
>>>>>>>> {"execute": "cont"}
>>>>>>>> {"timestamp": {"seconds": 1755977653, "microseconds": 248749}, "event": "RESUME"}
>>>>>>>> {"return": {}}
>>>>>>>> {"timestamp": {"seconds": 1755977657, "microseconds": 366274}, "event": "NIC_RX_FILTER_CHANGED", "data": {"name": "vnet.1", "path": "/machine/peripheral/vnet.1/virtio-backend"}}
>>>>>>>> {"execute": "migrate-set-parameters", "arguments": {"mode": "cpr-transfer"}}
>>>>>>>> {"return": {}}
>>>>>>>> {"execute": "migrate", "arguments": {"channels": [{"channel-type": "main", "addr": {"path": "/tmp/migr.sock", "transport": "socket", "type": "unix"}}, {"channel-type": "cpr", "addr": {"path": "/tmp/cpr.sock", "transport": "socket", "type": "unix"}}]}}
>>>>>>>> {"timestamp": {"seconds": 1755977767, "microseconds": 835571}, "event": "STOP"}
>>>>>>>> {"return": {}}
>>>>>>>>
>>>>>>>> TARGET:
>>>>>>>>
>>>>>>>> sudo build/qemu-system-x86_64 -display none -vga none -device pxb-pcie,bus_nr=128,bus=pcie.0,id=pcie.1 -device pcie-root-port,id=s0,slot=0,bus=pcie.1 -device pcie-root-port,id=s1,slot=1,bus=pcie.1 -device pcie-root-port,id=s2,slot=2,bus=pcie.1 -hda /home/vsementsov/work/vms/newfocal.raw -m 4G -enable-kvm -M q35 -vnc :1 -nodefaults -vga std -qmp stdio -S -object memory-backend-file,id=ram0,size=4G,mem-p
>>>>>>>> ath=/dev/shm/ram0,share=on -machine memory-backend=ram0 -machine aux-ram-share=on -incoming defer -incoming '{"channel-type": "cpr","addr": { "transport": "socket","type": "unix", "path": "/tmp/cpr.sock"}}'
>>>>>>>>
>>>>>>>> <need to wait until "migrate" on source>
>>>>>>>>
>>>>>>>> {"execute": "qmp_capabilities"}
>>>>>>>> {"return": {}}
>>>>>>>> {"execute": "netdev_add", "arguments": {"cpr": true, "script": "no", "downscript": "no", "vhostforce": false, "vhost": false, "queues": 4, "ifname": "tap0", "type": "tap", "id": "netdev.1"}}
>>>>>>>> {"return": {}}
>>>>>>>> {"execute": "device_add", "arguments": {"disable-legacy": "off", "bus": "s1", "netdev": "netdev.1", "driver": "virtio-net-pci", "vectors": 18, "mq": true, "romfile": "", "mac": "d6:0d:75:f8:0f:b7", "id": "vnet.1"}}
>>>>>>>> could not disable queue
>>>>>>>> qemu-system-x86_64: ../hw/net/virtio-net.c:771: virtio_net_set_queue_pairs: Assertion `!r' failed.
>>>>>>>> fish: Job 1, 'sudo build/qemu-system-x86_64 -…' terminated by signal SIGABRT (Abort)
>>>>>>>>
>>>>>>>> So, it crashes on device_add..
>>>>>>>>
>>>>>>>> Second, I've come a long way, backporting you TAP v1 series together with needed parts of CPR and migration channels to QEMU 7.2, fixing different issues (like, avoid reinitialization of vnet_hdr length on target, avoid simultaneous use of tap on source an target, avoid making the fd blocking again on target), and it finally started to work.
>>>>>>>>
>>>>>>>> But next, I went to support similar migration for vhost-user-blk, and that was a lot more complex. No reason to pass an fd in preliminary stage, when source is running (like in CPR), because:
>>>>>>>>
>>>>>>>> 1. we just can't use the fd on target at all, until we stop use it on source, otherwise we just break vhost-user-blk protocol on the wire (unlike TAP, where some ioctls called on target doesn't break source)
>>>>>>>> 2. we have to pass enough additional variables, which are simpler to pass through normal migration channel (how to pass anything except fds through cpr channel?)
>>>>>>
>>>>>> You can pass extra state through the cpr channel.  See for example vmstate_cpr_vfio_device,
>>>>>> and how vmstate_cpr_vfio_devices is defined as a sub-section of vmstate_cpr_state.
>>>>>
>>>>> O, I missed this.
>>>>>
>>>>> Hmm. Still, finally CPR becomes just an additional stage of migration, which is done prior device initialization on target..
>>>>>
>>>>> Didn't you think of integrating it to the common scheme: so that devices may have .vmsd_cpr in addition to .vmsd? This way we don't need a global CPR state, and CPR stage of migration will work the same way as normal migration?
>>>>
>>>> I proposed a single migration stream containing pre-create state that was read early,
>>>> but that was rejected as too complex.
>>>>
>>>> I also proposed refactoring initialization so the monitor and migration streams
>>>> could be opened earlier, but again rejected as too complex and/or not consistent with
>>>> a long term vision for reworking initialization.
>>>>
>>>>> Still2, if we pass some state in CPR it should be a kind of constant. We need a guarantee that it will not change between migration start and source stop.
>>>>>
>>>>>>
>>>>>>>> So, I decided to go another way, and just migrate everything backend-related including fds through main migration channel. Of course, this requires deep reworking of device initialization in case of incoming migration (but for vhost-user-blk we need it anyway). The feature is in my series "[PATCH 00/33] vhost-user-blk: live-backend local migration" (you are in CC).
>>>>>>
>>>>>> You did a lot of work in those series!
>>>>>> I suspect much less rework of initialization is required if you pass variables in cpr state.
>>>>>
>>>>> Not sure. I had to rework initialization anyway, as initialization damaged the connection. And this lead me to idea "if rework anyway, why not to go with one migration channel".
>>>>>
>>>>>>
>>>>>>>> The success with vhost-user-blk (of-course) make me rethink TAP migration too: try to avoid using additional cpr channel and unusual waiting for QMP interface on target. And, I've just sent an RFC: "[RFC 0/7] virtio-net: live-TAP local migration"
>>>>>>
>>>>>> Is there a use case for this outside of CPR?
>>>>>
>>>>> It just works without CPR.. Will CPR bring more benefit if I enable it in the setup with my local-tap + local-vhost-user-blk capabilities ( + ignore-shared of-course)?
>>>>>
>>>>>
>>>>>> CPR is intended to be the "local migration" solution that does it all :)
>>>>>> But if you do proceed with your local migration tap solution, I would want
>>>>>> to see that CPR could also use your code paths.
>>>>>>
>>>>> CPR can transparently use my code: you may enable both CPR and
>>>>> local-tap capability and it should work. Some devices will migrate
>>>>> their fds through CPR, TAP fds amd state will migrate through main
>>>>> migration channel.
>>>>
>>>> OK, I believe that.
>>>>
>>>> I also care about cpr-exec mode.  We use it internally, and I am trying to push
>>>> it upstream:
>>>>     https://lore.kernel.org/qemu-devel/1755191843-283480-8-git-send-email-steven.sistare@oracle.com/
>>>> I believe it would work with your code.  Migrated fd's in both the cpr channel and
>>>> the main migration channel would be handled differently as shown in vmstate-types.c
>>>> get_fd() and put_fd().  The fd is kept open across execv(), and vmstate represents
>>>> the fd by its value (eg a small integer), rather than as an object in the unix channel.
>>>
>>> I'm close to publish new version, which will include
>>>
>>>>
>>>>> Making both channels to be unix-sockets should not be a considerable overhead I think.
>>>>>
>>>>> Why I like my solution more:
>>>>>
>>>>> - no additional channel
>>>>> - no additional logic in management software (to handle target start with no QMP access until "migrate" command on source)
>>>>> - less code to backport (that's personal, of course not an argument for final upstream solution)
>>>>>
>>>>> It seems that CPR is simpler to support as we don't need to do deep rework of initialization code.. But in reality, there is a lot of work anyway: TAP, vhost-user-blk cases proves this. You series about vfio are also huge.
>>>>
>>>> TAP is the only case where we can compare both approaches, and the numbers tell
>>>> the story:
>>>>
>>>>     TAP initialization refactoring: 277 insertions(+), 308 deletions(-)
>>>
>>> Actually, I've done a lot more refactoring than required for TAP local migration, trying to make the whole initialization more clear and consistent. And it's a good base for any modification of TAP device I think.
>>>
>>>>     live-TAP local migration:       681 insertions(+), 72 deletions(-)
>>>
>>> But 369 is last patch which is not for commit, and 65 a first patch with tracepoints (look at it tap_dump_packet() - thanks to AI, really helps to debug network problems, when you see packet dumps in QEMU log)
>>> So, more honest estimate is ~250, which is in good accordance with Live update tap.
>>>
>>>>                           total:    958 insertions(+), 380 deletions(-)
>>>>
>>>>     Live update tap and vhost:      223 insertions(+), 55 deletions(-)
>>>>
>>>> For any given system, if the maintainers accept the larger amount of change,
>>>> then local migration is cool (and CPR made it possible by adding fd support
>>>> to vmstate+QEMUFile)
>>>
>>> Yes, native support for fds in migration API opens the doors:)
>>>
>>>> .  But the amount of change is a harder sell.
>>>
>>> Yes, that's right. But live-TAP isn't really big. Unlike live-vhost-user-blk unfortunately.
>>>
>>>>> What is the benefit of CPR against simple (unix-socket) migration?
>>>
>>>> CPR supports vfio, iommufd, and pinned memory.  Memory backend objects are
>>>> created early, before the main migration stream is read, and squashing
>>>> CPR into migration for those cases would require a major change in how
>>>> qemu creates objects during live migration.
>>>
>>> Yes, understand: less things to change in initialization code = we can cover more things..
>>>
>>> For my downstream I need TAP, vhost-user-blk and vfio. So vfio would be the most interesting challenge, if I try to make a kind of live-vfio local migration.
>>>
>>> - it is already supported by CPR, so it would be really hard to sell 1-2 thousand additional lines of code) But I'll see, maybe it will not be so much.
>>> - we already have support in downstream, which we've never tried to send. It is based on getting fds from the source and passing them to the target by management software.. But of course one day we should sync with upstream.
>>
>> Sorry to jump in as late.  Just want to say that using LOCs to compare
>> solutions is not fair above, IMHO: we could have hacks that is a single
>> liner, but maintaining those can be nightmare.
> 
> Of-course )
> 
>>
>> PS: totally not saying that CPR is hackish! :-)
>>
>> I didn't read any new code at all, I apologize if I would say stupid
>> things, but.. if we have cleaner way to do all of these, and if that can
>> happen in one channel that sounds ideal.
> 
> I believe that's possible. At least it works for vhost-user-blk and TAP.
> Still such approach may require more work to refactor initialization code.
> 
>>
>> IIUC then we can save the is_cpr_incoming() checks all over the places -
>> frankly, that's part of pure hack.  It's extremely hard to maintain
>> longterm, IMO.
> 
> Unfortunately, my solution comes with similar checks here and there (for enabled capabilities, etc.).
> 
> That's because we have to distinguish usual initialization/starting of device vs
> incoming migration, when we should postpone some part of initialization code up to
> post-load of migration.
> 
>>
>> I wished devices could opt-in to provide its own model so that it is
>> prepared to boot the QEMU without FDs being there and pause itself at that
>> stage if a load would happen. 
> 
> So, you suggest to postpone the initialization up to "start" even for "normal start"
> of QEMU, to avoid these endless "if (we are in our special local-incoming/CPR mode)".
> 
> Actually, that's how normal migratable devices live: we don't have "if (incoming)" for
> every step of initialization/start currently.
> 
> I'll see, could I apply the concept to TAP local migration series.


Hmm, not so simple.

OK, my current series behaves like this:

init:  if tap.local_incoming then do nothing else open(/dev/net/tun)

incoming migration: get fd, and continue initialization
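
The two steps above can be modelled with a small C sketch. This is only an illustration of the control flow, not the code from the series: TapModel, fake_open_tun(), tap_model_init() and tap_model_incoming_fd() are hypothetical names, and the open of /dev/net/tun is replaced by a stub that returns a fake descriptor.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a TAP backend; not QEMU's real TAP state. */
typedef struct TapModel {
    bool local_incoming;   /* per-device option discussed in the thread */
    int fd;                /* -1 until a descriptor is available */
    bool initialized;      /* true once fd-dependent setup has run */
} TapModel;

/* Stand-in for open("/dev/net/tun"): returns a fake descriptor. */
static int fake_open_tun(void)
{
    return 100;
}

/* The part of initialization that needs a valid fd. */
static void tap_model_continue_init(TapModel *t)
{
    t->initialized = true;
}

/* init: if local_incoming then do nothing, else open and initialize. */
static int tap_model_init(TapModel *t, bool local_incoming)
{
    t->local_incoming = local_incoming;
    t->fd = -1;
    t->initialized = false;

    if (local_incoming) {
        return 0;          /* the fd will arrive with the migration */
    }
    t->fd = fake_open_tun();
    if (t->fd < 0) {
        return -1;         /* error is reported early, at init time */
    }
    tap_model_continue_init(t);
    return 0;
}

/* incoming migration: take the received fd and finish initialization. */
static int tap_model_incoming_fd(TapModel *t, int fd)
{
    if (!t->local_incoming || fd < 0) {
        return -1;         /* device was not set up to receive an fd */
    }
    t->fd = fd;
    tap_model_continue_init(t);
    return 0;
}
```

In this model a normally started device is fully initialized after init, while a local-incoming device stays idle until the fd arrives.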


Assume we want to avoid the extra "if"s and just postpone the initialization to the VM start point, like:

init: do nothing. set fd=-1

incoming migration: get fd (if cap-fd-passing enabled)

start: if fd==-1 then open(); continue initialization


But that means that we postpone possible errors up to start as well, when we can no longer roll back the
migration..
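
A small C model of this "postpone everything to start" variant shows where the error surfaces. It is a sketch under assumptions: LazyTap, fake_open_tun() and the lazy_tap_*() helpers are hypothetical, and the open_fails flag merely simulates an open() failure such as a missing permission.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model; the names do not come from QEMU. */
typedef struct LazyTap {
    int fd;               /* -1 means "no descriptor yet" */
    bool initialized;
} LazyTap;

/* Stand-in for open(); `fail` simulates an open() error. */
static int fake_open_tun(bool fail)
{
    return fail ? -1 : 100;
}

/* init: do nothing, set fd = -1. Note that it cannot fail. */
static void lazy_tap_init(LazyTap *t)
{
    t->fd = -1;
    t->initialized = false;
}

/* incoming migration: take the fd only if fd passing is enabled. */
static void lazy_tap_incoming(LazyTap *t, int fd, bool cap_fd_passing)
{
    if (cap_fd_passing) {
        t->fd = fd;
    }
}

/* start: open() if no fd was passed, then continue initialization. */
static int lazy_tap_start(LazyTap *t, bool open_fails)
{
    if (t->fd == -1) {
        t->fd = fake_open_tun(open_fails);
        if (t->fd < 0) {
            /* First place an open() error can show up: at VM start. */
            return -1;
        }
    }
    t->initialized = true;
    return 0;
}
```

With fd passing the start step never calls open(), but without it a failing open() is only seen at start, which is the point made above about not being able to roll back.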


Alternatively, we can postpone open() to post-load. But what about a normal start of the VM?

init: if INMIGRATE then do nothing, else open()

incoming: get fd (if cap-fd-passing)

post-load: if fd==-1 then open(); continue initialization

start: if fd is still -1, open(), continue initialization

That avoids the extra tap.local_incoming option, but:

- seems even more complicated
- open() and some initialization are done during downtime when cap-fd-passing is not enabled


So now I think that my current approach, with an additional per-device "local-incoming" option, is better.

What do you think?


Probably I'm trying to optimize the wrong "if", as "if local-incoming .." in the generic layer is a lot
more expensive than checking the options in device code.

But the idea is generic: for non-fd migration, we do as much initialization at start as possible,
to get early errors and to decrease further downtime. For fd migration, we postpone fd-initialization
up to the post-load stage. So, we have "if"s in device code to handle it, and we have "if"s in generic
code to support a device which doesn't yet have a fully initialized backend (no fds during init).



> 
>> If all such is possible for all device
>> emulations that we would care, it'll be perfect, IMHO.  More LOCs would
>> deserve such refactoring (and if there're even more benefits besides
>> migration, which I don't know about device code but I feel so).
>>
>> So I wished more of Vladimir's work land, if my understanding is correct,
>> and if that can competely replace the early channel some day (when every
>> device FDs will be able to be migrated via main channel - is it possible)?
>>
> 
> Thanks!
> 


-- 
Best regards,
Vladimir

Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Peter Xu 2 months ago
On Wed, Sep 10, 2025 at 12:35:10AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> > > I wished devices could opt-in to provide its own model so that it is
> > > prepared to boot the QEMU without FDs being there and pause itself at that
> > > stage if a load would happen.
> > 
> > So, you suggest to postpone the initialization up to "start" even for "normal start"
> > of QEMU, to avoid these endless "if (we are in our special local-incoming/CPR mode)".
> > 
> > Actually, that's how normal migratable devices live: we don't have "if (incoming)" for
> > every step of initialization/start currently.
> > 
> > I'll see, could I apply the concept to TAP local migration series.
> 
> 
> Hmm, not so simple.
> 
> OK, my current series behave like this:
> 
> init:  if tap.local_incoming then do nothing else open(/dev/net/tun)
> 
> incoming migration: get fd, and continue initialization
> 
> 
> Assume, we want to avoid extra "if"s, and just postpone the initialization to vm start point, like
> 
> init: do nothing. set fd=-1
> 
> incmoing migration: get fd (if cap-fd-passing enabled)
> 
> start: open(), if fd==-1, continue initialization
> 
> 
> But that mean that we postpone possible errors up to start as well, when we cannot rollback the
> migration..

Yep, doesn't sound like a good idea.  We also don't want to slow down VM
starts.

> 
> 
> Alternatively, we can postpone open() to post-load.. But what for normal start of vm?
> 
> init: if INMIGRATE then do nothing, else open()
> 
> incoming: get fd (if cap-fd-passing)
> 
> post-load: open(), if fd==-1, continue initialization
> 
> start: if fd is still -1, open(), continue initialization
> 
> that avoids extra tap.local_incoming option, but:
> 
> - seems even more complicated
> - open() and some initialization is done in downtime, when we don't enable cap-fd-passing
> 
> 
> So, now I think, that my current approach with additional "local-incoming" per-device option is better.
> 
> What do you think?
> 
> 
> Probably I'm trying to optimize wrong "if". As "if local-incomging .." in generic layer is a lot
> more expensive than checking the options in device code.
> 
> But the idea is generic: for non-fd migration, we do as much initialization at start as possible,

AFAIU, non-fd migration works simply because the portion that the VMSD
loads is always overwritable.  When it's not, a pre_load() or
post_load() would make it work.

> to get early errors and to decrease further downtime. For fd migration, we postpone fd-initialization
> up to post-load stage. So, we have "if"s in device code to handle it, and we have "if"s in generic
> code to support device, which doesn't still have fully initialized backend (no fds during init).

What I meant is, IMHO we should try not to use things like
cpr_is_incoming() too deep in the device stack, and we should use it as
infrequently as possible.

In many cases, IIUC it's because the current device emulation code does not
yet separate the FD installation (and whatever else is relevant
to the FD) from the realize() process.  Hence a quick way to make it work
is to add cpr_is_incoming() or similar helpers, either to skip some step
or to do something different with an existing FD.

If device emulation could be prepared for that, then in an ideal world,
and just to show what I am thinking, it could be:

  - realize()
    - realize_frontend()
    - if migration is incoming, and backend should be postponed (for fd
      loading, or maybe something else)?
      - ... realize_backend() postponed until post_load()...
    - else
      - realize_backend()

If all devices supported such a split of the realize() process
vs. FDs / backends, _maybe_ we could remove all the cpr_is_incoming() checks
and instead move the check higher and higher, up to the qdev code, like:

device_set_realized():
        if (migration_incoming_XXX() && dc->realize_prepare) {
            /*
             * This is only part of realize(), rest done in a separate VMSD
             * post_load().
             */
            dc->realize_prepare(dev, &local_err);
            if (local_err != NULL) {
                goto fail;
            }
        } else if (dc->realize) {
            dc->realize(dev, &local_err);
            if (local_err != NULL) {
                goto fail;
            }
        }

In general, the "is this an incoming fd migration" concept would be passed
down from higher up the stack, rather than randomly checked very deep in
the stack.  That should IMHO make the code more maintainable.
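
To illustrate the dispatch idea, here is a self-contained C model. It is a sketch under stated assumptions, not qdev code: DevModel, DevClassModel, model_set_realized() and the demo_*() callbacks are hypothetical, and the incoming_fd_migration parameter stands in for the unnamed migration_incoming_XXX() check above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct DevModel DevModel;

/* Hypothetical class with an optional "partial realize" hook. */
typedef struct DevClassModel {
    int (*realize)(DevModel *dev);          /* frontend + backend */
    int (*realize_prepare)(DevModel *dev);  /* frontend only */
} DevClassModel;

struct DevModel {
    const DevClassModel *dc;
    bool frontend_ready;
    bool backend_ready;
    bool realized;
};

static int demo_realize_frontend(DevModel *d)
{
    d->frontend_ready = true;
    return 0;
}

static int demo_realize_backend(DevModel *d)
{
    d->backend_ready = true;   /* in real code: install the fds */
    return 0;
}

static int demo_realize(DevModel *d)
{
    int ret = demo_realize_frontend(d);
    return ret ? ret : demo_realize_backend(d);
}

/* post_load: finish the part that realize_prepare() skipped. */
static int demo_post_load(DevModel *d)
{
    return d->backend_ready ? 0 : demo_realize_backend(d);
}

static const DevClassModel demo_class = {
    .realize = demo_realize,
    .realize_prepare = demo_realize_frontend,
};

/* The only place that asks "is this an incoming fd migration?". */
static int model_set_realized(DevModel *d, bool incoming_fd_migration)
{
    int ret = 0;

    if (incoming_fd_migration && d->dc->realize_prepare) {
        ret = d->dc->realize_prepare(d);
    } else if (d->dc->realize) {
        ret = d->dc->realize(d);
    }
    if (ret == 0) {
        d->realized = true;
    }
    return ret;
}
```

The device callbacks themselves contain no "incoming" checks; the decision is made once in the generic layer, and the backend is completed later from post_load.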

But that's only my two cents.. so please take that with a grain of salt.  I
don't really know the device code well enough to say.

Thanks,

-- 
Peter Xu
Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Vladimir Sementsov-Ogievskiy 2 months ago
On 10.09.25 19:58, Peter Xu wrote:
> On Wed, Sep 10, 2025 at 12:35:10AM +0300, Vladimir Sementsov-Ogievskiy wrote:
>>>> I wished devices could opt-in to provide its own model so that it is
>>>> prepared to boot the QEMU without FDs being there and pause itself at that
>>>> stage if a load would happen.
>>>
>>> So, you suggest to postpone the initialization up to "start" even for "normal start"
>>> of QEMU, to avoid these endless "if (we are in our special local-incoming/CPR mode)".
>>>
>>> Actually, that's how normal migratable devices live: we don't have "if (incoming)" for
>>> every step of initialization/start currently.
>>>
>>> I'll see, could I apply the concept to TAP local migration series.
>>
>>
>> Hmm, not so simple.
>>
>> OK, my current series behave like this:
>>
>> init:  if tap.local_incoming then do nothing else open(/dev/net/tun)
>>
>> incoming migration: get fd, and continue initialization
>>
>>
>> Assume, we want to avoid extra "if"s, and just postpone the initialization to vm start point, like
>>
>> init: do nothing. set fd=-1
>>
>> incmoing migration: get fd (if cap-fd-passing enabled)
>>
>> start: open(), if fd==-1, continue initialization
>>
>>
>> But that mean that we postpone possible errors up to start as well, when we cannot rollback the
>> migration..
> 
> Yep, doesn't sound like a good idea.  We also don't want to slow down VM
> starts.
> 
>>
>>
>> Alternatively, we can postpone open() to post-load.. But what for normal start of vm?
>>
>> init: if INMIGRATE then do nothing, else open()
>>
>> incoming: get fd (if cap-fd-passing)
>>
>> post-load: open(), if fd==-1, continue initialization
>>
>> start: if fd is still -1, open(), continue initialization
>>
>> that avoids extra tap.local_incoming option, but:
>>
>> - seems even more complicated
>> - open() and some initialization is done in downtime, when we don't enable cap-fd-passing
>>
>>
>> So, now I think, that my current approach with additional "local-incoming" per-device option is better.
>>
>> What do you think?
>>
>>
>> Probably I'm trying to optimize wrong "if". As "if local-incomging .." in generic layer is a lot
>> more expensive than checking the options in device code.
>>
>> But the idea is generic: for non-fd migration, we do as much initialization at start as possible,
> 
> AFAIU, the non-fd migrations works simply because the portion that VMSD
> loads will always be over-writeable.  When it's not, a pre_load() or
> post_load() would make it work.
> 
>> to get early errors and to decrease further downtime. For fd migration, we postpone fd-initialization
>> up to post-load stage. So, we have "if"s in device code to handle it, and we have "if"s in generic
>> code to support device, which doesn't still have fully initialized backend (no fds during init).
> 
> What I meant is, IMHO we should try to not use things like
> cpr_is_incoming() too deep into the device stack, and we should use it as
> less frequent as possible.
> 
> In many cases, IIUC it's because the current device emulation code is not
> yet separating the FD installation (and also whatever that can be relevant
> to the FD) from the realize() process.  Hence a quick way to make it work
> is to add cpr_is_incoming() or similar helpers either to skip some process,
> or do something different with an existing FD.
> 
> If we can have device emulation be prepared with such, in an ideal world
> and just to show what I am thinking.. it could be:
> 
>    - realize()
>      - realize_frontend()
>      - if migration is incoming, and backend should be postponed (for fd
>        loading, or maybe something else)?
>        - ... realize_backend() postponed until post_load()...
>      - else
>        - realize_backend()
> 
> If all of the devices would support such split of realize() process
> v.s. FDs / backends, _maybe_ we can remove all cpr_is_incoming() but move
> it upper and upper until qdev code, like:
> 
> device_set_realized():
>          if (migration_incoming_XXX() && dc->realize_prepare) {
>              /*
>               * This is only part of realize(), rest done in a separate VMSD
>               * post_load().
>               */
>              dc->realize_prepare(dev, &local_err);
>              if (local_err != NULL) {
>                  goto fail;
>              }
>          } else if (dc->realize) {
>              dc->realize(dev, &local_err);
>              if (local_err != NULL) {
>                  goto fail;
>              }
>          }
> 
> In general, that "whether is incoming fd migration" concept will be passed
> down from higher the stack, rather than randomly checked very deep in
> stack.  That should IMHO make code more maintenable.
> 
> But that's only my two cents.. so please take that with a grain of salt.  I
> don't really know device code well to say.
> 


Thanks for the explanation, I see the idea now. I will see how much of it I can apply to the
TAP series. I believe TAP is a good chance to get the design right, as it's a lot simpler than
vhost-user-blk or vfio.


-- 
Best regards,
Vladimir
Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Fabiano Rosas 3 months, 1 week ago
Steve Sistare <steven.sistare@oracle.com> writes:

> Tap and vhost devices can be preserved during cpr-transfer using
> traditional live migration methods, wherein the management layer
> creates new interfaces for the target and fiddles with 'ip link'
> to deactivate the old interface and activate the new.
>
> However, CPR can simply send the file descriptors to new QEMU,
> with no special management actions required.  The user enables
> this behavior by specifing '-netdev tap,cpr=on'.  The default
> is cpr=off.
>
> Steve Sistare (8):
>   migration: stop vm earlier for cpr
>   migration: cpr setup notifier
>   vhost: reset vhost devices for cpr
>   cpr: delete all fds
>   Revert "vhost-backend: remove vhost_kernel_reset_device()"
>   tap: common return label
>   tap: cpr support
>   tap: postload fix for cpr
>
>  qapi/net.json             |   5 +-
>  include/hw/virtio/vhost.h |   1 +
>  include/migration/cpr.h   |   3 +-
>  include/net/tap.h         |   1 +
>  hw/net/virtio-net.c       |  20 +++++++
>  hw/vfio/device.c          |   2 +-
>  hw/virtio/vhost-backend.c |   6 ++
>  hw/virtio/vhost.c         |  32 +++++++++++
>  migration/cpr.c           |  24 ++++++--
>  migration/migration.c     |  38 ++++++++-----
>  net/tap-win32.c           |   5 ++
>  net/tap.c                 | 141 +++++++++++++++++++++++++++++++++++-----------
>  12 files changed, 223 insertions(+), 55 deletions(-)

Hi Steve,

Patches 1-2 seem to potentially interact with your arm pending
interrupts fix. Do we want them together?
Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Steven Sistare 3 months, 1 week ago
On 8/5/2025 9:54 AM, Fabiano Rosas wrote:
> Steve Sistare <steven.sistare@oracle.com> writes:
> 
>> Tap and vhost devices can be preserved during cpr-transfer using
>> traditional live migration methods, wherein the management layer
>> creates new interfaces for the target and fiddles with 'ip link'
>> to deactivate the old interface and activate the new.
>>
>> However, CPR can simply send the file descriptors to new QEMU,
>> with no special management actions required.  The user enables
>> this behavior by specifing '-netdev tap,cpr=on'.  The default
>> is cpr=off.
>>
>> Steve Sistare (8):
>>    migration: stop vm earlier for cpr
>>    migration: cpr setup notifier
>>    vhost: reset vhost devices for cpr
>>    cpr: delete all fds
>>    Revert "vhost-backend: remove vhost_kernel_reset_device()"
>>    tap: common return label
>>    tap: cpr support
>>    tap: postload fix for cpr
>>
>>   qapi/net.json             |   5 +-
>>   include/hw/virtio/vhost.h |   1 +
>>   include/migration/cpr.h   |   3 +-
>>   include/net/tap.h         |   1 +
>>   hw/net/virtio-net.c       |  20 +++++++
>>   hw/vfio/device.c          |   2 +-
>>   hw/virtio/vhost-backend.c |   6 ++
>>   hw/virtio/vhost.c         |  32 +++++++++++
>>   migration/cpr.c           |  24 ++++++--
>>   migration/migration.c     |  38 ++++++++-----
>>   net/tap-win32.c           |   5 ++
>>   net/tap.c                 | 141 +++++++++++++++++++++++++++++++++++-----------
>>   12 files changed, 223 insertions(+), 55 deletions(-)
> 
> Hi Steve,
> 
> Patches 1-2 seem to potentially interact with your arm pending
> interrupts fix. Do we want them together?

Good observation, thanks!  I may need patches 1-2 to completely close
the dropped interrupt race.  I will do more testing to verify that.

- Steve
Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Steven Sistare 3 months ago
On 8/5/2025 3:53 PM, Steven Sistare wrote:
> On 8/5/2025 9:54 AM, Fabiano Rosas wrote:
>> Steve Sistare <steven.sistare@oracle.com> writes:
>>
>>> Tap and vhost devices can be preserved during cpr-transfer using
>>> traditional live migration methods, wherein the management layer
>>> creates new interfaces for the target and fiddles with 'ip link'
>>> to deactivate the old interface and activate the new.
>>>
>>> However, CPR can simply send the file descriptors to new QEMU,
>>> with no special management actions required.  The user enables
>>> this behavior by specifing '-netdev tap,cpr=on'.  The default
>>> is cpr=off.
>>>
>>> Steve Sistare (8):
>>>    migration: stop vm earlier for cpr
>>>    migration: cpr setup notifier
>>>    vhost: reset vhost devices for cpr
>>>    cpr: delete all fds
>>>    Revert "vhost-backend: remove vhost_kernel_reset_device()"
>>>    tap: common return label
>>>    tap: cpr support
>>>    tap: postload fix for cpr
>>>
>>>   qapi/net.json             |   5 +-
>>>   include/hw/virtio/vhost.h |   1 +
>>>   include/migration/cpr.h   |   3 +-
>>>   include/net/tap.h         |   1 +
>>>   hw/net/virtio-net.c       |  20 +++++++
>>>   hw/vfio/device.c          |   2 +-
>>>   hw/virtio/vhost-backend.c |   6 ++
>>>   hw/virtio/vhost.c         |  32 +++++++++++
>>>   migration/cpr.c           |  24 ++++++--
>>>   migration/migration.c     |  38 ++++++++-----
>>>   net/tap-win32.c           |   5 ++
>>>   net/tap.c                 | 141 +++++++++++++++++++++++++++++++++++-----------
>>>   12 files changed, 223 insertions(+), 55 deletions(-)
>>
>> Hi Steve,
>>
>> Patches 1-2 seem to potentially interact with your arm pending
>> interrupts fix. Do we want them together?
> 
> Good observation, thanks!.  I may need patches 1-2 to completely close
> the dropped interrupt race.  I will do more testing to verify that.

aarch64 does not need patches 1-2 to fix interrupts.
It only relies on MIG_EVENT_PRECOPY_DONE, whose relative ordering is not
affected by patches 1-2.

So, the arm patch gicv3 V3 can be pulled.

I am still looking at patches 1-2 wrt x86 interrupts.

- Steve

Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Peter Xu 3 months, 1 week ago
On Tue, Aug 05, 2025 at 03:53:09PM -0400, Steven Sistare wrote:
> On 8/5/2025 9:54 AM, Fabiano Rosas wrote:
> > Steve Sistare <steven.sistare@oracle.com> writes:
> > 
> > > Tap and vhost devices can be preserved during cpr-transfer using
> > > traditional live migration methods, wherein the management layer
> > > creates new interfaces for the target and fiddles with 'ip link'
> > > to deactivate the old interface and activate the new.
> > > 
> > > However, CPR can simply send the file descriptors to new QEMU,
> > > with no special management actions required.  The user enables
> > > this behavior by specifing '-netdev tap,cpr=on'.  The default
> > > is cpr=off.
> > > 
> > > Steve Sistare (8):
> > >    migration: stop vm earlier for cpr
> > >    migration: cpr setup notifier
> > >    vhost: reset vhost devices for cpr
> > >    cpr: delete all fds
> > >    Revert "vhost-backend: remove vhost_kernel_reset_device()"
> > >    tap: common return label
> > >    tap: cpr support
> > >    tap: postload fix for cpr
> > > 
> > >   qapi/net.json             |   5 +-
> > >   include/hw/virtio/vhost.h |   1 +
> > >   include/migration/cpr.h   |   3 +-
> > >   include/net/tap.h         |   1 +
> > >   hw/net/virtio-net.c       |  20 +++++++
> > >   hw/vfio/device.c          |   2 +-
> > >   hw/virtio/vhost-backend.c |   6 ++
> > >   hw/virtio/vhost.c         |  32 +++++++++++
> > >   migration/cpr.c           |  24 ++++++--
> > >   migration/migration.c     |  38 ++++++++-----
> > >   net/tap-win32.c           |   5 ++
> > >   net/tap.c                 | 141 +++++++++++++++++++++++++++++++++++-----------
> > >   12 files changed, 223 insertions(+), 55 deletions(-)
> > 
> > Hi Steve,
> > 
> > Patches 1-2 seem to potentially interact with your arm pending
> > interrupts fix. Do we want them together?
> 
> Good observation, thanks!.  I may need patches 1-2 to completely close
> the dropped interrupt race.  I will do more testing to verify that.

Sorry for the late response. Could I request (for each of patches 1 & 2)
in-code comments explaining the order of events?

For example, patch 1 moved stop_vm even earlier for CPR.  It used to be
early because it wants to avoid dirty tracking: this is something I didn't
realize but remembered after re-reading the doc.  Now it further needs to
avoid the notifiers.  A comment above stop_vm for cpr explaining this
order of events would be really helpful (including any necessary doc update).

-- 
Peter Xu
Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Lei Yang 3 months, 4 weeks ago
Hi Steve

I tested your patches and hit a problem when enabling/disabling NIC mq
(the full test scenario is local live migration of a VM while NIC mq is
being enabled/disabled):
Run command: /qemu-img info /home/images/vm1.qcow2 --output=json
Error info: qemu-img: Could not open '/home/images/vm1.qcow2': Failed
to get shared "write" lock

In the same environment, if I remove the enable/disable NIC mq steps,
it works well.

Note: this issue does not reproduce 100% of the time, so to confirm
that it is really a bug, I tested the following:
1. Local live migration of the VM while enabling/disabling NIC mq:
reproduced 3 out of 5 times
2. Live migration only, 5 times: all passed.

Thanks
Lei

On Fri, Jul 18, 2025 at 4:48 AM Steve Sistare <steven.sistare@oracle.com> wrote:
>
> Tap and vhost devices can be preserved during cpr-transfer using
> traditional live migration methods, wherein the management layer
> creates new interfaces for the target and fiddles with 'ip link'
> to deactivate the old interface and activate the new.
>
> However, CPR can simply send the file descriptors to new QEMU,
> with no special management actions required.  The user enables
> this behavior by specifing '-netdev tap,cpr=on'.  The default
> is cpr=off.
>
> Steve Sistare (8):
>   migration: stop vm earlier for cpr
>   migration: cpr setup notifier
>   vhost: reset vhost devices for cpr
>   cpr: delete all fds
>   Revert "vhost-backend: remove vhost_kernel_reset_device()"
>   tap: common return label
>   tap: cpr support
>   tap: postload fix for cpr
>
>  qapi/net.json             |   5 +-
>  include/hw/virtio/vhost.h |   1 +
>  include/migration/cpr.h   |   3 +-
>  include/net/tap.h         |   1 +
>  hw/net/virtio-net.c       |  20 +++++++
>  hw/vfio/device.c          |   2 +-
>  hw/virtio/vhost-backend.c |   6 ++
>  hw/virtio/vhost.c         |  32 +++++++++++
>  migration/cpr.c           |  24 ++++++--
>  migration/migration.c     |  38 ++++++++-----
>  net/tap-win32.c           |   5 ++
>  net/tap.c                 | 141 +++++++++++++++++++++++++++++++++++-----------
>  12 files changed, 223 insertions(+), 55 deletions(-)
>
> --
> 1.8.3.1
>
>
Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Lei Yang 3 months, 3 weeks ago
On Fri, Jul 18, 2025 at 4:48 PM Lei Yang <leiyang@redhat.com> wrote:
>
> Hi Steve
>
> I tested your patch which hit a problem under enable/disable nic mq
> state(The full test scenario is live migration vm at local under
> enable/disable vm nic mq state):
> Run command: /qemu-img info /home/images/vm1.qcow2 --output=json
> Error info: qemu-img: Could not open '/home/images/vm1.qcow2': Failed
> to get shared "write" lock
>
> In the same environment if I remove the enable/disable nic mq steps,
> it can work well.
>
> Note: This issue has not been reproduced 100%, but in order to confirm
> this is really a bug, I test the following:
> 1. Live migration vm at local under enable/disable nic mq state: 3/5
> reproduce ratio
> 2. Only live migration 5 times are all passed.
>
After re-running the tests to double-check, the problem mentioned above
does not appear to be related to this patch series, because I also hit
it after multiple tests using the upstream master branch.

Tested-by: Lei Yang <leiyang@redhat.com>
> Thanks
> Lei
>
Re: [RFC V2 0/8] Live update: tap and vhost
Posted by Steven Sistare 3 months, 4 weeks ago
On 7/18/2025 4:48 AM, Lei Yang wrote:
> Hi Steve
> 
> I tested your patch which hit a problem under enable/disable nic mq
> state(The full test scenario is live migration vm at local under
> enable/disable vm nic mq state):
> Run command: /qemu-img info /home/images/vm1.qcow2 --output=json
> Error info: qemu-img: Could not open '/home/images/vm1.qcow2': Failed
> to get shared "write" lock
> 
> In the same environment if I remove the enable/disable nic mq steps,
> it can work well.
> 
> Note: This issue has not been reproduced 100%, but in order to confirm
> this is really a bug, I test the following:
> 1. Live migration vm at local under enable/disable nic mq state: 3/5
> reproduce ratio
> 2. Only live migration 5 times are all passed.

Thank you Lei for testing this.  I would like to debug it, but I need
more information:

* did you add the cpr=on option for tap devices, or is this just
   a regression test of existing functionality?

* What platform: linux or windows?

* Is this a publicly available test that I can run?  I cannot find any
   reference to "Live migration vm at local under enable/disable nic mq state"

If this is not a public test, can you provide more detail on the commands
issued at each step in the test?

Thanks!

- Steve
