[PATCH 00/33] vhost-user-blk: live-backend local migration

Vladimir Sementsov-Ogievskiy posted 33 patches 3 months ago
Failed in applying to current master (apply log)
Maintainers: "Michael S. Tsirkin" <mst@redhat.com>, Stefano Garzarella <sgarzare@redhat.com>, "Gonglei (Arei)" <arei.gonglei@huawei.com>, Zhenwei Pi <pizhenwei@bytedance.com>, "Marc-André Lureau" <marcandre.lureau@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Kevin Wolf <kwolf@redhat.com>, Hanna Reitz <hreitz@redhat.com>, Raphael Norwitz <raphael@enfabrica.net>, Jason Wang <jasowang@redhat.com>, Fam Zheng <fam@euphon.net>, "Alex Bennée" <alex.bennee@linaro.org>, "Daniel P. Berrangé" <berrange@redhat.com>, Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>, Eric Blake <eblake@redhat.com>, Markus Armbruster <armbru@redhat.com>, Thomas Huth <thuth@redhat.com>, "Philippe Mathieu-Daudé" <philmd@linaro.org>, Laurent Vivier <lvivier@redhat.com>
There is a newer version of this series
[PATCH 00/33] vhost-user-blk: live-backend local migration
Posted by Vladimir Sementsov-Ogievskiy 3 months ago
Hi all!

Local migration of vhost-user-blk requires non-trivial actions
from the management layer: it has to provide a new connection for the
new QEMU process and handle moving disk operations from one connection
to the other.

Such switching, including reinitialization of the vhost-user
connection, draining disk requests, etc., adds significantly to local
migration downtime.

This all leads to an idea: why not just pass everything we need from
the old QEMU process to the new one (including open file descriptors)
and not touch the backend at all? This way the vhost-user backend
server will not even know that the QEMU process has changed, as the
live vhost-user connection is migrated.

So this series realizes the idea. No requests are sent to the backend
during migration; instead, all backend-related state and all related
file descriptors (vhost-user connection, guest/host notifiers,
inflight region) are passed to the new process. Of course, the
migration has to go through a unix socket.
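
To illustrate the mechanism (this is not the series code): open file
descriptors can travel between processes over a socket only as
SCM_RIGHTS ancillary data on an AF_UNIX socket, which is exactly why
the main migration channel must be a unix socket. A minimal standalone
sketch of the send and receive sides:

    /* Minimal sketch (not the series code): pass an open fd between
     * processes as SCM_RIGHTS ancillary data on an AF_UNIX socket. */
    #include <string.h>
    #include <sys/socket.h>

    static int send_fd(int sock, int fd)
    {
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        char ctrl[CMSG_SPACE(sizeof(int))] = { 0 };
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }

    static int recv_fd(int sock)
    {
        char byte;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        char ctrl[CMSG_SPACE(sizeof(int))] = { 0 };
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = ctrl, .msg_controllen = sizeof(ctrl),
        };
        int fd = -1;

        if (recvmsg(sock, &msg, 0) != 1) {
            return -1;
        }
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        if (cmsg && cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_RIGHTS) {
            memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
        }
        return fd;  /* same open file description, now usable here */
    }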

Most of the series consists of refactoring patches. The core feature
is spread across patches 24 and 28-31.

Why not CPR-transfer?

1. In the new mode of local migration we need to pass not only
file descriptors but also additional pieces of backend-related state,
which we don't want to (or even can't) reinitialize in the target
process. It's a lot simpler to add new fields to the common migration
stream, and then why not pass the fds in the same stream as well?
(See the sketch below, after this section.)

2. There is no benefit in passing the vhost-user connection fd to the
target at an early stage, before device creation: we can't use it
concurrently with the source QEMU process anyway. So we need a moment
when the source QEMU stops using the fd and the target starts using
it, and the natural place for that moment is the usual save/load of
the device during migration. And yes, we have to deeply rework device
initialization/starting so that it does not reinitialize the backend
but just continues to work with it in the new QEMU process.

3. So, if we can't actually use an fd passed early, before device
creation, there is no reason to care about:
- the non-working QMP connection on the target until the "migrate" command on the source
- an additional migration channel
- implementing code to pass additional non-fd fields together with fds in CPR

However, the series doesn't conflict with CPR-transfer, as it's
actually a usual migration with some additional capabilities. The only
requirement is that the main migration channel must be a unix socket.
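
To make point 1 above concrete, here is a purely hypothetical sketch
(the struct, field and section names are invented for illustration and
are not the actual patch code) of how plain backend-related state
could ride in the normal migration stream as an optional vmstate
subsection:

    /* Hypothetical sketch only -- names below are made up; the real
     * series may structure this differently. */
    #include "qemu/osdep.h"
    #include "migration/vmstate.h"

    typedef struct BackendTransferState {
        bool backend_transfer;    /* transfer negotiated for this migration */
        uint64_t acked_features;  /* features the backend already acked */
        uint32_t inflight_size;   /* size of the shared inflight region */
    } BackendTransferState;

    static bool backend_transfer_needed(void *opaque)
    {
        BackendTransferState *s = opaque;

        /* Only send the subsection when the feature is in use, so the
         * stream stays compatible with ordinary migration. */
        return s->backend_transfer;
    }

    static const VMStateDescription vmstate_backend_transfer = {
        .name = "vhost-user-blk/backend-transfer",
        .version_id = 1,
        .minimum_version_id = 1,
        .needed = backend_transfer_needed,
        .fields = (const VMStateField[]) {
            VMSTATE_UINT64(acked_features, BackendTransferState),
            VMSTATE_UINT32(inflight_size, BackendTransferState),
            VMSTATE_END_OF_LIST()
        },
    };

The fds themselves (vhost-user socket, notifier eventfds, inflight
region) cannot go through vmstate fields, so they would be passed as
SCM_RIGHTS ancillary data on the same unix channel, as in the earlier
sketch.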

Vladimir Sementsov-Ogievskiy (33):
  vhost: introduce vhost_ops->vhost_set_vring_enable_supported method
  vhost: drop backend_features field
  vhost-user: introduce vhost_user_has_prot() helper
  vhost: move protocol_features to vhost_user
  vhost-user-gpu: drop code duplication
  vhost: make vhost_dev.features private
  virtio: move common part of _set_guest_notifier to generic code
  virtio: drop *_set_guest_notifier_fd_handler() helpers
  vhost-user: keep QIOChannelSocket for backend channel
  vhost: vhost_virtqueue_start(): fix failure path
  vhost: make vhost_memory_unmap() null-safe
  vhost: simplify calls to vhost_memory_unmap()
  vhost: move vrings mapping to the top of vhost_virtqueue_start()
  vhost: vhost_virtqueue_start(): drop extra local variables
  vhost: final refactoring of vhost vrings map/unmap
  vhost: simplify vhost_dev_init() error-path
  vhost: move busyloop timeout initialization to vhost_virtqueue_init()
  vhost: introduce check_memslots() helper
  vhost: vhost_dev_init(): drop extra features variable
  hw/virtio/virtio-bus: refactor virtio_bus_set_host_notifier()
  vhost-user: make trace events more readable
  vhost-user-blk: add some useful trace-points
  vhost: add some useful trace-points
  chardev-add: support local migration
  virtio: introduce .skip_vhost_migration_log() handler
  io/channel-socket: introduce qio_channel_socket_keep_nonblock()
  migration/socket: keep fds non-block
  vhost: introduce backend migration
  vhost-user: support backend migration
  virtio: support vhost backend migration
  vhost-user-blk: support vhost backend migration
  test/functional: exec_command_and_wait_for_pattern: add vm arg
  tests/functional: add test_x86_64_vhost_user_blk_fd_migration.py

 backends/cryptodev-vhost.c                    |   1 -
 chardev/char-socket.c                         | 101 +++-
 hw/block/trace-events                         |  10 +
 hw/block/vhost-user-blk.c                     | 201 ++++++--
 hw/display/vhost-user-gpu.c                   |  11 +-
 hw/net/vhost_net.c                            |  27 +-
 hw/scsi/vhost-scsi.c                          |   1 -
 hw/scsi/vhost-user-scsi.c                     |   1 -
 hw/virtio/trace-events                        |  12 +-
 hw/virtio/vdpa-dev.c                          |   3 +-
 hw/virtio/vhost-user-base.c                   |   8 +-
 hw/virtio/vhost-user.c                        | 326 +++++++++---
 hw/virtio/vhost.c                             | 474 ++++++++++++------
 hw/virtio/virtio-bus.c                        |  20 +-
 hw/virtio/virtio-hmp-cmds.c                   |   2 -
 hw/virtio/virtio-mmio.c                       |  41 +-
 hw/virtio/virtio-pci.c                        |  34 +-
 hw/virtio/virtio-qmp.c                        |  10 +-
 hw/virtio/virtio.c                            | 120 ++++-
 include/chardev/char-socket.h                 |   3 +
 include/hw/virtio/vhost-backend.h             |  10 +
 include/hw/virtio/vhost-user-blk.h            |   2 +
 include/hw/virtio/vhost.h                     |  42 +-
 include/hw/virtio/virtio-pci.h                |   3 -
 include/hw/virtio/virtio.h                    |  11 +-
 include/io/channel-socket.h                   |   3 +
 io/channel-socket.c                           |  16 +-
 migration/options.c                           |  14 +
 migration/options.h                           |   2 +
 migration/socket.c                            |   1 +
 net/vhost-vdpa.c                              |   7 +-
 qapi/char.json                                |  16 +-
 qapi/migration.json                           |  19 +-
 qapi/virtio.json                              |   3 -
 stubs/meson.build                             |   1 +
 stubs/qemu_file.c                             |  15 +
 stubs/vmstate.c                               |   6 +
 tests/functional/qemu_test/cmd.py             |   7 +-
 ...test_x86_64_vhost_user_blk_fd_migration.py | 279 +++++++++++
 tests/qtest/meson.build                       |   2 +-
 tests/unit/meson.build                        |   4 +-
 41 files changed, 1420 insertions(+), 449 deletions(-)
 create mode 100644 stubs/qemu_file.c
 create mode 100644 tests/functional/test_x86_64_vhost_user_blk_fd_migration.py

-- 
2.48.1
Re: [PATCH 00/33] vhost-user-blk: live-backend local migration
Posted by Raphael Norwitz 1 month ago
My apologies for the late review here. I appreciate the need to work
around these issues but I do feel the approach complicates Qemu
significantly and it may be possible to achieve similar results
managing state inside the backend. More comments inline.

I like a lot of the cleanups here - maybe consider breaking out a
series with some of the cleanups?

On Wed, Aug 13, 2025 at 12:56 PM Vladimir Sementsov-Ogievskiy
<vsementsov@yandex-team.ru> wrote:
>
> Hi all!
>
> Local migration of vhost-user-blk requires non-trivial actions
> from management layer, it should provide a new connection for new
> QEMU process and handle disk operation movement from one connection
> to another.
>
> Such switching, including reinitialization of vhost-user connection,
> draining disk requests, etc, adds significant value to local migration
> downtime.

I see how draining IO requests adds downtime and is impactful. That
said, we need to start-stop the device anyways so I'm not convinced
that setting up mappings and sending messages back and forth are
impactful enough to warrant adding a whole new migration mode. Am I
missing anything here?

>
> This all leads to an idea: why not to just pass all we need from
> old QEMU process to the new one (including open file descriptors),
> and don't touch the backend at all? This way, the vhost user backend
> server will not even know, that QEMU process is changed, as live
> vhost-user connection is migrated.

Alternatively, if it really is about avoiding IO draining, what if
Qemu advertised a new vhost-user protocol feature which would query
whether the backend already has state for the device? Then, if the
backend indicates that it does, Qemu and the backend can take a
different path in vhost-user, exchanging relevant information,
including the descriptor indexes for the VQs such that draining can be
avoided. I expect that could be implemented to cut down a lot of the
other vhost-user overhead anyways (i.e. you could skip setting the
memory table). If nothing else it would probably help other device
types take advantage of this without adding more options to Qemu.

Thoughts?

>
> So this series realize the idea. No requests are done to backend
> during migration, instead all backend-related state and all related
> file descriptors (vhost-user connection, guest/host notifiers,
> inflight region) are passed to new process. Of course, migration
> should go through unix socket.
>
> The most of the series are refactoring patches. The core feature is
> spread between 24, 28-31 patches.
>
> Why not CPR-transfer?
>
> 1. In the new mode of local migration we need to pass not only
> file descriptors, but additional parts of backend-related state,
> which we don't want (or even can't) reinitialize in target process.
> And it's a lot simpler to add new fields to common migration stream.
> And why not to pass fds in the same stream?
>
> 2. No benefit of vhost-user connection fd passed to target in early
> stage before device creation: we can't use it together with source
> QEMU process anyway. So, we need a moment, when source qemu stops using
> the fd, and target start doing it. And native place for this moment is
> usual save/load of the device in migration process. And yes, we have to
> deeply update initialization/starting of the device to not reinitialize
> the backend, but just continue to work with it in a new QEMU process.
>
> 3. So, if we can't actually use fd, passed early before device creation,
> no reason to care about:
> - non-working QMP connection on target until "migrate" command on source
> - additional migration channel
> - implementing code to pass additional non-fd fields together with fds in CPR
>
> However, the series doesn't conflict with CPR-transfer, as it's actually
> a usual migration with some additional capabilities. The only
> requirement is that main migration channel should be a unix socket.
>
> Vladimir Sementsov-Ogievskiy (33):
>   vhost: introduce vhost_ops->vhost_set_vring_enable_supported method
>   vhost: drop backend_features field
>   vhost-user: introduce vhost_user_has_prot() helper
>   vhost: move protocol_features to vhost_user
>   vhost-user-gpu: drop code duplication
>   vhost: make vhost_dev.features private
>   virtio: move common part of _set_guest_notifier to generic code
>   virtio: drop *_set_guest_notifier_fd_handler() helpers
>   vhost-user: keep QIOChannelSocket for backend channel
>   vhost: vhost_virtqueue_start(): fix failure path
>   vhost: make vhost_memory_unmap() null-safe
>   vhost: simplify calls to vhost_memory_unmap()
>   vhost: move vrings mapping to the top of vhost_virtqueue_start()
>   vhost: vhost_virtqueue_start(): drop extra local variables
>   vhost: final refactoring of vhost vrings map/unmap
>   vhost: simplify vhost_dev_init() error-path
>   vhost: move busyloop timeout initialization to vhost_virtqueue_init()
>   vhost: introduce check_memslots() helper
>   vhost: vhost_dev_init(): drop extra features variable
>   hw/virtio/virtio-bus: refactor virtio_bus_set_host_notifier()
>   vhost-user: make trace events more readable
>   vhost-user-blk: add some useful trace-points
>   vhost: add some useful trace-points
>   chardev-add: support local migration
>   virtio: introduce .skip_vhost_migration_log() handler
>   io/channel-socket: introduce qio_channel_socket_keep_nonblock()
>   migration/socket: keep fds non-block
>   vhost: introduce backend migration
>   vhost-user: support backend migration
>   virtio: support vhost backend migration
>   vhost-user-blk: support vhost backend migration
>   test/functional: exec_command_and_wait_for_pattern: add vm arg
>   tests/functional: add test_x86_64_vhost_user_blk_fd_migration.py
>
>  backends/cryptodev-vhost.c                    |   1 -
>  chardev/char-socket.c                         | 101 +++-
>  hw/block/trace-events                         |  10 +
>  hw/block/vhost-user-blk.c                     | 201 ++++++--
>  hw/display/vhost-user-gpu.c                   |  11 +-
>  hw/net/vhost_net.c                            |  27 +-
>  hw/scsi/vhost-scsi.c                          |   1 -
>  hw/scsi/vhost-user-scsi.c                     |   1 -
>  hw/virtio/trace-events                        |  12 +-
>  hw/virtio/vdpa-dev.c                          |   3 +-
>  hw/virtio/vhost-user-base.c                   |   8 +-
>  hw/virtio/vhost-user.c                        | 326 +++++++++---
>  hw/virtio/vhost.c                             | 474 ++++++++++++------
>  hw/virtio/virtio-bus.c                        |  20 +-
>  hw/virtio/virtio-hmp-cmds.c                   |   2 -
>  hw/virtio/virtio-mmio.c                       |  41 +-
>  hw/virtio/virtio-pci.c                        |  34 +-
>  hw/virtio/virtio-qmp.c                        |  10 +-
>  hw/virtio/virtio.c                            | 120 ++++-
>  include/chardev/char-socket.h                 |   3 +
>  include/hw/virtio/vhost-backend.h             |  10 +
>  include/hw/virtio/vhost-user-blk.h            |   2 +
>  include/hw/virtio/vhost.h                     |  42 +-
>  include/hw/virtio/virtio-pci.h                |   3 -
>  include/hw/virtio/virtio.h                    |  11 +-
>  include/io/channel-socket.h                   |   3 +
>  io/channel-socket.c                           |  16 +-
>  migration/options.c                           |  14 +
>  migration/options.h                           |   2 +
>  migration/socket.c                            |   1 +
>  net/vhost-vdpa.c                              |   7 +-
>  qapi/char.json                                |  16 +-
>  qapi/migration.json                           |  19 +-
>  qapi/virtio.json                              |   3 -
>  stubs/meson.build                             |   1 +
>  stubs/qemu_file.c                             |  15 +
>  stubs/vmstate.c                               |   6 +
>  tests/functional/qemu_test/cmd.py             |   7 +-
>  ...test_x86_64_vhost_user_blk_fd_migration.py | 279 +++++++++++
>  tests/qtest/meson.build                       |   2 +-
>  tests/unit/meson.build                        |   4 +-
>  41 files changed, 1420 insertions(+), 449 deletions(-)
>  create mode 100644 stubs/qemu_file.c
>  create mode 100644 tests/functional/test_x86_64_vhost_user_blk_fd_migration.py
>
> --
> 2.48.1
>
>
Re: [PATCH 00/33] vhost-user-blk: live-backend local migration
Posted by Vladimir Sementsov-Ogievskiy 1 month ago
On 09.10.25 22:16, Raphael Norwitz wrote:
> My apologies for the late review here. I appreciate the need to work
> around these issues but I do feel the approach complicates Qemu
> significantly and it may be possible to achieve similar results
> managing state inside the backend. More comments inline.
> 
> I like a lot of the cleanups here - maybe consider breaking out a
> series with some of the cleanups?

Of course, I thought about that too.

> 
> On Wed, Aug 13, 2025 at 12:56 PM Vladimir Sementsov-Ogievskiy
> <vsementsov@yandex-team.ru> wrote:
>>
>> Hi all!
>>
>> Local migration of vhost-user-blk requires non-trivial actions
>> from management layer, it should provide a new connection for new
>> QEMU process and handle disk operation movement from one connection
>> to another.
>>
>> Such switching, including reinitialization of vhost-user connection,
>> draining disk requests, etc, adds significant value to local migration
>> downtime.
> 
> I see how draining IO requests adds downtime and is impactful. That
> said, we need to start-stop the device anyways

No, with this series and the new feature enabled we don't have this
drain; see

     if (dev->backend_transfer) {
         return 0;
     }

at the start of do_vhost_virtqueue_stop().

> so I'm not convinced
> that setting up mappings and sending messages back and forth are
> impactful enough to warrant adding a whole new migration mode. Am I
> missing anything here?

In the management layer we have to manage two endpoints for the remote
disk and orchestrate a safe switch from one to the other. That's a
complicated and often long procedure, which contributes an average
delay of 0.6 seconds and, which is worse, ~2.4 seconds at p99.

Of course, you may say "just rewrite your management layer to
work better" :) But that's not simple, and we came to the idea that
we can do the whole local migration on the QEMU side, without touching
the backend at all.

The main benefit: fewer participants. We don't rely on the management
layer and the vhost-user server to do the right things for migration.
The backend doesn't even know that QEMU has been updated. This makes
the whole process simpler and therefore safer.

The disk service may also be temporarily down at some point, which of
course has a bad effect on live migration and its freeze time. We
avoid this issue with my series (as we don't communicate with the
backend in any way during migration, and the disk service doesn't have
to manage any endpoint switching).

Note also that my series is not without precedent in QEMU, and not a
totally new mode.

Steve Sistare has been working on the idea of passing backends through
a UNIX socket; it is now merged as the cpr-transfer and cpr-exec
migration modes, and supports VFIO devices.

So my work applies this existing concept to vhost-user-blk and
virtio-net, and may be used as part of cpr-transfer / cpr-exec, or
separately.

> 
>>
>> This all leads to an idea: why not to just pass all we need from
>> old QEMU process to the new one (including open file descriptors),
>> and don't touch the backend at all? This way, the vhost user backend
>> server will not even know, that QEMU process is changed, as live
>> vhost-user connection is migrated.
> 
> Alternatively, if it really is about avoiding IO draining, what if
> Qemu advertised a new vhost-user protocol feature which would query
> whether the backend already has state for the device? Then, if the
> backend indicates that it does, Qemu and the backend can take a
> different path in vhost-user, exchanging relevant information,
> including the descriptor indexes for the VQs such that draining can be
> avoided. I expect that could be implemented to cut down a lot of the
> other vhost-user overhead anyways (i.e. you could skip setting the
> memory table). If nothing else it would probably help other device
> types take advantage of this without adding more options to Qemu.
> 

Hmm, if we talk only about draining, as I understand it, the only
thing we need is support for migrating the "inflight region". This is
done in the series, and we are also preparing a separate feature to
support migrating the inflight region for remote migration.
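
For context, a simplified illustration (not the series code): the
vhost-user inflight region is shared anonymous memory behind a file
descriptor, in which the backend records requests it has fetched but
not yet completed. Because it is only an fd plus a size, it can be
handed to the new QEMU process and mapped there unchanged, so the
in-flight bookkeeping survives the handover:

    /* Simplified illustration: create a shareable "inflight"-style
     * region; the fd can be passed to another process (e.g. via
     * SCM_RIGHTS) and mmap()ed there to see the same contents. */
    #define _GNU_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define INFLIGHT_REGION_SIZE 4096   /* illustrative size only */

    static void *create_inflight_region(int *fd_out)
    {
        void *addr;
        int fd = memfd_create("inflight-region", MFD_CLOEXEC);

        if (fd < 0) {
            return NULL;
        }
        if (ftruncate(fd, INFLIGHT_REGION_SIZE) < 0) {
            close(fd);
            return NULL;
        }
        addr = mmap(NULL, INFLIGHT_REGION_SIZE, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED) {
            close(fd);
            return NULL;
        }
        *fd_out = fd;    /* hand this fd to the other process */
        return addr;     /* both sides now share the same records */
    }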

But for local migration we want more: to remove the disk service from
the process entirely, to get a guaranteed small downtime for live
updates, independent of any problems which may occur on the disk
service side.

Why is freeze time more sensitive for live updates than for remote
migration? Because we have to run a lot of live-update operations:
simply updating all the VMs in the cloud to a new version. Remote
migration happens much less frequently: when we need to move all VMs
off a physical server to reboot it (or repair it, service it, etc.).

So I still believe that migrating backend state through the QEMU
migration stream makes sense in general, and for vhost-user-blk it
works well too.


-- 
Best regards,
Vladimir

Re: [PATCH 00/33] vhost-user-blk: live-backend local migration
Posted by Raphael Norwitz 1 month ago
Thanks for the detailed response here, it does clear up the intent.

I agree it's much better to keep the management layer from having to
make API calls back and forth to the backend so that the migration
looks like a reconnect from the backend's perspective. I'm not totally
clear on the fundamental reason why the management layer would have to
call out to the backend, as opposed to having the vhost-user code in
the backend figure out that it's a local migration when the new
destination QEMU tries to connect and respond accordingly.

That said, I haven't followed the work here all that closely. If MST
or other maintainers have blessed this as the right way I'm ok with
it.

On Thu, Oct 9, 2025 at 6:43 PM Vladimir Sementsov-Ogievskiy
<vsementsov@yandex-team.ru> wrote:
>
> On 09.10.25 22:16, Raphael Norwitz wrote:
> > My apologies for the late review here. I appreciate the need to work
> > around these issues but I do feel the approach complicates Qemu
> > significantly and it may be possible to achieve similar results
> > managing state inside the backend. More comments inline.
> >
> > I like a lot of the cleanups here - maybe consider breaking out a
> > series with some of the cleanups?
>
> Of course, I thought about that too.
>
> >
> > On Wed, Aug 13, 2025 at 12:56 PM Vladimir Sementsov-Ogievskiy
> > <vsementsov@yandex-team.ru> wrote:
> >>
> >> Hi all!
> >>
> >> Local migration of vhost-user-blk requires non-trivial actions
> >> from management layer, it should provide a new connection for new
> >> QEMU process and handle disk operation movement from one connection
> >> to another.
> >>
> >> Such switching, including reinitialization of vhost-user connection,
> >> draining disk requests, etc, adds significant value to local migration
> >> downtime.
> >
> > I see how draining IO requests adds downtime and is impactful. That
> > said, we need to start-stop the device anyways
>
> No, with this series and new feature enabled we don't have this drain,
> see
>
>      if (dev->backend_transfer) {
>          return 0;
>      }
>
> at start of do_vhost_virtqueue_stop().
>
> > so I'm not convinced
> > that setting up mappings and sending messages back and forth are
> > impactful enough to warrant adding a whole new migration mode. Am I
> > missing anything here?
>
> In management layer we have to manage two end-points for remote
> disk, and accompany a safe switch from one to another. That's
> complicated and often long procedure, which contributes an
> average delay of 0.6 seconds, and (which is worse) ~2.4 seconds
> in p99.
>
> Of course, you may say "just rewrite your management layer to
> work better":) But that's not simple, and we came to idea, that
> we can do the whole local migration at QEMU side, not touching
> backend at all.
>
> The main benefit: fewer participants. We don't rely on management layer
> and vhost-user server to do proper things for migration. Backend even
> don't know, that QEMU is updated. This makes the whole process
> simpler and therefore safer.
>
> The disk service may also be temporarily down at some time, which of course has
> a bad effect on live migration and its freeze-time. We avoid this
> issue with my series (as we don't communicate to the backend in
> any way during migration, and disk service should not manage any
> endpoints switching)
>
> Note also, that my series is not a precedent in QEMU, and not a totally new
> mode.
>
> Steve Sistare works on the idea to pass backends through UNIX socket, and it
> is now merged as cpr-transfer and cpr-exec migration modes, and supports
> VFIO devices.
>
> So, my work shares this existing concept on vhost-user-blk and virtio-net,
> and may be used as part of cpr-transfer / cpr-exec, or in separate.
>
> >
> >>
> >> This all leads to an idea: why not to just pass all we need from
> >> old QEMU process to the new one (including open file descriptors),
> >> and don't touch the backend at all? This way, the vhost user backend
> >> server will not even know, that QEMU process is changed, as live
> >> vhost-user connection is migrated.
> >
> > Alternatively, if it really is about avoiding IO draining, what if
> > Qemu advertised a new vhost-user protocol feature which would query
> > whether the backend already has state for the device? Then, if the
> > backend indicates that it does, Qemu and the backend can take a
> > different path in vhost-user, exchanging relevant information,
> > including the descriptor indexes for the VQs such that draining can be
> > avoided. I expect that could be implemented to cut down a lot of the
> > other vhost-user overhead anyways (i.e. you could skip setting the
> > memory table). If nothing else it would probably help other device
> > types take advantage of this without adding more options to Qemu.
> >
>
> Hmm, if say only about draining, as I understand, the only thing we need
> is support migrating of "inflight region". This done in the series,
> and we are also preparing a separate feature to support migrating
> inflight region for remote migration.
>
> But, for local migration we want more: remove disk service from
> the process at all, to have a guaranteed small downtime for live updates.
> independent of any problems which may occur on disk service side.
>
> Why freeze-time is more sensitive for live-updates than for remote
> migration? Because we have to run a lot of live-update operations:
> simply update all the vms in the cloud to a new version. Remote
> migration happens much less frequently: when we need to move all
> vms from physical server to reboot it (or repair it, serve it, etc).
>
> So, I still believe, that migrating backend states through QEMU migration
> stream makes sense in general, and for vhost-user-blk it works well too.
>
>
> --
> Best regards,
> Vladimir
Re: [PATCH 00/33] vhost-user-blk: live-backend local migration
Posted by Vladimir Sementsov-Ogievskiy 1 month ago
On 10.10.25 02:28, Raphael Norwitz wrote:
> Thanks for the detailed response here, it does clear up the intent.
> 
> I agree it's much better to keep the management layer from having to
> make API calls back and forth to the backend so that the migration
> looks like a reconnect from the backend's perspective. I'm not totally
> clear on the fundamental reason why the management layer would have to
> call out to the backend, as opposed to having the vhost-user code in
> the backend figure out that it's a local migration when the new
> destination QEMU tries to connect and respond accordingly.
> 

Handling this in vhost-user-server without the management layer would
actually mean handling two connections in parallel. This doesn't seem
to fit well into the vhost-user protocol.

However, we already have this support (as we have live update for VMs
with vhost-user-blk) in the disk service by accepting a new connection
on an additional Unix socket servicing the same disk but in readonly
mode until the initial connection terminates. The problem isn't with
the separate socket itself, but with safely switching the disk backend
from one connection to another. We would have to perform this switch
regardless, even if we managed both connections within the context of a
single server or a single Unix socket. The only difference is that this
way, we might avoid communication from the management layer to the disk
service. Instead of saying, "Hey, disk service, we're going to migrate
this QEMU - prepare for an endpoint switch," we'd just proceed with the
migration, and the disk service would detect it when it sees a second
connection to the Unix socket.

But this extra communication isn't the real issue. The real challenge
is that we still have to switch between connections on the backend
side. And we have to account for the possible temporary unavailability
of the disk service (the migration freeze time would just include this
period of unavailability).

With this series, we're saying: "Hold on. We already have everything
working and set up—the backend is ready, the dataplane is out of QEMU,
and the control plane isn't doing anything. And we're migrating to the
same host. Why not just keep everything as is? Just pass the file
descriptors to the new QEMU process and continue execution."

This way, we make the QEMU live-update operation independent of the
disk service's lifecycle, which improves reliability. And we maintain
only one connection instead of two, making the model simpler.

This doesn't even account for the extra time spent reconfiguring the
connection. Setting up mappings isn't free and becomes more costly for
large VMs (with significant RAM), when using hugetlbfs, or when the
system is under memory pressure.


> That said, I haven't followed the work here all that closely. If MST
> or other maintainers have blessed this as the right way I'm ok with
> it.
> 



-- 
Best regards,
Vladimir

Re: [PATCH 00/33] vhost-user-blk: live-backend local migration
Posted by Raphael Norwitz 1 month ago
Thanks for the extensive follow up here. I was hoping there would be
some way to move more of the logic into all vhost-user generic code
both to help other backends support local migration more easily and
have fewer "if backend is doing a local migration" checks in
vhost-user-blk code. As a straw man design, I would think there could
be some way of having the backend coordinate a handoff by signaling
the source Qemu and then the source Qemu could stop the device and ACK
with a message before the destination Qemu is allowed to start the
device.

Anyways, it seems like other maintainers have blessed this approach so
I'll leave it at that.

On Fri, Oct 10, 2025 at 4:47 AM Vladimir Sementsov-Ogievskiy
<vsementsov@yandex-team.ru> wrote:
>
> On 10.10.25 02:28, Raphael Norwitz wrote:
> > Thanks for the detailed response here, it does clear up the intent.
> >
> > I agree it's much better to keep the management layer from having to
> > make API calls back and forth to the backend so that the migration
> > looks like a reconnect from the backend's perspective. I'm not totally
> > clear on the fundamental reason why the management layer would have to
> > call out to the backend, as opposed to having the vhost-user code in
> > the backend figure out that it's a local migration when the new
> > destination QEMU tries to connect and respond accordingly.
> >
>
> Handling this in vhost-user-server without the management layer would
> actually mean handling two connections in parallel. This doesn't seem
> to fit well into the vhost-user protocol.
>
> However, we already have this support (as we have live update for VMs
> with vhost-user-blk) in the disk service by accepting a new connection
> on an additional Unix socket servicing the same disk but in readonly
> mode until the initial connection terminates. The problem isn't with
> the separate socket itself, but with safely switching the disk backend
> from one connection to another. We would have to perform this switch
> regardless, even if we managed both connections within the context of a
> single server or a single Unix socket. The only difference is that this
> way, we might avoid communication from the management layer to the disk
> service. Instead of saying, "Hey, disk service, we're going to migrate
> this QEMU - prepare for an endpoint switch," we'd just proceed with the
> migration, and the disk service would detect it when it sees a second
> connection to the Unix socket.
>
> But this extra communication isn't the real issue. The real challenge
> is that we still have to switch between connections on the backend
> side. And we have to account for the possible temporary unavailability
> of the disk service (the migration freeze time would just include this
> period of unavailability).
>
> With this series, we're saying: "Hold on. We already have everything
> working and set up—the backend is ready, the dataplane is out of QEMU,
> and the control plane isn't doing anything. And we're migrating to the
> same host. Why not just keep everything as is? Just pass the file
> descriptors to the new QEMU process and continue execution."
>
> This way, we make the QEMU live-update operation independent of the
> disk service's lifecycle, which improves reliability. And we maintain
> only one connection instead of two, making the model simpler.
>
> This doesn't even account for the extra time spent reconfiguring the
> connection. Setting up mappings isn't free and becomes more costly for
> large VMs (with significant RAM), when using hugetlbfs, or when the
> system is under memory pressure.
>
>
> > That said, I haven't followed the work here all that closely. If MST
> > or other maintainers have blessed this as the right way I'm ok with
> > it.
> >
>
>
>
> --
> Best regards,
> Vladimir