There's currently no documentation for multifd, we can at least
provide an overview of the feature.
Signed-off-by: Fabiano Rosas <farosas@suse.de>
---
Keep in mind the feature grew organically over the years and it has
had bugs that required reinventing some concepts, specially on the
sync part, so there's still some amount of inconsistency in the code
and that's not going to be fixed by documentation.
---
docs/devel/migration/features.rst | 1 +
docs/devel/migration/multifd.rst | 254 ++++++++++++++++++++++++++++++
2 files changed, 255 insertions(+)
create mode 100644 docs/devel/migration/multifd.rst
diff --git a/docs/devel/migration/features.rst b/docs/devel/migration/features.rst
index 8f431d52f9..249d653124 100644
--- a/docs/devel/migration/features.rst
+++ b/docs/devel/migration/features.rst
@@ -15,3 +15,4 @@ Migration has plenty of features to support different use cases.
qpl-compression
uadk-compression
qatzip-compression
+ multifd
diff --git a/docs/devel/migration/multifd.rst b/docs/devel/migration/multifd.rst
new file mode 100644
index 0000000000..8f5ec840cb
--- /dev/null
+++ b/docs/devel/migration/multifd.rst
@@ -0,0 +1,254 @@
+Multifd
+=======
+
+Multifd is the name given for the migration capability that enables
+data transfer using multiple threads. Multifd supports all the
+transport types currently in use with migration (inet, unix, vsock,
+fd, file).
+
+Usage
+-----
+
+On both source and destination, enable the ``multifd`` capability:
+
+ ``migrate_set_capability multifd on``
+
+Define a number of channels to use (default is 2, but 8 usually
+provides best performance).
+
+ ``migrate_set_parameter multifd-channels 8``
+
+Restrictions
+------------
+
+For migration to a file, support is conditional on the presence of the
+mapped-ram capability, see `mapped-ram`.
+
+Snapshots are currently not supported.
+
+`postcopy` migration is currently not supported.
+
+Components
+----------
+
+Multifd consists of:
+
+- A client that produces the data on the migration source side and
+ consumes it on the destination. Currently the main client code is
+ ram.c, which selects the RAM pages for migration;
+
+- A shared data structure (``MultiFDSendData``), used to transfer data
+ between multifd and the client. On the source side, this structure
+ is further subdivided into payload types (``MultiFDPayload``);
+
+- An API operating on the shared data structure to allow the client
+ code to interact with multifd;
+
+ - ``multifd_send/recv()``: Transfers work to/from the channels.
+
+ - ``multifd_*payload_*`` and ``MultiFDPayloadType``: Support
+ defining an opaque payload. The payload is always wrapped by
+ ``MultiFD*Data``.
+
+ - ``multifd_send_data_*``: Used to manage the memory for the shared
+ data structure.
+
+ - ``multifd_*_sync_main()``: See :ref:`synchronization` below.
+
+- A set of threads (aka channels, due to a 1:1 mapping to QIOChannels)
+ responsible for doing I/O. Each multifd channel supports callbacks
+ (``MultiFDMethods``) that can be used for fine-grained processing of
+ the payload, such as compression and zero page detection.
+
+- A packet which is the final result of all the data aggregation
+ and/or transformation. The packet contains: a *header* with magic and
+ version numbers and flags that inform of special processing needed
+ on the destination; a *payload-specific header* with metadata referent
+ to the packet's data portion, e.g. page counts; and a variable-size
+ *data portion* which contains the actual opaque payload data.
+
+ Note that due to historical reasons, the terminology around multifd
+ packets is inconsistent.
+
+ The `mapped-ram` feature ignores packets entirely.
+
+Operation
+---------
+
+The multifd channels operate in parallel with the main migration
+thread. The transfer of data from a client code into multifd happens
+from the main migration thread using the multifd API.
+
+The interaction between the client code and the multifd channels
+happens in the ``multifd_send()`` and ``multifd_recv()``
+methods. These are reponsible for selecting the next idle channel and
+making the shared data structure containing the payload accessible to
+that channel. The client code receives back an empty object which it
+then uses for the next iteration of data transfer.
+
+The selection of idle channels is simply a round-robin over the idle
+channels (``!p->pending_job``). Channels wait at a semaphore and once
+a channel is released it starts operating on the data immediately.
+
+Aside from eventually transmitting the data over the underlying
+QIOChannel, a channel's operation also includes calling back to the
+client code at pre-determined points to allow for client-specific
+handling such as data transformation (e.g. compression), creation of
+the packet header and arranging the data into iovs (``struct
+iovec``). Iovs are the type of data on which the QIOChannel operates.
+
+A high-level flow for each thread is:
+
+Migration thread:
+
+#. Populate shared structure with opaque data (e.g. ram pages)
+#. Call ``multifd_send()``
+
+ #. Loop over the channels until one is idle
+ #. Switch pointers between client data and channel data
+ #. Release channel semaphore
+#. Receive back empty object
+#. Repeat
+
+Multifd thread:
+
+#. Channel idle
+#. Gets released by ``multifd_send()``
+#. Call ``MultiFDMethods`` methods to fill iov
+
+ #. Compression may happen
+ #. Zero page detection may happen
+ #. Packet is written
+ #. iov is written
+#. Pass iov into QIOChannel for transferring (I/O happens here)
+#. Repeat
+
+The destination side operates similarly but with ``multifd_recv()``,
+decompression instead of compression, etc. One important aspect is
+that when receiving the data, the iov will contain host virtual
+addresses, so guest memory is written to directly from multifd
+threads.
+
+About flags
+-----------
+The main thread orchestrates the migration by issuing control flags on
+the migration stream (``QEMU_VM_*``).
+
+The main memory is migrated by ram.c and includes specific control
+flags that are also put on the main migration stream
+(``RAM_SAVE_FLAG_*``).
+
+Multifd has its own set of flags (``MULTIFD_FLAG_*``) that are
+included into each packet. These may inform about properties such as
+the compression algorithm used if the data is compressed.
+
+.. _synchronization:
+
+Synchronization
+---------------
+
+Data sent through multifd may arrive out of order and with different
+timing. Some clients may also have synchronization requirements to
+ensure data consistency, e.g. the RAM migration must ensure that
+memory pages received by the destination machine are ordered in
+relation to previous iterations of dirty tracking.
+
+Some cleanup tasks such as memory deallocation or error handling may
+need to happen only after all channels have finished sending/receiving
+the data.
+
+Multifd provides the ``multifd_send_sync_main()`` and
+``multifd_recv_sync_main()`` helpers to synchronize the main migration
+thread with the multifd channels. In addition, these helpers also
+trigger the emission of a sync packet (``MULTIFD_FLAG_SYNC``) which
+carries the synchronization command to the remote side of the
+migration.
+
+After the channels have been put into a wait state by the sync
+functions, the client code may continue to transmit additional data by
+issuing ``multifd_send()`` once again.
+
+Note:
+
+- the RAM migration does, effectively, a global synchronization by
+ chaining a call to ``multifd_send_sync_main()`` with the emission of a
+ flag on the main migration channel (``RAM_SAVE_FLAG_MULTIFD_FLUSH``)
+ which in turn causes ``multifd_recv_sync_main()`` to be called on the
+ destination.
+
+ There are also backward compatibility concerns expressed by
+ ``multifd_ram_sync_per_section()`` and
+ ``multifd_ram_sync_per_round()``. See the code for detailed
+ documentation.
+
+- the `mapped-ram` feature has different requirements because it's an
+ asynchronous migration (source and destination not migrating at the
+ same time). For that feature, only the sync between the channels is
+ relevant to prevent cleanup to happen before data is completely
+ written to (or read from) the migration file.
+
+Data transformation
+-------------------
+
+The ``MultiFDMethods`` structure defines callbacks that allow the
+client code to perform operations on the data at key points. These
+operations could be client-specific (e.g. compression), but also
+include a few required steps such as moving data into an iovs. See the
+struct's definition for more detailed documentation.
+
+Historically, the only client for multifd has been the RAM migration,
+so the ``MultiFDMethods`` are pre-registered in two categories,
+compression and no-compression, with the latter being the regular,
+uncompressed ram migration.
+
+Zero page detection
++++++++++++++++++++
+
+The migration without compression has a further specificity of
+possibly doing zero page detection. It involves doing the detection of
+a zero page directly in the multifd channels instead of beforehand on
+the main migration thread (as it's been done in the past). This is the
+default behavior and can be disabled with:
+
+ ``migrate_set_parameter zero-page-detection legacy``
+
+or to disable zero page detection completely:
+
+ ``migrate_set_parameter zero-page-detection none``
+
+Error handling
+--------------
+
+Any part of multifd code can be made to exit by setting the
+``exiting`` atomic flag of the multifd state. Whenever a multifd
+channel has an error, it should break out of its loop, set the flag to
+indicate other channels to exit as well and set the migration error
+with ``migrate_set_error()``.
+
+For clean exiting (triggered from outside the channels), the
+``multifd_send|recv_terminate_threads()`` functions set the
+``exiting`` flag and additionally release any channels that may be
+idle or waiting for a sync.
+
+Code structure
+--------------
+
+Multifd code is divided into:
+
+The main file containing the core routines
+
+- multifd.c
+
+RAM migration
+
+- multifd-nocomp.c (nocomp, for "no compression")
+- multifd-zero-page.c
+- ram.c (also involved in non-multifd migrations & snapshots)
+
+Compressors
+
+- multifd-uadk.c
+- multifd-qatzip.c
+- multifd-zlib.c
+- multifd-qpl.c
+- multifd-zstd.c
--
2.35.3
Hello Fabiano, * First big thank you for starting/writing this document. It is a great resource. On Fri, 7 Mar 2025 at 19:13, Fabiano Rosas <farosas@suse.de> wrote: > +++ b/docs/devel/migration/multifd.rst > @@ -0,0 +1,254 @@ > +Multifd > +Multifd is the name given for the migration capability that enables > +data transfer using multiple threads. Multifd supports all the > +transport types currently in use with migration (inet, unix, vsock, > +fd, file). * Multifd is Multiple File Descriptors, right? Ie. Does it work with one thread but multiple file descriptors? OR one thread per file descriptor is always the case? I have not used/tried 'multifd + file://' migration, but I imagined there one thread might be able to read/write to multiple file descriptors at a time. > +Usage > +----- > + > +On both source and destination, enable the ``multifd`` capability: > + > + ``migrate_set_capability multifd on`` > + > +Define a number of channels to use (default is 2, but 8 usually > +provides best performance). > + > + ``migrate_set_parameter multifd-channels 8`` > + * I get that this is a QEMU documentation, but for users/reader's convenience it'll help to point to libvirt:virsh migrate usage here -> https://www.libvirt.org/manpages/virsh.html#migrate , just as an alternative. Because doing migration via QMP commands is not as straightforward, I wonder who might do that and why. > +Restrictions > +------------ > + > +For migration to a file, support is conditional on the presence of the > +mapped-ram capability, see `mapped-ram`. > + > +Snapshots are currently not supported. * Maybe: Sanpshot using multiple threads (multifd) is not supported. > +`postcopy` migration is currently not supported. * Maybe - 'postcopy' migration using multiple threads (multifd) is not supported. ie. 'postcopy' uses a single thread to transfer migration data. * Reason for these suggestions: as a writer it is easy to think everything written in this page is to be taken with multifd context, but readers may not do that, they may take sentences in isolation. (just sharing thoughts) > +Multifd consists of: > + > +- A client that produces the data on the migration source side and > + consumes it on the destination. Currently the main client code is > + ram.c, which selects the RAM pages for migration; * So multifd mechanism can be used to transfer non-ram data as well? I thought it's only used for RAM migration. Are device/gpu states etc bits also transferred via multifd threads? > +- A packet which is the final result of all the data aggregation > + and/or transformation. The packet contains: a *header* with magic and > + version numbers and flags that inform of special processing needed > + on the destination; a *payload-specific header* with metadata referent > + to the packet's data portion, e.g. page counts; and a variable-size > + *data portion* which contains the actual opaque payload data. * It'll help to define the exact packet format here. Like they do in RFCs. Thank you for writing this. --- - Prasad
Prasad Pandit <ppandit@redhat.com> writes: > Hello Fabiano, > > * First big thank you for starting/writing this document. It is a > great resource. > > On Fri, 7 Mar 2025 at 19:13, Fabiano Rosas <farosas@suse.de> wrote: >> +++ b/docs/devel/migration/multifd.rst >> @@ -0,0 +1,254 @@ >> +Multifd >> +Multifd is the name given for the migration capability that enables >> +data transfer using multiple threads. Multifd supports all the >> +transport types currently in use with migration (inet, unix, vsock, >> +fd, file). > > * Multifd is Multiple File Descriptors, right? Ie. Does it work with > one thread but multiple file descriptors? OR one thread per file > descriptor is always the case? I have not used/tried 'multifd + > file://' migration, but I imagined there one thread might be able to > read/write to multiple file descriptors at a time. > Technically both can happen. But that would just be the case of file:fdset migration which requires an extra fd for O_DIRECT. So "multiple" in the usual sense of "more is better" is only fd-per-thread. IOW, using multiple fds is an implementation detail IMO, what people really care about is medium saturation, which we can only get (with multifd) via parallelization. >> +Usage >> +----- >> + >> +On both source and destination, enable the ``multifd`` capability: >> + >> + ``migrate_set_capability multifd on`` >> + >> +Define a number of channels to use (default is 2, but 8 usually >> +provides best performance). >> + >> + ``migrate_set_parameter multifd-channels 8`` >> + > > * I get that this is a QEMU documentation, but for users/reader's > convenience it'll help to point to libvirt:virsh migrate usage here -> > https://www.libvirt.org/manpages/virsh.html#migrate , just as an > alternative. AFAIK, we tend to not do that in QEMU docs. > Because doing migration via QMP commands is not as > straightforward, I wonder who might do that and why. > All of QEMU developers, libvirt developers, cloud software developers, kernel developers etc. > >> +Restrictions >> +------------ >> + >> +For migration to a file, support is conditional on the presence of the >> +mapped-ram capability, see `mapped-ram`. >> + >> +Snapshots are currently not supported. > > * Maybe: Sanpshot using multiple threads (multifd) is not supported. > >> +`postcopy` migration is currently not supported. > > * Maybe - 'postcopy' migration using multiple threads (multifd) is not > supported. ie. 'postcopy' uses a single thread to transfer migration > data. > > * Reason for these suggestions: as a writer it is easy to think > everything written in this page is to be taken with multifd context, > but readers may not do that, they may take sentences in isolation. > (just sharing thoughts) > Sure, I can expand on those. >> +Multifd consists of: >> + >> +- A client that produces the data on the migration source side and >> + consumes it on the destination. Currently the main client code is >> + ram.c, which selects the RAM pages for migration; > > * So multifd mechanism can be used to transfer non-ram data as well? I > thought it's only used for RAM migration. Are device/gpu states etc > bits also transferred via multifd threads? > device state migration with multifd has been merged for 10.0 <rant> If it were up to me, we'd have a pool of multifd threads that transmit everything migration-related. Unfortunately, that's not so straight-forward to implement without rewriting a lot of code, multifd requires too much entanglement from the data producer. We're constantly dealing with details of data transmission getting in the way of data production/consumption (e.g. try to change ram.c to produce multiple pages at once and watch everyting explode). I've been experimenting with a MultiFDIov payload type to allow separation between the data type handling details and multifd inner workings. However in order for that to be useful we'd need to have a sync that doesn't depend on control data on the main migration thread. That's why I've been asking about a multifd-only sync with Peter in the other thread. There's a bunch of other issues as well: - no clear distinction between what should go in the header and what should go in the packet. - the header taking up one slot in the iov, which should in theory be responsibility of the client - the whole multifd_ops situation which doesn't allow a clear interface between multifd and client - the lack of uniformity between send/recv in regards to doing I/O from multifd code or from client code - the recv having two different modes of operation, socket and file the list goes on... </rant> >> +- A packet which is the final result of all the data aggregation >> + and/or transformation. The packet contains: a *header* with magic and >> + version numbers and flags that inform of special processing needed >> + on the destination; a *payload-specific header* with metadata referent >> + to the packet's data portion, e.g. page counts; and a variable-size >> + *data portion* which contains the actual opaque payload data. > > * It'll help to define the exact packet format here. Like they do in RFCs. I'll try to produce some ascii art. > > Thank you for writing this. > --- > - Prasad
On Thu, 20 Mar 2025 at 20:15, Fabiano Rosas <farosas@suse.de> wrote:
> Technically both can happen. But that would just be the case of
> file:fdset migration which requires an extra fd for O_DIRECT. So
> "multiple" in the usual sense of "more is better" is only
> fd-per-thread. IOW, using multiple fds is an implementation detail IMO,
> what people really care about is medium saturation, which we can only
> get (with multifd) via parallelization.
* I see. Multifd is essentially multiple threads = thread pool then.
> > Because doing migration via QMP commands is not as
> > straightforward, I wonder who might do that and why.
> >
>
> All of QEMU developers, libvirt developers, cloud software developers,
> kernel developers etc.
* Really? That must be using QMP apis via libvirt/virsh kind of tools
I guess. Otherwise how does one follow above instructions to enable
'multifd' and set number of channels on both source and destination
machines? User has to open QMP shell on two machines and invoke QMP
commands?
> > * So multifd mechanism can be used to transfer non-ram data as well? I
> > thought it's only used for RAM migration. Are device/gpu states etc
> > bits also transferred via multifd threads?
> >
> device state migration with multifd has been merged for 10.0
>
> <rant>
> If it were up to me, we'd have a pool of multifd threads that transmit
> everything migration-related.
* Same my thought: If multifd is to be used for all data, why not use
the existing QEMU thread pool OR make it a migration thread pool.
IIRC, there is also some discussion about having a thread pool for
VFIO or GPU state transfer. Having so many different thread pools does
not seem right.
> Unfortunately, that's not so
> straight-forward to implement without rewriting a lot of code, multifd
> requires too much entanglement from the data producer. We're constantly
> dealing with details of data transmission getting in the way of data
> production/consumption (e.g. try to change ram.c to produce multiple
> pages at once and watch everyting explode).
* Ideally there should be separation between what the client is doing
and how migration is working.
* IMO, migration is a mechanism to transfer byte streams from one
machine to another. And while doing so, facilitate writing (data) at
specific addresses/offsets on the destination, not just append bytes
at the tail end. This entails that each individual migration packet
specifies where to write data on the destination. Let's say a
migration stream is a train of packets. Each packet has a header and
data.
( [header][...data...] )><><( [header][...data...] )><><(
[header][data] )><>< ... ><><( [header][data] )
Header specifies:
- Serial number
- Header length
- Data length/size (2MB/4MB/8MB etc.)
- Destination address <- offset where to write migration data, if
it is zero(0) append that data
- Data type (optional): Whether it is RAM/Device/GPU/CPU state etc.
- Data iteration number <- version/iteration of the same RAM page
... more variables
... more variables
- Some reserved bytes
Migration data is:
- Just a data byte stream <= Data length/size above.
* Such a train of packets is then transferred via 1 thread or 10
threads is an operational change.
* Such a packet is pushed (Precopy) from source to destination OR
pulled (Postcopy) by destination from the source side is an
operational difference. In Postcopy phase, it could send a message
saying I need the next RAM packet for this offset and RAM module on
the source side provides only relevant data. Again packaging and
transmission is done by the migration module. Similarly the Postcopy
phase could send a message saying I need the next GPU packet, and the
GPU module on the source side would provide relevant data.
* How long such a train of packets is, is also immaterial.
* With such a separation, things like synchronisation of threads is
not connected to the data (RAM/GPU/CPU/etc.) type.
* It may also allow us to apply compression/encryption uniformly
across all channels/threads, irrespective of the data type.
* Since migration is a packet transport mechanism,
creation/modification/destruction of packets could be done by one
entity. Clients (like RAM/GPU/CPU/VFIO etc.) shall only supply 'data'
to be packaged and sent. It shouldn't be like RAM.c writes its own
pakcets as they like, GPU.c writes their own packets as they like,
that does not seem right.
>> +- A packet which is the final result of all the data aggregation
> >> + and/or transformation. The packet contains: a *header* with magic and
> >> + version numbers and flags that inform of special processing needed
> >> + on the destination; a *payload-specific header* with metadata referent
> >> + to the packet's data portion, e.g. page counts; and a variable-size
> >> + *data portion* which contains the actual opaque payload data.
* Thread synchronisation and other such control messages could/should
be a separate packets of its own, to be sent on the main channel.
Thread synchronisation flags could/should not be combined with the
migration data packets above. Control message packets may have _no
data_ to be processed. (just sharing thoughts)
Thank you.
---
- Prasad
Prasad Pandit <ppandit@redhat.com> writes:
> On Thu, 20 Mar 2025 at 20:15, Fabiano Rosas <farosas@suse.de> wrote:
>> Technically both can happen. But that would just be the case of
>> file:fdset migration which requires an extra fd for O_DIRECT. So
>> "multiple" in the usual sense of "more is better" is only
>> fd-per-thread. IOW, using multiple fds is an implementation detail IMO,
>> what people really care about is medium saturation, which we can only
>> get (with multifd) via parallelization.
>
> * I see. Multifd is essentially multiple threads = thread pool then.
>
Yes, that's what I'm trying to convey with the first
sentence. Specifically to dispel any misconceptions that this is
something esoteric. It's not. We're just using multiple threads with
some custom locking and some callbacks.
`migrate-set-capability
multiple-threads-with-some-custom-locking-and-some-callbacks true`
...doesn't work that well =)
>> > Because doing migration via QMP commands is not as
>> > straightforward, I wonder who might do that and why.
>> >
>>
>> All of QEMU developers, libvirt developers, cloud software developers,
>> kernel developers etc.
>
> * Really? That must be using QMP apis via libvirt/virsh kind of tools
> I guess. Otherwise how does one follow above instructions to enable
> 'multifd' and set number of channels on both source and destination
> machines? User has to open QMP shell on two machines and invoke QMP
> commands?
>
Well, I can't speak for everyone, of course, but generally the less
layers on top of the object of your work the better. I don't even have
libvirt installed on my development machine for instance.
It's convenient to deal directly with QEMU command line and QMP because
that usually gives you a faster turnaround when experimenting with
various command lines/commands.
There are also setups that don't want to bring in too many dependencies,
so having a full libvirt installation is not wanted. There's a bunch of
little tools out there that invoke QEMU and give it QMP commands
directly.
There are several ways of accessing QMP, some examples I have lying
around:
==
$QEMU ... -qmp unix:${SRC_SOCK},server,wait=off
echo "
{ 'execute': 'qmp_capabilities' }
{ 'execute': 'migrate-set-capabilities','arguments':{ 'capabilities':[ \
{ 'capability': 'mapped-ram', 'state': true }, \
{ 'capability': 'multifd', 'state': true } \
] } }
{ 'execute': 'migrate-set-parameters','arguments':{ 'multifd-channels': 8 } }
{ 'execute': 'migrate-set-parameters','arguments':{ 'max-bandwidth': 0 } }
{ 'execute': 'migrate-set-parameters','arguments':{ 'direct-io': true } }
{ 'execute': 'migrate${incoming}','arguments':{ 'uri': 'file:$MIGFILE' } }
" | nc -NU $SRC_SOCK
==
(echo "migrate_set_capability x-ignore-shared on"; echo
"migrate_set_capability validate-uuid on"; echo "migrate
exec:cat>migfile-s390x"; echo "quit") | ./qemu-system-s390x -bios
/tmp/migration-test-16K1Z2/bootsect -monitor stdio
==
$QEMU ... -qmp unix:${DST_SOCK},server,wait=off
./qemu/scripts/qmp/qmp-shell $DST_SOCK
==
$QEMU ...
C-a c
(qemu) info migrate
>> > * So multifd mechanism can be used to transfer non-ram data as well? I
>> > thought it's only used for RAM migration. Are device/gpu states etc
>> > bits also transferred via multifd threads?
>> >
>> device state migration with multifd has been merged for 10.0
>>
>> <rant>
>> If it were up to me, we'd have a pool of multifd threads that transmit
>> everything migration-related.
>
> * Same my thought: If multifd is to be used for all data, why not use
> the existing QEMU thread pool OR make it a migration thread pool.
> IIRC, there is also some discussion about having a thread pool for
> VFIO or GPU state transfer. Having so many different thread pools does
> not seem right.
>
To be clear, multifd is not meant to transfer all data. It was designed
to transfer RAM pages and later got extended to deal with VFIO device
state. It _could_ be further extended for all device states (vmstate)
and it _could_ be further extended to handle control messages from the
main migration thread (QEMU_VM_*, postcopy commands, etc). My opinion is
that it would be interesting to have this kind of flexibility (at some
point). But it might turn out that it doesn't make sense technically,
it's costly in terms of development time, etc.
I think we all agree that having different sets of threads managed in
different ways is not ideal. The thing with multifd is that it's very
important to keep the performance and constraints of ram migration. If
we manage to achieve that with some generic thread pool, that's
great. But it's an exploration work that will have to be done.
>> Unfortunately, that's not so
>> straight-forward to implement without rewriting a lot of code, multifd
>> requires too much entanglement from the data producer. We're constantly
>> dealing with details of data transmission getting in the way of data
>> production/consumption (e.g. try to change ram.c to produce multiple
>> pages at once and watch everyting explode).
>
> * Ideally there should be separation between what the client is doing
> and how migration is working.
>
> * IMO, migration is a mechanism to transfer byte streams from one
> machine to another. And while doing so, facilitate writing (data) at
> specific addresses/offsets on the destination, not just append bytes
> at the tail end. This entails that each individual migration packet
> specifies where to write data on the destination. Let's say a
> migration stream is a train of packets. Each packet has a header and
> data.
>
> ( [header][...data...] )><><( [header][...data...] )><><(
> [header][data] )><>< ... ><><( [header][data] )
>
But then there's stuff like mapped-ram which wants its data free of any
metadata because it mirrors the RAM layout in the migration file.
> Header specifies:
> - Serial number
> - Header length
> - Data length/size (2MB/4MB/8MB etc.)
I generally like the idea of having the size of the header/data
specified in the header itself. It does seem like it would allow for
better extensibility over time. I spent a lot of time looking at those
"unused" bytes in MultiFDPacket_t trying to figure out a way of
embedding the size information in a backward-compatible way. We ended up
going with Maciej's idea of isolating the common parts of the header in
the MultiFDPacketHdr_t and having each data type define it's own
specific sub-header.
I don't know how this looks like in terms of type-safety and how we'd
keep compatibility (two separate issues) because a variable-size header
needs to end up in a well-defined structure at some point. It's
generally more difficult to maintain code that simply takes a buffer and
pokes at random offsets in there.
Even with the length, an old QEMU would still not know about extra
fields.
> - Destination address <- offset where to write migration data, if
> it is zero(0) append that data
> - Data type (optional): Whether it is RAM/Device/GPU/CPU state etc.
> - Data iteration number <- version/iteration of the same RAM page
> ... more variables
> ... more variables
This is all in the end client-centric, which means it is "data" from the
migration perspective. So the question I put earlier still remains, what
determines the kind of data that goes in the header and the kind of data
that goes in the data part of the packet? It seems we cannot escape from
having the client bring it's own header format.
> - Some reserved bytes
> Migration data is:
> - Just a data byte stream <= Data length/size above.
>
> * Such a train of packets is then transferred via 1 thread or 10
> threads is an operational change.
> * Such a packet is pushed (Precopy) from source to destination OR
> pulled (Postcopy) by destination from the source side is an
> operational difference. In Postcopy phase, it could send a message
> saying I need the next RAM packet for this offset and RAM module on
> the source side provides only relevant data. Again packaging and
> transmission is done by the migration module. Similarly the Postcopy
> phase could send a message saying I need the next GPU packet, and the
> GPU module on the source side would provide relevant data.
> * How long such a train of packets is, is also immaterial.
> * With such a separation, things like synchronisation of threads is
> not connected to the data (RAM/GPU/CPU/etc.) type.
> * It may also allow us to apply compression/encryption uniformly
> across all channels/threads, irrespective of the data type.
> * Since migration is a packet transport mechanism,
> creation/modification/destruction of packets could be done by one
> entity. Clients (like RAM/GPU/CPU/VFIO etc.) shall only supply 'data'
> to be packaged and sent. It shouldn't be like RAM.c writes its own
> pakcets as they like, GPU.c writes their own packets as they like,
> that does not seem right.
>
Right, so we'd need an extra abstraction layer with a well defined
interface to convert a raw packet into something that's useful for the
clients. The vmstate macros actually do that work kind of well. A device
emulation code does not need to care (too much) about how migration
works as long as the vmstate is written properly.
> >> +- A packet which is the final result of all the data aggregation
>> >> + and/or transformation. The packet contains: a *header* with magic and
>> >> + version numbers and flags that inform of special processing needed
>> >> + on the destination; a *payload-specific header* with metadata referent
>> >> + to the packet's data portion, e.g. page counts; and a variable-size
>> >> + *data portion* which contains the actual opaque payload data.
>
> * Thread synchronisation and other such control messages could/should
> be a separate packets of its own, to be sent on the main channel.
Remember that currently the control data is put raw on the stream, it is
not encapsulated by a packet. This would increase the amount of data put
on the stream, which might affect throughput.
> Thread synchronisation flags could/should not be combined with the
> migration data packets above. Control message packets may have _no
> data_ to be processed. (just sharing thoughts)
>
Yeah, the MULTIFD_FLAG_SYNC used to be part of a data packet and it was
utterly confusing to debug sync issues like that. Peter did the work to
make it a standalone (no data) packet.
> Thank you.
> ---
> - Prasad
Hi,
On Fri, 21 Mar 2025 at 19:34, Fabiano Rosas <farosas@suse.de> wrote:
> Well, I can't speak for everyone, of course, but generally the less
> layers on top of the object of your work the better.
* Yes, true.
> There are several ways of accessing QMP, some examples I have lying
> around:
>
> ==
> $QEMU ... -qmp unix:${SRC_SOCK},server,wait=off
>
> echo "
> { 'execute': 'qmp_capabilities' }
> { 'execute': 'migrate-set-capabilities','arguments':{ 'capabilities':[ \
> { 'capability': 'mapped-ram', 'state': true }, \
> { 'capability': 'multifd', 'state': true } \
> ] } }
> { 'execute': 'migrate-set-parameters','arguments':{ 'multifd-channels': 8 } }
> { 'execute': 'migrate-set-parameters','arguments':{ 'max-bandwidth': 0 } }
> { 'execute': 'migrate-set-parameters','arguments':{ 'direct-io': true } }
> { 'execute': 'migrate${incoming}','arguments':{ 'uri': 'file:$MIGFILE' } }
> " | nc -NU $SRC_SOCK
> ==
> (echo "migrate_set_capability x-ignore-shared on"; echo
> "migrate_set_capability validate-uuid on"; echo "migrate
> exec:cat>migfile-s390x"; echo "quit") | ./qemu-system-s390x -bios
> /tmp/migration-test-16K1Z2/bootsect -monitor stdio
> ==
> $QEMU ... -qmp unix:${DST_SOCK},server,wait=off
> ./qemu/scripts/qmp/qmp-shell $DST_SOCK
> ==
> $QEMU ...
> C-a c
> (qemu) info migrate
* Interesting. Initially I tried enabling multifd on two hosts and
setting multifd channels via QMP, but then quickly moved to virsh(1)
for its convenience.
> I think we all agree that having different sets of threads managed in
> different ways, is not ideal.
* Yes.
> The thing with multifd is that it's very
> important to keep the performance and constraints of ram migration. If
> we manage to achieve that with some generic thread pool, that's
> great. But it's an exploration work that will have to be done.
* Yes.
> >> Unfortunately, that's not so
> >> straight-forward to implement without rewriting a lot of code, multifd
> >> requires too much entanglement from the data producer. We're constantly
> >> dealing with details of data transmission getting in the way of data
> >> production/consumption (e.g. try to change ram.c to produce multiple
> >> pages at once and watch everyting explode).
* Hmmn, I think that's where a clear separation between migration and
client could help.
> But then there's stuff like mapped-ram which wants its data free of any
> metadata because it mirrors the RAM layout in the migration file.
* I'm not sure how it works now OR why it works that way. But shall
have a look at it.
> I generally like the idea of having the size of the header/data
> specified in the header itself. It does seem like it would allow for
> better extensibility over time. I spent a lot of time looking at those
> "unused" bytes in MultiFDPacket_t trying to figure out a way of
> embedding the size information in a backward-compatible way. We ended up
> going with Maciej's idea of isolating the common parts of the header in
> the MultiFDPacketHdr_t and having each data type define it's own
> specific sub-header.
>
> I don't know how this looks like in terms of type-safety and how we'd
> keep compatibility (two separate issues) because a variable-size header
> needs to end up in a well-defined structure at some point. It's
> generally more difficult to maintain code that simply takes a buffer and
> pokes at random offsets in there.
* I'm not sure if the header size would vary as much.
> This is all in the end client-centric, which means it is "data" from the
> migration perspective. So the question I put earlier still remains, what
> determines the kind of data that goes in the header and the kind of data
> that goes in the data part of the packet? It seems we cannot escape from
> having the client bring it's own header format.
* Yes. Actually that begs the question - Why do we need the migration
and client headers? The answer might differ based on how we look at
things.
Guest State
|
+---pcie-root-ports[0..15] -> [0...2GB]
|
+---Timer -> [0...1MB]
|
+---Audio -> [0...1MB]
|
+---Block -> [0...2GB]
|
+---RAM -> [0...128GB]
|
+---DirtyBitmap -> [0...4GB]
|
+---GPU -> [0...128GB]
|
+---CPUs[0...31] -> [0...8GB]
...
|
+ (above numbers are for example only)
[Host-1: Guest State] <== [Migration] ==> [Host-2: Guest State]
* Migration should create this same 'Guest State' tree on the
destination Host-2 as:
0. Whole guest state (vmstate) is a list of different nodes with
their own state as above.
1. Migration core on the source side would iterate over these
nodes to call the respective *_read() function to read their 'state'.
2. Migration core would transmit the read 'state' (as 2MB/4MB
data blocks) to the destination.
3. On the destination side - Migration core needs to know where
to store/write/deliver the received 2MB/4MB data blocks.
- This is where the migration header would help, to identify
which *_write() function to call.
4. The respective *_write() function would then write (or
overwrite) the received block at the specified 'offset' within its
state.
* Let's say 'Migration Core' is similar to the TCP layer. Irrespective
of the application protocol (ftp/http/ssh/ncat(1)) TCP behaves the
same. TCP layer identifies where to deliver received data by its port
and connection numbers. TCP layer does not care which program is
running at a given port. Similarly 'Migration core' could read data
from Host-1 and write/deliver it on the Host-2, irrespective of
whether it is a RAM or GPU or any other state block.
* To answer the question, what goes in the header part: is the minimum
information required to identify where to write/deliver the data
block. As with application protocols, that information could be
embedded in the data block itself as well. In which case migration
header may not be required OR it may store bits related to the threads
OR bandwidth control OR accounting etc. depending on their purpose.
* Migration core could:
- Create/destroy threads to transmit data
- Apply compression/decompression on the data
- Apply encryption/decryption on the data
- Apply bandwidth control/limits across all threads while transmitting data
> Right, so we'd need an extra abstraction layer with a well defined
> interface to convert a raw packet into something that's useful for the
> clients. The vmstate macros actually do that work kind of well. A device
> emulation code does not need to care (too much) about how migration
> works as long as the vmstate is written properly.
* Yes.
Thank you.
---
- Prasad
On Thu, Mar 20, 2025 at 11:45:29AM -0300, Fabiano Rosas wrote: > There's a bunch of other issues as well: > > - no clear distinction between what should go in the header and what > should go in the packet. > > - the header taking up one slot in the iov, which should in theory be > responsibility of the client > > - the whole multifd_ops situation which doesn't allow a clear interface > between multifd and client > > - the lack of uniformity between send/recv in regards to doing I/O from > multifd code or from client code > > - the recv having two different modes of operation, socket and file I can't say I know the answer of all of them, but to me the last one is kind of by design - obviously the old multifd was designed to be more or less event driven on dest side, but that doesn't play well on files. To be fair, I didn't invent multifd, but IMHO Juan did a great job designing it from scratch, at least it has a bunch of benefits comparing to the old protocol especially on perf side (even though when initially proposed I was not a fan of how the locking was designed.. but it should be much easier to understand after previous refactors). And just to say, we can change the code or protocol in whatever way we want if that could make it better. So instead of the rant (which is still welcomed whenever you feel like :), we can go for whatever you see fit with compat properties (and if with a handshake, that's even less of a concern). -- Peter Xu
Peter Xu <peterx@redhat.com> writes: > On Thu, Mar 20, 2025 at 11:45:29AM -0300, Fabiano Rosas wrote: >> There's a bunch of other issues as well: >> >> - no clear distinction between what should go in the header and what >> should go in the packet. >> >> - the header taking up one slot in the iov, which should in theory be >> responsibility of the client >> >> - the whole multifd_ops situation which doesn't allow a clear interface >> between multifd and client >> >> - the lack of uniformity between send/recv in regards to doing I/O from >> multifd code or from client code >> >> - the recv having two different modes of operation, socket and file > > I can't say I know the answer of all of them, but to me the last one is > kind of by design - obviously the old multifd was designed to be more or > less event driven on dest side, but that doesn't play well on files. > Yes, it's entirely by design. But it does create an extra hurdle in the end. The event driven model requires metadata to inform the IO thread what to do with the data collected (write to RAM at address X). The other model doesn't require that as it has a payload already included, so it just populates the fields. The problem is that we tied file migration with !packets, but if we want to use iovs all around, we'd still want packets (due to magic and version) although it'd be way easier to collect the actual data in MultiFDRecvData instead of passing information through the header and then doing the work at ->recv(). > To be fair, I didn't invent multifd, but IMHO Juan did a great job > designing it from scratch, at least it has a bunch of benefits > comparing to the old protocol especially on perf side (even though > when initially proposed I was not a fan of how the locking was > designed.. but it should be much easier to understand after previous > refactors). > Good point. My rant means no demerit at all to the current design, I'm just objectively pointing out the parts I think are getting in the way. > And just to say, we can change the code or protocol in whatever way we want > if that could make it better. So instead of the rant (which is still > welcomed whenever you feel like :), we can go for whatever you see fit with > compat properties (and if with a handshake, that's even less of a concern). Point taken (on the HS as well).
On Fri, Mar 07, 2025 at 10:42:02AM -0300, Fabiano Rosas wrote:
> There's currently no documentation for multifd, we can at least
> provide an overview of the feature.
We missed this for a long time indeed..
>
> Signed-off-by: Fabiano Rosas <farosas@suse.de>
> ---
> Keep in mind the feature grew organically over the years and it has
> had bugs that required reinventing some concepts, specially on the
> sync part, so there's still some amount of inconsistency in the code
> and that's not going to be fixed by documentation.
> ---
> docs/devel/migration/features.rst | 1 +
> docs/devel/migration/multifd.rst | 254 ++++++++++++++++++++++++++++++
> 2 files changed, 255 insertions(+)
> create mode 100644 docs/devel/migration/multifd.rst
>
> diff --git a/docs/devel/migration/features.rst b/docs/devel/migration/features.rst
> index 8f431d52f9..249d653124 100644
> --- a/docs/devel/migration/features.rst
> +++ b/docs/devel/migration/features.rst
> @@ -15,3 +15,4 @@ Migration has plenty of features to support different use cases.
> qpl-compression
> uadk-compression
> qatzip-compression
> + multifd
Considering that it's one of the main features (e.g. all compressors above
are only sub-features of multifd), we could move this upper, maybe even the
1st one.
> diff --git a/docs/devel/migration/multifd.rst b/docs/devel/migration/multifd.rst
> new file mode 100644
> index 0000000000..8f5ec840cb
> --- /dev/null
> +++ b/docs/devel/migration/multifd.rst
> @@ -0,0 +1,254 @@
> +Multifd
> +=======
> +
> +Multifd is the name given for the migration capability that enables
> +data transfer using multiple threads. Multifd supports all the
> +transport types currently in use with migration (inet, unix, vsock,
> +fd, file).
I never tried vsock, would it be used in any use case?
It seems to be introduced by accident in 72a8192e225cea, but I'm not sure.
Maybe there's something I missed.
If we don't plan to obsolete rdma, we may also want to mention it.. in
which case it doesn't support multifd.
> +
> +Usage
> +-----
> +
> +On both source and destination, enable the ``multifd`` capability:
> +
> + ``migrate_set_capability multifd on``
> +
> +Define a number of channels to use (default is 2, but 8 usually
> +provides best performance).
> +
> + ``migrate_set_parameter multifd-channels 8``
> +
> +Restrictions
> +------------
> +
> +For migration to a file, support is conditional on the presence of the
> +mapped-ram capability, see `mapped-ram`.
> +
> +Snapshots are currently not supported.
> +
> +`postcopy` migration is currently not supported.
> +
> +Components
> +----------
> +
> +Multifd consists of:
> +
> +- A client that produces the data on the migration source side and
> + consumes it on the destination. Currently the main client code is
> + ram.c, which selects the RAM pages for migration;
> +
> +- A shared data structure (``MultiFDSendData``), used to transfer data
> + between multifd and the client. On the source side, this structure
> + is further subdivided into payload types (``MultiFDPayload``);
> +
> +- An API operating on the shared data structure to allow the client
s/An API/A set of APIs/
> + code to interact with multifd;
> +
> + - ``multifd_send/recv()``: Transfers work to/from the channels.
> +
> + - ``multifd_*payload_*`` and ``MultiFDPayloadType``: Support
> + defining an opaque payload. The payload is always wrapped by
> + ``MultiFD*Data``.
> +
> + - ``multifd_send_data_*``: Used to manage the memory for the shared
> + data structure.
> +
> + - ``multifd_*_sync_main()``: See :ref:`synchronization` below.
When in doc, it might be helpful to list exact function names without
asterisks, so that people can grep for them when reading.
> +
> +- A set of threads (aka channels, due to a 1:1 mapping to QIOChannels)
> + responsible for doing I/O. Each multifd channel supports callbacks
> + (``MultiFDMethods``) that can be used for fine-grained processing of
> + the payload, such as compression and zero page detection.
> +
> +- A packet which is the final result of all the data aggregation
> + and/or transformation. The packet contains: a *header* with magic and
> + version numbers and flags that inform of special processing needed
> + on the destination; a *payload-specific header* with metadata referent
> + to the packet's data portion, e.g. page counts; and a variable-size
> + *data portion* which contains the actual opaque payload data.
> +
> + Note that due to historical reasons, the terminology around multifd
> + packets is inconsistent.
> +
> + The `mapped-ram` feature ignores packets entirely.
If above "packet" section does not cover mapped-ram, while mapped-ram is
part of multifd, maybe it means we should reword it?
One option is we drop above paragraph completely, but enrich the previous
section ("A set of threads.."), with:
... such as compression and zero page detection. Multifd threads can
dump the results to different targets. For socket-based URIs, the data
will be queued to the socket with multifd specific headers. For
file-based URIs, the data may be applied directly on top of the target
file at specific offset.
Optionally, we may have another separate section to explain the socket
headers. If so, we could have the header definition directly, and explain
the fields. Might be more straightforward too.
> +
> +Operation
> +---------
> +
> +The multifd channels operate in parallel with the main migration
> +thread. The transfer of data from a client code into multifd happens
> +from the main migration thread using the multifd API.
> +
> +The interaction between the client code and the multifd channels
> +happens in the ``multifd_send()`` and ``multifd_recv()``
> +methods. These are reponsible for selecting the next idle channel and
> +making the shared data structure containing the payload accessible to
> +that channel. The client code receives back an empty object which it
> +then uses for the next iteration of data transfer.
> +
> +The selection of idle channels is simply a round-robin over the idle
> +channels (``!p->pending_job``). Channels wait at a semaphore and once
> +a channel is released it starts operating on the data immediately.
The sender side is always like this indeed. For the recv side (and since
you also mentioned it above), multifd treats it differently based on socket
or file based. Maybe we should also discuss socket-based?
Something like this?
Multifd receive side relies on a proper ``MultiFDMethods.recv()`` method
provided by the consumer of the pages to know how to load the pages. The
recv threads can work in different ways depending on the channel type.
For socket-based channels, multifd recv side is almost event-driven.
Each multifd recv threads will be blocked reading the channels until a
complete multifd packet header is received. With that, pages are loaded
as they arrive on the ports with the ``MultiFDMethods.recv()`` method
provided by the client, so as to post-process the data received.
For file-based channels, multifd recv side works slightly differently.
It works more like the sender side, that client can queue requests to
multifd recv threads to load specific portion of file into corresponding
portion of RAMs. The ``MultiFDMethods.recv()`` in this case simply
always executes the load operation from file as requested.
Feel free to take all or none. You can also mention it after the next
paragraph on "client-specific handling". Anyway, some mentioning of
event-driven model used in socket channels would be nice.
> +
> +Aside from eventually transmitting the data over the underlying
> +QIOChannel, a channel's operation also includes calling back to the
> +client code at pre-determined points to allow for client-specific
> +handling such as data transformation (e.g. compression), creation of
> +the packet header and arranging the data into iovs (``struct
> +iovec``). Iovs are the type of data on which the QIOChannel operates.
> +
> +A high-level flow for each thread is:
> +
> +Migration thread:
> +
> +#. Populate shared structure with opaque data (e.g. ram pages)
> +#. Call ``multifd_send()``
> +
> + #. Loop over the channels until one is idle
> + #. Switch pointers between client data and channel data
> + #. Release channel semaphore
> +#. Receive back empty object
> +#. Repeat
> +
> +Multifd thread:
> +
> +#. Channel idle
> +#. Gets released by ``multifd_send()``
> +#. Call ``MultiFDMethods`` methods to fill iov
> +
> + #. Compression may happen
> + #. Zero page detection may happen
> + #. Packet is written
> + #. iov is written
> +#. Pass iov into QIOChannel for transferring (I/O happens here)
> +#. Repeat
> +
> +The destination side operates similarly but with ``multifd_recv()``,
> +decompression instead of compression, etc. One important aspect is
> +that when receiving the data, the iov will contain host virtual
> +addresses, so guest memory is written to directly from multifd
> +threads.
> +
> +About flags
> +-----------
> +The main thread orchestrates the migration by issuing control flags on
> +the migration stream (``QEMU_VM_*``).
> +
> +The main memory is migrated by ram.c and includes specific control
> +flags that are also put on the main migration stream
> +(``RAM_SAVE_FLAG_*``).
> +
> +Multifd has its own set of flags (``MULTIFD_FLAG_*``) that are
> +included into each packet. These may inform about properties such as
> +the compression algorithm used if the data is compressed.
I think I get your intention, on that we have different levels of flags and
maybe it's not easy to know which is which. However since this is multifd
specific doc, from that POV the first two paragraphs may be more suitable
for some more high level doc to me.
Meanwhile, I feel that reading the flag section without a quick packet
header introduction is a tiny little abrupt to readers, as the flag is part
of the packet but it came from nowhere yet. One option is we make this
section "multifd packet header" then introduce all fields quickly including
the flags. If you like keeping this it's ok too, we can work on top.
> +
> +.. _synchronization:
> +
> +Synchronization
> +---------------
> +
> +Data sent through multifd may arrive out of order and with different
> +timing. Some clients may also have synchronization requirements to
> +ensure data consistency, e.g. the RAM migration must ensure that
> +memory pages received by the destination machine are ordered in
> +relation to previous iterations of dirty tracking.
> +
> +Some cleanup tasks such as memory deallocation or error handling may
> +need to happen only after all channels have finished sending/receiving
> +the data.
> +
> +Multifd provides the ``multifd_send_sync_main()`` and
> +``multifd_recv_sync_main()`` helpers to synchronize the main migration
> +thread with the multifd channels. In addition, these helpers also
> +trigger the emission of a sync packet (``MULTIFD_FLAG_SYNC``) which
> +carries the synchronization command to the remote side of the
> +migration.
[1]
> +
> +After the channels have been put into a wait state by the sync
> +functions, the client code may continue to transmit additional data by
> +issuing ``multifd_send()`` once again.
> +
> +Note:
> +
> +- the RAM migration does, effectively, a global synchronization by
> + chaining a call to ``multifd_send_sync_main()`` with the emission of a
> + flag on the main migration channel (``RAM_SAVE_FLAG_MULTIFD_FLUSH``)
... or RAM_SAVE_FLAG_EOS ... depending on the machine type.
Maybe we should also add a sentence on the relationship of
MULTIFD_FLAG_SYNC and RAM_SAVE_FLAG_MULTIFD_FLUSH (or RAM_SAVE_FLAG_EOS ),
in that they should always be sent together, and only if so would it
provide ordering of multifd messages and what happens in the main migration
thread.
Maybe we can attach that sentence at the end of [1].
> + which in turn causes ``multifd_recv_sync_main()`` to be called on the
> + destination.
> +
> + There are also backward compatibility concerns expressed by
> + ``multifd_ram_sync_per_section()`` and
> + ``multifd_ram_sync_per_round()``. See the code for detailed
> + documentation.
> +
> +- the `mapped-ram` feature has different requirements because it's an
> + asynchronous migration (source and destination not migrating at the
> + same time). For that feature, only the sync between the channels is
> + relevant to prevent cleanup to happen before data is completely
> + written to (or read from) the migration file.
> +
> +Data transformation
> +-------------------
> +
> +The ``MultiFDMethods`` structure defines callbacks that allow the
> +client code to perform operations on the data at key points. These
> +operations could be client-specific (e.g. compression), but also
> +include a few required steps such as moving data into an iovs. See the
> +struct's definition for more detailed documentation.
> +
> +Historically, the only client for multifd has been the RAM migration,
> +so the ``MultiFDMethods`` are pre-registered in two categories,
> +compression and no-compression, with the latter being the regular,
> +uncompressed ram migration.
> +
> +Zero page detection
> ++++++++++++++++++++
> +
> +The migration without compression has a further specificity of
Compressors also have zero page detection. E.g.:
multifd_send_zero_page_detect()
<- multifd_send_prepare_common()
<- multifd_zstd_send_prepare()
> +possibly doing zero page detection. It involves doing the detection of
> +a zero page directly in the multifd channels instead of beforehand on
> +the main migration thread (as it's been done in the past). This is the
> +default behavior and can be disabled with:
> +
> + ``migrate_set_parameter zero-page-detection legacy``
> +
> +or to disable zero page detection completely:
> +
> + ``migrate_set_parameter zero-page-detection none``
> +
> +Error handling
> +--------------
> +
> +Any part of multifd code can be made to exit by setting the
> +``exiting`` atomic flag of the multifd state. Whenever a multifd
> +channel has an error, it should break out of its loop, set the flag to
> +indicate other channels to exit as well and set the migration error
> +with ``migrate_set_error()``.
> +
> +For clean exiting (triggered from outside the channels), the
> +``multifd_send|recv_terminate_threads()`` functions set the
> +``exiting`` flag and additionally release any channels that may be
> +idle or waiting for a sync.
> +
> +Code structure
> +--------------
> +
> +Multifd code is divided into:
> +
> +The main file containing the core routines
> +
> +- multifd.c
> +
> +RAM migration
> +
> +- multifd-nocomp.c (nocomp, for "no compression")
> +- multifd-zero-page.c
> +- ram.c (also involved in non-multifd migrations & snapshots)
> +
> +Compressors
> +
> +- multifd-uadk.c
> +- multifd-qatzip.c
> +- multifd-zlib.c
> +- multifd-qpl.c
> +- multifd-zstd.c
> --
> 2.35.3
>
--
Peter Xu
Peter Xu <peterx@redhat.com> writes:
> On Fri, Mar 07, 2025 at 10:42:02AM -0300, Fabiano Rosas wrote:
>> There's currently no documentation for multifd, we can at least
>> provide an overview of the feature.
>
> We missed this for a long time indeed..
>
>>
>> Signed-off-by: Fabiano Rosas <farosas@suse.de>
>> ---
>> Keep in mind the feature grew organically over the years and it has
>> had bugs that required reinventing some concepts, specially on the
>> sync part, so there's still some amount of inconsistency in the code
>> and that's not going to be fixed by documentation.
>> ---
>> docs/devel/migration/features.rst | 1 +
>> docs/devel/migration/multifd.rst | 254 ++++++++++++++++++++++++++++++
>> 2 files changed, 255 insertions(+)
>> create mode 100644 docs/devel/migration/multifd.rst
>>
>> diff --git a/docs/devel/migration/features.rst b/docs/devel/migration/features.rst
>> index 8f431d52f9..249d653124 100644
>> --- a/docs/devel/migration/features.rst
>> +++ b/docs/devel/migration/features.rst
>> @@ -15,3 +15,4 @@ Migration has plenty of features to support different use cases.
>> qpl-compression
>> uadk-compression
>> qatzip-compression
>> + multifd
>
> Considering that it's one of the main features (e.g. all compressors above
> are only sub-features of multifd), we could move this upper, maybe even the
> 1st one.
>
>> diff --git a/docs/devel/migration/multifd.rst b/docs/devel/migration/multifd.rst
>> new file mode 100644
>> index 0000000000..8f5ec840cb
>> --- /dev/null
>> +++ b/docs/devel/migration/multifd.rst
>> @@ -0,0 +1,254 @@
>> +Multifd
>> +=======
>> +
>> +Multifd is the name given for the migration capability that enables
>> +data transfer using multiple threads. Multifd supports all the
>> +transport types currently in use with migration (inet, unix, vsock,
>> +fd, file).
>
> I never tried vsock, would it be used in any use case?
>
I don't know, I'm going by what's in the code.
> It seems to be introduced by accident in 72a8192e225cea, but I'm not sure.
> Maybe there's something I missed.
The code was always had some variation of:
static bool transport_supports_multi_channels(SocketAddress *saddr)
{
return strstart(uri, "tcp:", NULL) || strstart(uri, "unix:", NULL) ||
strstart(uri, "vsock:", NULL);
}
Introduced by b7acd65707 ("migration: allow multifd for socket protocol
only").
> If we don't plan to obsolete rdma, we may also want to mention it.. in
> which case it doesn't support multifd.
>
ok.
>> +
>> +Usage
>> +-----
>> +
>> +On both source and destination, enable the ``multifd`` capability:
>> +
>> + ``migrate_set_capability multifd on``
>> +
>> +Define a number of channels to use (default is 2, but 8 usually
>> +provides best performance).
>> +
>> + ``migrate_set_parameter multifd-channels 8``
>> +
>> +Restrictions
>> +------------
>> +
>> +For migration to a file, support is conditional on the presence of the
>> +mapped-ram capability, see `mapped-ram`.
>> +
>> +Snapshots are currently not supported.
>> +
>> +`postcopy` migration is currently not supported.
>> +
>> +Components
>> +----------
>> +
>> +Multifd consists of:
>> +
>> +- A client that produces the data on the migration source side and
>> + consumes it on the destination. Currently the main client code is
>> + ram.c, which selects the RAM pages for migration;
>> +
>> +- A shared data structure (``MultiFDSendData``), used to transfer data
>> + between multifd and the client. On the source side, this structure
>> + is further subdivided into payload types (``MultiFDPayload``);
>> +
>> +- An API operating on the shared data structure to allow the client
>
> s/An API/A set of APIs/
>
>> + code to interact with multifd;
>> +
>> + - ``multifd_send/recv()``: Transfers work to/from the channels.
>> +
>> + - ``multifd_*payload_*`` and ``MultiFDPayloadType``: Support
>> + defining an opaque payload. The payload is always wrapped by
>> + ``MultiFD*Data``.
>> +
>> + - ``multifd_send_data_*``: Used to manage the memory for the shared
>> + data structure.
>> +
>> + - ``multifd_*_sync_main()``: See :ref:`synchronization` below.
>
> When in doc, it might be helpful to list exact function names without
> asterisks, so that people can grep for them when reading.
>
>> +
>> +- A set of threads (aka channels, due to a 1:1 mapping to QIOChannels)
>> + responsible for doing I/O. Each multifd channel supports callbacks
>> + (``MultiFDMethods``) that can be used for fine-grained processing of
>> + the payload, such as compression and zero page detection.
>> +
>> +- A packet which is the final result of all the data aggregation
>> + and/or transformation. The packet contains: a *header* with magic and
>> + version numbers and flags that inform of special processing needed
>> + on the destination; a *payload-specific header* with metadata referent
>> + to the packet's data portion, e.g. page counts; and a variable-size
>> + *data portion* which contains the actual opaque payload data.
>> +
>> + Note that due to historical reasons, the terminology around multifd
>> + packets is inconsistent.
>> +
>> + The `mapped-ram` feature ignores packets entirely.
>
> If above "packet" section does not cover mapped-ram, while mapped-ram is
> part of multifd, maybe it means we should reword it?
>
I get your point. I just want to clearly point out the places where
mapped-ram is completely different. Maybe some of the suggestions you
made will be enough for that...
> One option is we drop above paragraph completely, but enrich the previous
> section ("A set of threads.."), with:
>
No, the packet is important. Mainly because it's a mess. We should have
*more* information about it on the docs.
> ... such as compression and zero page detection. Multifd threads can
> dump the results to different targets. For socket-based URIs, the data
> will be queued to the socket with multifd specific headers. For
> file-based URIs, the data may be applied directly on top of the target
> file at specific offset.
>
> Optionally, we may have another separate section to explain the socket
> headers. If so, we could have the header definition directly, and explain
> the fields. Might be more straightforward too.
>
Probably, yes.
>> +
>> +Operation
>> +---------
>> +
>> +The multifd channels operate in parallel with the main migration
>> +thread. The transfer of data from a client code into multifd happens
>> +from the main migration thread using the multifd API.
>> +
>> +The interaction between the client code and the multifd channels
>> +happens in the ``multifd_send()`` and ``multifd_recv()``
>> +methods. These are reponsible for selecting the next idle channel and
>> +making the shared data structure containing the payload accessible to
>> +that channel. The client code receives back an empty object which it
>> +then uses for the next iteration of data transfer.
>> +
>> +The selection of idle channels is simply a round-robin over the idle
>> +channels (``!p->pending_job``). Channels wait at a semaphore and once
>> +a channel is released it starts operating on the data immediately.
>
> The sender side is always like this indeed. For the recv side (and since
> you also mentioned it above), multifd treats it differently based on socket
> or file based. Maybe we should also discuss socket-based?
>
> Something like this?
>
> Multifd receive side relies on a proper ``MultiFDMethods.recv()`` method
> provided by the consumer of the pages to know how to load the pages. The
> recv threads can work in different ways depending on the channel type.
>
> For socket-based channels, multifd recv side is almost event-driven.
> Each multifd recv threads will be blocked reading the channels until a
> complete multifd packet header is received. With that, pages are loaded
> as they arrive on the ports with the ``MultiFDMethods.recv()`` method
> provided by the client, so as to post-process the data received.
>
> For file-based channels, multifd recv side works slightly differently.
> It works more like the sender side, that client can queue requests to
> multifd recv threads to load specific portion of file into corresponding
> portion of RAMs. The ``MultiFDMethods.recv()`` in this case simply
> always executes the load operation from file as requested.
>
> Feel free to take all or none. You can also mention it after the next
> paragraph on "client-specific handling". Anyway, some mentioning of
> event-driven model used in socket channels would be nice.
>
ok.
>> +
>> +Aside from eventually transmitting the data over the underlying
>> +QIOChannel, a channel's operation also includes calling back to the
>> +client code at pre-determined points to allow for client-specific
>> +handling such as data transformation (e.g. compression), creation of
>> +the packet header and arranging the data into iovs (``struct
>> +iovec``). Iovs are the type of data on which the QIOChannel operates.
>> +
>> +A high-level flow for each thread is:
>> +
>> +Migration thread:
>> +
>> +#. Populate shared structure with opaque data (e.g. ram pages)
>> +#. Call ``multifd_send()``
>> +
>> + #. Loop over the channels until one is idle
>> + #. Switch pointers between client data and channel data
>> + #. Release channel semaphore
>> +#. Receive back empty object
>> +#. Repeat
>> +
>> +Multifd thread:
>> +
>> +#. Channel idle
>> +#. Gets released by ``multifd_send()``
>> +#. Call ``MultiFDMethods`` methods to fill iov
>> +
>> + #. Compression may happen
>> + #. Zero page detection may happen
>> + #. Packet is written
>> + #. iov is written
>> +#. Pass iov into QIOChannel for transferring (I/O happens here)
>> +#. Repeat
>> +
>> +The destination side operates similarly but with ``multifd_recv()``,
>> +decompression instead of compression, etc. One important aspect is
>> +that when receiving the data, the iov will contain host virtual
>> +addresses, so guest memory is written to directly from multifd
>> +threads.
>> +
>> +About flags
>> +-----------
>> +The main thread orchestrates the migration by issuing control flags on
>> +the migration stream (``QEMU_VM_*``).
>> +
>> +The main memory is migrated by ram.c and includes specific control
>> +flags that are also put on the main migration stream
>> +(``RAM_SAVE_FLAG_*``).
>> +
>> +Multifd has its own set of flags (``MULTIFD_FLAG_*``) that are
>> +included into each packet. These may inform about properties such as
>> +the compression algorithm used if the data is compressed.
>
> I think I get your intention, on that we have different levels of flags and
> maybe it's not easy to know which is which. However since this is multifd
> specific doc, from that POV the first two paragraphs may be more suitable
> for some more high level doc to me.
>
This is just to avoid mentioning RAM_SAVE_FLAG_MULTIFD_FLUSH below out
of nowhere. I'll try to merge the relevant part into there.
> Meanwhile, I feel that reading the flag section without a quick packet
> header introduction is a tiny little abrupt to readers, as the flag is part
> of the packet but it came from nowhere yet. One option is we make this
> section "multifd packet header" then introduce all fields quickly including
> the flags. If you like keeping this it's ok too, we can work on top.
>
I did mention the packet and the flags up there. It appears you missed
it, so I need to make it more explicit indeed. =)
>> +
>> +.. _synchronization:
>> +
>> +Synchronization
>> +---------------
>> +
>> +Data sent through multifd may arrive out of order and with different
>> +timing. Some clients may also have synchronization requirements to
>> +ensure data consistency, e.g. the RAM migration must ensure that
>> +memory pages received by the destination machine are ordered in
>> +relation to previous iterations of dirty tracking.
>> +
>> +Some cleanup tasks such as memory deallocation or error handling may
>> +need to happen only after all channels have finished sending/receiving
>> +the data.
>> +
>> +Multifd provides the ``multifd_send_sync_main()`` and
>> +``multifd_recv_sync_main()`` helpers to synchronize the main migration
>> +thread with the multifd channels. In addition, these helpers also
>> +trigger the emission of a sync packet (``MULTIFD_FLAG_SYNC``) which
>> +carries the synchronization command to the remote side of the
>> +migration.
>
> [1]
>
>> +
>> +After the channels have been put into a wait state by the sync
>> +functions, the client code may continue to transmit additional data by
>> +issuing ``multifd_send()`` once again.
>> +
>> +Note:
>> +
>> +- the RAM migration does, effectively, a global synchronization by
>> + chaining a call to ``multifd_send_sync_main()`` with the emission of a
>> + flag on the main migration channel (``RAM_SAVE_FLAG_MULTIFD_FLUSH``)
>
> ... or RAM_SAVE_FLAG_EOS ... depending on the machine type.
>
Eh.. big compatibility mess. I rather not mention it.
> Maybe we should also add a sentence on the relationship of
> MULTIFD_FLAG_SYNC and RAM_SAVE_FLAG_MULTIFD_FLUSH (or RAM_SAVE_FLAG_EOS ),
> in that they should always be sent together, and only if so would it
> provide ordering of multifd messages and what happens in the main migration
> thread.
>
The problem is that RAM_SAVE_FLAGs are a ram.c thing. In theory the need
for RAM_SAVE_FLAG_MULTIFD_FLUSH is just because the RAM migration is
driven by the source machine by the flags that are put on the
stream. IOW, this is a RAM migration design, not a multifd design. The
multifd design is (could be, we decide) that once sync packets are sent,
_something_ must do the following:
for (i = 0; i < thread_count; i++) {
trace_multifd_recv_sync_main_wait(i);
qemu_sem_wait(&multifd_recv_state->sem_sync);
}
... which is already part of multifd_recv_sync_main(), but that just
_happens to be_ called by ram.c when it sees the
RAM_SAVE_FLAG_MULTIFD_FLUSH flag on the stream, that's not a multifd
design requirement. The ram.c code could for instance do the sync when
some QEMU_VM_SECTION_EOS (or whatever it's called) appears.
> Maybe we can attach that sentence at the end of [1].
>
>> + which in turn causes ``multifd_recv_sync_main()`` to be called on the
>> + destination.
>> +
>> + There are also backward compatibility concerns expressed by
>> + ``multifd_ram_sync_per_section()`` and
>> + ``multifd_ram_sync_per_round()``. See the code for detailed
>> + documentation.
>> +
>> +- the `mapped-ram` feature has different requirements because it's an
>> + asynchronous migration (source and destination not migrating at the
>> + same time). For that feature, only the sync between the channels is
>> + relevant to prevent cleanup to happen before data is completely
>> + written to (or read from) the migration file.
>> +
>> +Data transformation
>> +-------------------
>> +
>> +The ``MultiFDMethods`` structure defines callbacks that allow the
>> +client code to perform operations on the data at key points. These
>> +operations could be client-specific (e.g. compression), but also
>> +include a few required steps such as moving data into an iovs. See the
>> +struct's definition for more detailed documentation.
>> +
>> +Historically, the only client for multifd has been the RAM migration,
>> +so the ``MultiFDMethods`` are pre-registered in two categories,
>> +compression and no-compression, with the latter being the regular,
>> +uncompressed ram migration.
>> +
>> +Zero page detection
>> ++++++++++++++++++++
>> +
>> +The migration without compression has a further specificity of
>
> Compressors also have zero page detection. E.g.:
>
> multifd_send_zero_page_detect()
> <- multifd_send_prepare_common()
> <- multifd_zstd_send_prepare()
>
Oops, I forgot. I thinking surely detecting zeros comes along with the
compression algorithm and we don't need to tell it.
>> +possibly doing zero page detection. It involves doing the detection of
>> +a zero page directly in the multifd channels instead of beforehand on
>> +the main migration thread (as it's been done in the past). This is the
>> +default behavior and can be disabled with:
>> +
>> + ``migrate_set_parameter zero-page-detection legacy``
>> +
>> +or to disable zero page detection completely:
>> +
>> + ``migrate_set_parameter zero-page-detection none``
>> +
>> +Error handling
>> +--------------
>> +
>> +Any part of multifd code can be made to exit by setting the
>> +``exiting`` atomic flag of the multifd state. Whenever a multifd
>> +channel has an error, it should break out of its loop, set the flag to
>> +indicate other channels to exit as well and set the migration error
>> +with ``migrate_set_error()``.
>> +
>> +For clean exiting (triggered from outside the channels), the
>> +``multifd_send|recv_terminate_threads()`` functions set the
>> +``exiting`` flag and additionally release any channels that may be
>> +idle or waiting for a sync.
>> +
>> +Code structure
>> +--------------
>> +
>> +Multifd code is divided into:
>> +
>> +The main file containing the core routines
>> +
>> +- multifd.c
>> +
>> +RAM migration
>> +
>> +- multifd-nocomp.c (nocomp, for "no compression")
>> +- multifd-zero-page.c
>> +- ram.c (also involved in non-multifd migrations & snapshots)
>> +
>> +Compressors
>> +
>> +- multifd-uadk.c
>> +- multifd-qatzip.c
>> +- multifd-zlib.c
>> +- multifd-qpl.c
>> +- multifd-zstd.c
>> --
>> 2.35.3
>>
On Fri, Mar 07, 2025 at 04:06:17PM -0300, Fabiano Rosas wrote:
> > I never tried vsock, would it be used in any use case?
> >
>
> I don't know, I'm going by what's in the code.
>
> > It seems to be introduced by accident in 72a8192e225cea, but I'm not sure.
> > Maybe there's something I missed.
>
> The code was always had some variation of:
>
> static bool transport_supports_multi_channels(SocketAddress *saddr)
> {
> return strstart(uri, "tcp:", NULL) || strstart(uri, "unix:", NULL) ||
> strstart(uri, "vsock:", NULL);
> }
>
> Introduced by b7acd65707 ("migration: allow multifd for socket protocol
> only").
Looks like a copy-paste error.. I found it, should be 9ba3b2baa1
("migration: add vsock as data channel support").
https://lore.kernel.org/all/2bc0e226-ee71-330a-1bcd-bd9d097509bc@huawei.com/
https://kvmforum2019.sched.com/event/Tmzh/zero-next-generation-virtualization-platform-for-huawei-cloud-jinsong-liu-zhichao-huang-huawei
https://e.huawei.com/sa/material/event/HC/e37b9c4c33e14e869bb1183fab468fed
So if I read it right.. the VM in this case is inside a container or
something, then it talks to an "agent" on a PCIe device which understands
virtio-vsock protocol. So maybe vsock just performs better than other ways
to dig that tunnel for the container.
In that case, mentioning vsock is at least ok.
[...]
> >> +After the channels have been put into a wait state by the sync
> >> +functions, the client code may continue to transmit additional data by
> >> +issuing ``multifd_send()`` once again.
> >> +
> >> +Note:
> >> +
> >> +- the RAM migration does, effectively, a global synchronization by
> >> + chaining a call to ``multifd_send_sync_main()`` with the emission of a
> >> + flag on the main migration channel (``RAM_SAVE_FLAG_MULTIFD_FLUSH``)
> >
> > ... or RAM_SAVE_FLAG_EOS ... depending on the machine type.
> >
>
> Eh.. big compatibility mess. I rather not mention it.
It's not strictly a compatibility mess. IIUC, it was used to be designed
to always work with EOS. I think at that time Juan was still focused on
making it work and not whole perf tunings, but then we found it can be a
major perf issue if we flush too soon. Then if we flush it once per round,
it may not always pair with a EOS. That's why we needed a new message.
But hey, you're writting a doc that helps everyone. You deserve to decide
whether you like to mention it or not on this one. :)
IIRC we updated our compat rule so we maintain each machine type for only 6
years. It means the whole per-iteration + EOS stuff can be removed in 3.5
years or so - we did that work in July 2022. So it isn't that important
either to mention indeed.
>
> > Maybe we should also add a sentence on the relationship of
> > MULTIFD_FLAG_SYNC and RAM_SAVE_FLAG_MULTIFD_FLUSH (or RAM_SAVE_FLAG_EOS ),
> > in that they should always be sent together, and only if so would it
> > provide ordering of multifd messages and what happens in the main migration
> > thread.
> >
>
> The problem is that RAM_SAVE_FLAGs are a ram.c thing. In theory the need
> for RAM_SAVE_FLAG_MULTIFD_FLUSH is just because the RAM migration is
> driven by the source machine by the flags that are put on the
> stream. IOW, this is a RAM migration design, not a multifd design. The
> multifd design is (could be, we decide) that once sync packets are sent,
> _something_ must do the following:
>
> for (i = 0; i < thread_count; i++) {
> trace_multifd_recv_sync_main_wait(i);
> qemu_sem_wait(&multifd_recv_state->sem_sync);
> }
>
> ... which is already part of multifd_recv_sync_main(), but that just
> _happens to be_ called by ram.c when it sees the
> RAM_SAVE_FLAG_MULTIFD_FLUSH flag on the stream, that's not a multifd
> design requirement. The ram.c code could for instance do the sync when
> some QEMU_VM_SECTION_EOS (or whatever it's called) appears.
I still think it should be done in RAM code only. One major goal (if not
the only goal..) is it wants to order different versions of pages and
that's only what the RAM module is about, not migration in general.
From that POV, having a QEMU_VM_* is kind of the wrong layer - they should
work for migration in general.
Said so, I agree we do violate it from time to time, for example, we have a
bunch of subcmds (MIG_CMD_POSTCOPY*) just for postcopy, which is under
QEMU_VM_COMMAND. But IIUC that was either kind of ancient (so we need to
stick with them now.. postcopy was there for 10 years) or it needs some
ping-pong messages in which case QEMU_VM_COMMAND is the easiest.. IMHO we
should still try to stick with the layering if possible.
--
Peter Xu
Peter Xu <peterx@redhat.com> writes:
> On Fri, Mar 07, 2025 at 04:06:17PM -0300, Fabiano Rosas wrote:
>> > I never tried vsock, would it be used in any use case?
>> >
>>
>> I don't know, I'm going by what's in the code.
>>
>> > It seems to be introduced by accident in 72a8192e225cea, but I'm not sure.
>> > Maybe there's something I missed.
>>
>> The code was always had some variation of:
>>
>> static bool transport_supports_multi_channels(SocketAddress *saddr)
>> {
>> return strstart(uri, "tcp:", NULL) || strstart(uri, "unix:", NULL) ||
>> strstart(uri, "vsock:", NULL);
>> }
>>
>> Introduced by b7acd65707 ("migration: allow multifd for socket protocol
>> only").
>
> Looks like a copy-paste error.. I found it, should be 9ba3b2baa1
> ("migration: add vsock as data channel support").
>
> https://lore.kernel.org/all/2bc0e226-ee71-330a-1bcd-bd9d097509bc@huawei.com/
> https://kvmforum2019.sched.com/event/Tmzh/zero-next-generation-virtualization-platform-for-huawei-cloud-jinsong-liu-zhichao-huang-huawei
> https://e.huawei.com/sa/material/event/HC/e37b9c4c33e14e869bb1183fab468fed
>
Great, thanks for finding those. I'll add it to my little "lore"
folder. This kind of information is always good to know.
> So if I read it right.. the VM in this case is inside a container or
> something, then it talks to an "agent" on a PCIe device which understands
> virtio-vsock protocol. So maybe vsock just performs better than other ways
> to dig that tunnel for the container.
>
> In that case, mentioning vsock is at least ok.
>
> [...]
>
>> >> +After the channels have been put into a wait state by the sync
>> >> +functions, the client code may continue to transmit additional data by
>> >> +issuing ``multifd_send()`` once again.
>> >> +
>> >> +Note:
>> >> +
>> >> +- the RAM migration does, effectively, a global synchronization by
>> >> + chaining a call to ``multifd_send_sync_main()`` with the emission of a
>> >> + flag on the main migration channel (``RAM_SAVE_FLAG_MULTIFD_FLUSH``)
>> >
>> > ... or RAM_SAVE_FLAG_EOS ... depending on the machine type.
>> >
>>
>> Eh.. big compatibility mess. I rather not mention it.
>
> It's not strictly a compatibility mess. IIUC, it was used to be designed
> to always work with EOS. I think at that time Juan was still focused on
> making it work and not whole perf tunings, but then we found it can be a
> major perf issue if we flush too soon. Then if we flush it once per round,
> it may not always pair with a EOS. That's why we needed a new message.
>
Being fully honest, at the time I got the impression the situation was
"random person inside RH decided to measure performance of random thing
and upstream maintainer felt pressure to push a fix".
Whether that was the case or not, it doesn't matter now, but we can't
deny that this _has_ generated some headache, just look at how many
issues arose from the introduction of that flag.
> But hey, you're writting a doc that helps everyone. You deserve to decide
> whether you like to mention it or not on this one. :)
My rant aside, I really want to avoid any readers having to think too
much about this flush thing. We're already seeing some confusion when
discussing it with Prasad in the other thread. The code itself and the
git log are more reliable to explain the compat situation IMO.
>
> IIRC we updated our compat rule so we maintain each machine type for only 6
> years. It means the whole per-iteration + EOS stuff can be removed in 3.5
> years or so - we did that work in July 2022. So it isn't that important
> either to mention indeed.
>
Yep, that as well.
>>
>> > Maybe we should also add a sentence on the relationship of
>> > MULTIFD_FLAG_SYNC and RAM_SAVE_FLAG_MULTIFD_FLUSH (or RAM_SAVE_FLAG_EOS ),
>> > in that they should always be sent together, and only if so would it
>> > provide ordering of multifd messages and what happens in the main migration
>> > thread.
>> >
>>
>> The problem is that RAM_SAVE_FLAGs are a ram.c thing. In theory the need
>> for RAM_SAVE_FLAG_MULTIFD_FLUSH is just because the RAM migration is
>> driven by the source machine by the flags that are put on the
>> stream. IOW, this is a RAM migration design, not a multifd design. The
>> multifd design is (could be, we decide) that once sync packets are sent,
>> _something_ must do the following:
>>
>> for (i = 0; i < thread_count; i++) {
>> trace_multifd_recv_sync_main_wait(i);
>> qemu_sem_wait(&multifd_recv_state->sem_sync);
>> }
>>
>> ... which is already part of multifd_recv_sync_main(), but that just
>> _happens to be_ called by ram.c when it sees the
>> RAM_SAVE_FLAG_MULTIFD_FLUSH flag on the stream, that's not a multifd
>> design requirement. The ram.c code could for instance do the sync when
>> some QEMU_VM_SECTION_EOS (or whatever it's called) appears.
>
> I still think it should be done in RAM code only. One major goal (if not
> the only goal..) is it wants to order different versions of pages and
> that's only what the RAM module is about, not migration in general.
>
> From that POV, having a QEMU_VM_* is kind of the wrong layer - they should
> work for migration in general.
>
> Said so, I agree we do violate it from time to time, for example, we have a
> bunch of subcmds (MIG_CMD_POSTCOPY*) just for postcopy, which is under
> QEMU_VM_COMMAND. But IIUC that was either kind of ancient (so we need to
> stick with them now.. postcopy was there for 10 years) or it needs some
> ping-pong messages in which case QEMU_VM_COMMAND is the easiest.. IMHO we
> should still try to stick with the layering if possible.
All good points, but I was talking of something else:
I was just throwing an example of how it could be done differently to
make the point clear that the recv_sync has nothing to do with the ram
flag, that's just implementation detail. I was thinking specifically
about the multifd+postcopy work where we might need syncs but there is
no RAM_FLAGS there.
We don't actually _need_ to sync with the migration thread on the
destination like that. The client could send control information in it's
opaque packet (instead of in the migration thread) or in a completely
separate channel if it wanted. That sync is also not necessary if there
is no dependency around the data being transferred (i.e. mapped-ram just
takes data form the file and writes to guest memory)
To be clear I'm not suggesting we change anything, I'm only trying to
reflect in the docs some level of separation between multifd.c and ram.c
(or whatever else) because we've seen that mixing the two makes the
design less clean. We've already cleared much of the "p->pages" kind of
issues and the packet definition is more versatile after Maciej's work,
but I still think the separation should be more strict (hence all the
"client" talk in this doc).
(I realise I'm hijacking the documentation thread to talk about high
level design, my apologies).
On Mon, Mar 10, 2025 at 11:24:15AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > On Fri, Mar 07, 2025 at 04:06:17PM -0300, Fabiano Rosas wrote:
> >> > I never tried vsock, would it be used in any use case?
> >> >
> >>
> >> I don't know, I'm going by what's in the code.
> >>
> >> > It seems to be introduced by accident in 72a8192e225cea, but I'm not sure.
> >> > Maybe there's something I missed.
> >>
> >> The code was always had some variation of:
> >>
> >> static bool transport_supports_multi_channels(SocketAddress *saddr)
> >> {
> >> return strstart(uri, "tcp:", NULL) || strstart(uri, "unix:", NULL) ||
> >> strstart(uri, "vsock:", NULL);
> >> }
> >>
> >> Introduced by b7acd65707 ("migration: allow multifd for socket protocol
> >> only").
> >
> > Looks like a copy-paste error.. I found it, should be 9ba3b2baa1
> > ("migration: add vsock as data channel support").
> >
> > https://lore.kernel.org/all/2bc0e226-ee71-330a-1bcd-bd9d097509bc@huawei.com/
> > https://kvmforum2019.sched.com/event/Tmzh/zero-next-generation-virtualization-platform-for-huawei-cloud-jinsong-liu-zhichao-huang-huawei
> > https://e.huawei.com/sa/material/event/HC/e37b9c4c33e14e869bb1183fab468fed
> >
>
> Great, thanks for finding those. I'll add it to my little "lore"
> folder. This kind of information is always good to know.
>
> > So if I read it right.. the VM in this case is inside a container or
> > something, then it talks to an "agent" on a PCIe device which understands
> > virtio-vsock protocol. So maybe vsock just performs better than other ways
> > to dig that tunnel for the container.
> >
> > In that case, mentioning vsock is at least ok.
> >
> > [...]
> >
> >> >> +After the channels have been put into a wait state by the sync
> >> >> +functions, the client code may continue to transmit additional data by
> >> >> +issuing ``multifd_send()`` once again.
> >> >> +
> >> >> +Note:
> >> >> +
> >> >> +- the RAM migration does, effectively, a global synchronization by
> >> >> + chaining a call to ``multifd_send_sync_main()`` with the emission of a
> >> >> + flag on the main migration channel (``RAM_SAVE_FLAG_MULTIFD_FLUSH``)
> >> >
> >> > ... or RAM_SAVE_FLAG_EOS ... depending on the machine type.
> >> >
> >>
> >> Eh.. big compatibility mess. I rather not mention it.
> >
> > It's not strictly a compatibility mess. IIUC, it was used to be designed
> > to always work with EOS. I think at that time Juan was still focused on
> > making it work and not whole perf tunings, but then we found it can be a
> > major perf issue if we flush too soon. Then if we flush it once per round,
> > it may not always pair with a EOS. That's why we needed a new message.
> >
>
> Being fully honest, at the time I got the impression the situation was
> "random person inside RH decided to measure performance of random thing
> and upstream maintainer felt pressure to push a fix".
>
> Whether that was the case or not, it doesn't matter now, but we can't
> deny that this _has_ generated some headache, just look at how many
> issues arose from the introduction of that flag.
I might be the "random person" here. I remember I raised this question to
Juan on why we need to flush for each iteration.
I also remember we did perf test, we can redo it. But we can discuss the
design first.
To me, this is a fairly important question to ask. Fundamentally, the very
initial question is why do we need periodic flush and sync at all. It's
because we want to make sure new version of pages to land later than old
versions.
Note that we can achieve that in other ways too. E.g., if we only enqueue a
page to a specific multifd thread (e.g. page_index % n_multifd_threads),
then it'll guarantee the ordering without flush and sync, because new / old
version for the same page will only go via the same channel, which
guarantees ordering of packets in time order naturally. But that at least
has risk of not being able to fully leverage the bandwidth, e.g., worst
case is the guest has dirty pages that are accidentally always hashed to
the same channel; consider a program keeps dirtying every 32K on a 4K psize
system with 8 channels. Or something like that.
Not documenting EOS part is ok too from that pov, because it's confusing
too on why we need to flush per EOS. Per-round is more understandable from
that POV, because we want to make sure new version lands later, and
versioning boost for pages only happen per-round, not per-iteration.
>
> > But hey, you're writting a doc that helps everyone. You deserve to decide
> > whether you like to mention it or not on this one. :)
>
> My rant aside, I really want to avoid any readers having to think too
> much about this flush thing. We're already seeing some confusion when
> discussing it with Prasad in the other thread. The code itself and the
> git log are more reliable to explain the compat situation IMO.
>
> >
> > IIRC we updated our compat rule so we maintain each machine type for only 6
> > years. It means the whole per-iteration + EOS stuff can be removed in 3.5
> > years or so - we did that work in July 2022. So it isn't that important
> > either to mention indeed.
> >
>
> Yep, that as well.
>
> >>
> >> > Maybe we should also add a sentence on the relationship of
> >> > MULTIFD_FLAG_SYNC and RAM_SAVE_FLAG_MULTIFD_FLUSH (or RAM_SAVE_FLAG_EOS ),
> >> > in that they should always be sent together, and only if so would it
> >> > provide ordering of multifd messages and what happens in the main migration
> >> > thread.
> >> >
> >>
> >> The problem is that RAM_SAVE_FLAGs are a ram.c thing. In theory the need
> >> for RAM_SAVE_FLAG_MULTIFD_FLUSH is just because the RAM migration is
> >> driven by the source machine by the flags that are put on the
> >> stream. IOW, this is a RAM migration design, not a multifd design. The
> >> multifd design is (could be, we decide) that once sync packets are sent,
> >> _something_ must do the following:
> >>
> >> for (i = 0; i < thread_count; i++) {
> >> trace_multifd_recv_sync_main_wait(i);
> >> qemu_sem_wait(&multifd_recv_state->sem_sync);
> >> }
> >>
> >> ... which is already part of multifd_recv_sync_main(), but that just
> >> _happens to be_ called by ram.c when it sees the
> >> RAM_SAVE_FLAG_MULTIFD_FLUSH flag on the stream, that's not a multifd
> >> design requirement. The ram.c code could for instance do the sync when
> >> some QEMU_VM_SECTION_EOS (or whatever it's called) appears.
> >
> > I still think it should be done in RAM code only. One major goal (if not
> > the only goal..) is it wants to order different versions of pages and
> > that's only what the RAM module is about, not migration in general.
> >
> > From that POV, having a QEMU_VM_* is kind of the wrong layer - they should
> > work for migration in general.
> >
> > Said so, I agree we do violate it from time to time, for example, we have a
> > bunch of subcmds (MIG_CMD_POSTCOPY*) just for postcopy, which is under
> > QEMU_VM_COMMAND. But IIUC that was either kind of ancient (so we need to
> > stick with them now.. postcopy was there for 10 years) or it needs some
> > ping-pong messages in which case QEMU_VM_COMMAND is the easiest.. IMHO we
> > should still try to stick with the layering if possible.
>
> All good points, but I was talking of something else:
>
> I was just throwing an example of how it could be done differently to
> make the point clear that the recv_sync has nothing to do with the ram
> flag, that's just implementation detail. I was thinking specifically
> about the multifd+postcopy work where we might need syncs but there is
> no RAM_FLAGS there.
>
> We don't actually _need_ to sync with the migration thread on the
> destination like that. The client could send control information in it's
> opaque packet (instead of in the migration thread) or in a completely
> separate channel if it wanted. That sync is also not necessary if there
> is no dependency around the data being transferred (i.e. mapped-ram just
> takes data form the file and writes to guest memory)
Mapped-ram is definitely different.
For sockets, IIUC we do rely on the messages on the multifd channels _and_
the message on the main channel.
So I may not have fully get your points above, but.. See how it more or
less implemented a remote memory barrier kind of thing _with_ the main
channel message:
main channel multifd channel 1 multifd channel 2
------------ ----------------- -----------------
send page P v1
+------------------------------------------------------------------+
| RAM_SAVE_FLAG_MULTIFD_FLUSH |
| MULTIFD_FLAG_SYNC MULTIFD_FLAG_SYNC |
+------------------------------------------------------------------+
send page P v2
Then v1 and v2 of the page P are ordered.
If without the message on the main channel:
main channel multifd channel 1 multifd channel 2
------------ ----------------- -----------------
send page P v1
MULTIFD_FLAG_SYNC
MULTIFD_FLAG_SYNC
send page P v2
Then I don't see what protects reorder of arrival of messages like:
main channel multifd channel 1 multifd channel 2
------------ ----------------- -----------------
MULTIFD_FLAG_SYNC
send page P v2
send page P v1
MULTIFD_FLAG_SYNC
>
> To be clear I'm not suggesting we change anything, I'm only trying to
> reflect in the docs some level of separation between multifd.c and ram.c
> (or whatever else) because we've seen that mixing the two makes the
> design less clean. We've already cleared much of the "p->pages" kind of
> issues and the packet definition is more versatile after Maciej's work,
> but I still think the separation should be more strict (hence all the
> "client" talk in this doc).
>
> (I realise I'm hijacking the documentation thread to talk about high
> level design, my apologies).
>
--
Peter Xu
Peter Xu <peterx@redhat.com> writes:
> On Mon, Mar 10, 2025 at 11:24:15AM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > On Fri, Mar 07, 2025 at 04:06:17PM -0300, Fabiano Rosas wrote:
>> >> > I never tried vsock, would it be used in any use case?
>> >> >
>> >>
>> >> I don't know, I'm going by what's in the code.
>> >>
>> >> > It seems to be introduced by accident in 72a8192e225cea, but I'm not sure.
>> >> > Maybe there's something I missed.
>> >>
>> >> The code was always had some variation of:
>> >>
>> >> static bool transport_supports_multi_channels(SocketAddress *saddr)
>> >> {
>> >> return strstart(uri, "tcp:", NULL) || strstart(uri, "unix:", NULL) ||
>> >> strstart(uri, "vsock:", NULL);
>> >> }
>> >>
>> >> Introduced by b7acd65707 ("migration: allow multifd for socket protocol
>> >> only").
>> >
>> > Looks like a copy-paste error.. I found it, should be 9ba3b2baa1
>> > ("migration: add vsock as data channel support").
>> >
>> > https://lore.kernel.org/all/2bc0e226-ee71-330a-1bcd-bd9d097509bc@huawei.com/
>> > https://kvmforum2019.sched.com/event/Tmzh/zero-next-generation-virtualization-platform-for-huawei-cloud-jinsong-liu-zhichao-huang-huawei
>> > https://e.huawei.com/sa/material/event/HC/e37b9c4c33e14e869bb1183fab468fed
>> >
>>
>> Great, thanks for finding those. I'll add it to my little "lore"
>> folder. This kind of information is always good to know.
>>
>> > So if I read it right.. the VM in this case is inside a container or
>> > something, then it talks to an "agent" on a PCIe device which understands
>> > virtio-vsock protocol. So maybe vsock just performs better than other ways
>> > to dig that tunnel for the container.
>> >
>> > In that case, mentioning vsock is at least ok.
>> >
>> > [...]
>> >
>> >> >> +After the channels have been put into a wait state by the sync
>> >> >> +functions, the client code may continue to transmit additional data by
>> >> >> +issuing ``multifd_send()`` once again.
>> >> >> +
>> >> >> +Note:
>> >> >> +
>> >> >> +- the RAM migration does, effectively, a global synchronization by
>> >> >> + chaining a call to ``multifd_send_sync_main()`` with the emission of a
>> >> >> + flag on the main migration channel (``RAM_SAVE_FLAG_MULTIFD_FLUSH``)
>> >> >
>> >> > ... or RAM_SAVE_FLAG_EOS ... depending on the machine type.
>> >> >
>> >>
>> >> Eh.. big compatibility mess. I rather not mention it.
>> >
>> > It's not strictly a compatibility mess. IIUC, it was used to be designed
>> > to always work with EOS. I think at that time Juan was still focused on
>> > making it work and not whole perf tunings, but then we found it can be a
>> > major perf issue if we flush too soon. Then if we flush it once per round,
>> > it may not always pair with a EOS. That's why we needed a new message.
>> >
>>
>> Being fully honest, at the time I got the impression the situation was
>> "random person inside RH decided to measure performance of random thing
>> and upstream maintainer felt pressure to push a fix".
>>
>> Whether that was the case or not, it doesn't matter now, but we can't
>> deny that this _has_ generated some headache, just look at how many
>> issues arose from the introduction of that flag.
>
> I might be the "random person" here. I remember I raised this question to
> Juan on why we need to flush for each iteration.
>
It's a good question indeed. It was the right call to address it, I just
wish we had come up with a more straight-forward solution to it.
> I also remember we did perf test, we can redo it. But we can discuss the
> design first.
>
> To me, this is a fairly important question to ask. Fundamentally, the very
> initial question is why do we need periodic flush and sync at all. It's
> because we want to make sure new version of pages to land later than old
> versions.
>
> Note that we can achieve that in other ways too. E.g., if we only enqueue a
> page to a specific multifd thread (e.g. page_index % n_multifd_threads),
> then it'll guarantee the ordering without flush and sync, because new / old
> version for the same page will only go via the same channel, which
> guarantees ordering of packets in time order naturally.
Right.
> But that at least has risk of not being able to fully leverage the
> bandwidth, e.g., worst case is the guest has dirty pages that are
> accidentally always hashed to the same channel; consider a program
> keeps dirtying every 32K on a 4K psize system with 8 channels. Or
> something like that.
>
> Not documenting EOS part is ok too from that pov, because it's confusing
> too on why we need to flush per EOS. Per-round is more understandable from
> that POV, because we want to make sure new version lands later, and
> versioning boost for pages only happen per-round, not per-iteration.
>
>>
>> > But hey, you're writting a doc that helps everyone. You deserve to decide
>> > whether you like to mention it or not on this one. :)
>>
>> My rant aside, I really want to avoid any readers having to think too
>> much about this flush thing. We're already seeing some confusion when
>> discussing it with Prasad in the other thread. The code itself and the
>> git log are more reliable to explain the compat situation IMO.
>>
>> >
>> > IIRC we updated our compat rule so we maintain each machine type for only 6
>> > years. It means the whole per-iteration + EOS stuff can be removed in 3.5
>> > years or so - we did that work in July 2022. So it isn't that important
>> > either to mention indeed.
>> >
>>
>> Yep, that as well.
>>
>> >>
>> >> > Maybe we should also add a sentence on the relationship of
>> >> > MULTIFD_FLAG_SYNC and RAM_SAVE_FLAG_MULTIFD_FLUSH (or RAM_SAVE_FLAG_EOS ),
>> >> > in that they should always be sent together, and only if so would it
>> >> > provide ordering of multifd messages and what happens in the main migration
>> >> > thread.
>> >> >
>> >>
>> >> The problem is that RAM_SAVE_FLAGs are a ram.c thing. In theory the need
>> >> for RAM_SAVE_FLAG_MULTIFD_FLUSH is just because the RAM migration is
>> >> driven by the source machine by the flags that are put on the
>> >> stream. IOW, this is a RAM migration design, not a multifd design. The
>> >> multifd design is (could be, we decide) that once sync packets are sent,
>> >> _something_ must do the following:
>> >>
>> >> for (i = 0; i < thread_count; i++) {
>> >> trace_multifd_recv_sync_main_wait(i);
>> >> qemu_sem_wait(&multifd_recv_state->sem_sync);
>> >> }
>> >>
>> >> ... which is already part of multifd_recv_sync_main(), but that just
>> >> _happens to be_ called by ram.c when it sees the
>> >> RAM_SAVE_FLAG_MULTIFD_FLUSH flag on the stream, that's not a multifd
>> >> design requirement. The ram.c code could for instance do the sync when
>> >> some QEMU_VM_SECTION_EOS (or whatever it's called) appears.
>> >
>> > I still think it should be done in RAM code only. One major goal (if not
>> > the only goal..) is it wants to order different versions of pages and
>> > that's only what the RAM module is about, not migration in general.
>> >
>> > From that POV, having a QEMU_VM_* is kind of the wrong layer - they should
>> > work for migration in general.
>> >
>> > Said so, I agree we do violate it from time to time, for example, we have a
>> > bunch of subcmds (MIG_CMD_POSTCOPY*) just for postcopy, which is under
>> > QEMU_VM_COMMAND. But IIUC that was either kind of ancient (so we need to
>> > stick with them now.. postcopy was there for 10 years) or it needs some
>> > ping-pong messages in which case QEMU_VM_COMMAND is the easiest.. IMHO we
>> > should still try to stick with the layering if possible.
>>
>> All good points, but I was talking of something else:
>>
>> I was just throwing an example of how it could be done differently to
>> make the point clear that the recv_sync has nothing to do with the ram
>> flag, that's just implementation detail. I was thinking specifically
>> about the multifd+postcopy work where we might need syncs but there is
>> no RAM_FLAGS there.
>>
>> We don't actually _need_ to sync with the migration thread on the
>> destination like that. The client could send control information in it's
>> opaque packet (instead of in the migration thread) or in a completely
>> separate channel if it wanted. That sync is also not necessary if there
>> is no dependency around the data being transferred (i.e. mapped-ram just
>> takes data form the file and writes to guest memory)
>
> Mapped-ram is definitely different.
>
> For sockets, IIUC we do rely on the messages on the multifd channels _and_
> the message on the main channel.
>
> So I may not have fully get your points above, but..
My point is just a theoretical one. We _could_ make this work with a
different mechanism. And that's why I'm being careful in what to
document, that's all.
> See how it more or
> less implemented a remote memory barrier kind of thing _with_ the main
> channel message:
>
> main channel multifd channel 1 multifd channel 2
> ------------ ----------------- -----------------
> send page P v1
> +------------------------------------------------------------------+
> | RAM_SAVE_FLAG_MULTIFD_FLUSH |
> | MULTIFD_FLAG_SYNC MULTIFD_FLAG_SYNC |
> +------------------------------------------------------------------+
> send page P v2
>
This is a nice way of diagramming that!
> Then v1 and v2 of the page P are ordered.
>
> If without the message on the main channel:
>
> main channel multifd channel 1 multifd channel 2
> ------------ ----------------- -----------------
> send page P v1
> MULTIFD_FLAG_SYNC
> MULTIFD_FLAG_SYNC
> send page P v2
>
> Then I don't see what protects reorder of arrival of messages like:
>
> main channel multifd channel 1 multifd channel 2
> ------------ ----------------- -----------------
> MULTIFD_FLAG_SYNC
> send page P v2
> send page P v1
> MULTIFD_FLAG_SYNC
>
That's all fine. As long as the recv part doesn't see them out of
order. I'll try to write some code to confirm so I don't waste too much
of your time.
On Tue, 11 Mar 2025 at 00:59, Fabiano Rosas <farosas@suse.de> wrote: > Peter Xu <peterx@redhat.com> writes: > > To me, this is a fairly important question to ask. Fundamentally, the very > > initial question is why do we need periodic flush and sync at all. It's > > because we want to make sure new version of pages to land later than old > > versions. ... > > Then v1 and v2 of the page P are ordered. > > If without the message on the main channel: > > Then I don't see what protects reorder of arrival of messages like: ... > That's all fine. As long as the recv part doesn't see them out of > order. I'll try to write some code to confirm so I don't waste too much > of your time. * Relying on this receive order seems like a passive solution. On one side we are saying there is no defined 'requirement' on the network or compute capacity/quality for migration. ie. compute and network can be as bad as possible, yet migration shall always work reliably. * When receiving different versions of pages, couldn't multifd_recv check the latest version present in guest RAM and accept the incoming version only if it is fresher than the already present one? ie. if v1 arrives later than v2 on the receive side, the receive side could/should discard v1 because v2 is already received. Thank you. --- - Prasad
Prasad Pandit <ppandit@redhat.com> writes: > On Tue, 11 Mar 2025 at 00:59, Fabiano Rosas <farosas@suse.de> wrote: >> Peter Xu <peterx@redhat.com> writes: >> > To me, this is a fairly important question to ask. Fundamentally, the very >> > initial question is why do we need periodic flush and sync at all. It's >> > because we want to make sure new version of pages to land later than old >> > versions. > ... >> > Then v1 and v2 of the page P are ordered. >> > If without the message on the main channel: >> > Then I don't see what protects reorder of arrival of messages like: > ... >> That's all fine. As long as the recv part doesn't see them out of >> order. I'll try to write some code to confirm so I don't waste too much >> of your time. > > * Relying on this receive order seems like a passive solution. On one > side we are saying there is no defined 'requirement' on the network or > compute capacity/quality for migration. ie. compute and network can be > as bad as possible, yet migration shall always work reliably. > > * When receiving different versions of pages, couldn't multifd_recv > check the latest version present in guest RAM and accept the incoming > version only if it is fresher than the already present one? ie. if v1 > arrives later than v2 on the receive side, the receive side > could/should discard v1 because v2 is already received. > "in guest RAM" I don't think so, the performance would probably be affected. We could have a sequence number that gets bumped per iteration, but I'm not sure how much of a improvement that would be. Without a sync, we'd need some sort of per-page handling*. I have a gut feeling this would get costly. *- maybe per-iovec depending on how we queue pages to multifd. > Thank you. > --- > - Prasad
© 2016 - 2025 Red Hat, Inc.