[PATCH 04/17] migration/multifd: Set p->running = true in the right place
Posted by Avihai Horon 10 months ago
The commit in the Fixes line moved multifd thread creation to a
different location, but forgot to move the p->running = true assignment
as well. Thus, p->running is set to true before the multifd thread is
actually created.

p->running is used in multifd_save_cleanup() to decide whether to join
the multifd thread or not.

With TLS, an error in multifd_tls_channel_connect() can lead to a
segmentation fault because p->running is true but p->thread is never
initialized, so multifd_save_cleanup() tries to join an uninitialized
thread.
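
For reference, the join decision in multifd_save_cleanup() looks roughly
like this (a simplified sketch, not the exact upstream code):

    for (i = 0; i < migrate_multifd_channels(); i++) {
        MultiFDSendParams *p = &multifd_send_state->params[i];

        if (p->running) {
            /* joining a thread that was never created crashes here */
            qemu_thread_join(&p->thread);
        }
    }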

Fix it by moving the p->running = true assignment right after multifd thread
creation. Also move qio_channel_set_delay() there, as this is where it was
originally.

Fixes: 29647140157a ("migration/tls: add support for multifd tls-handshake")
Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 migration/multifd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 25cbc6dc6b..564e911b6c 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -850,11 +850,13 @@ static bool multifd_channel_connect(MultiFDSendParams *p,
         return multifd_tls_channel_connect(p, ioc, errp);
     }
 
+    qio_channel_set_delay(ioc, false);
     migration_ioc_register_yank(ioc);
     p->registered_yank = true;
     p->c = ioc;
     qemu_thread_create(&p->thread, p->name, multifd_send_thread, p,
                        QEMU_THREAD_JOINABLE);
+    p->running = true;
     return true;
 }
 
@@ -883,8 +885,6 @@ static void multifd_new_send_channel_async(QIOTask *task, gpointer opaque)
 
     trace_multifd_new_send_channel_async(p->id);
     if (!qio_task_propagate_error(task, &local_err)) {
-        qio_channel_set_delay(ioc, false);
-        p->running = true;
         if (multifd_channel_connect(p, ioc, &local_err)) {
             return;
         }
-- 
2.26.3
Re: [PATCH 04/17] migration/multifd: Set p->running = true in the right place
Posted by Fabiano Rosas 10 months ago
Avihai Horon <avihaih@nvidia.com> writes:

> The commit in the fixes line moved multifd thread creation to a
> different location, but forgot to move the p->running = true assignment
> as well. Thus, p->running is set to true before multifd thread is
> actually created.
>
> p->running is used in multifd_save_cleanup() to decide whether to join
> the multifd thread or not.
>
> With TLS, an error in multifd_tls_channel_connect() can lead to a
> segmentation fault because p->running is true but p->thread is never
> initialized, so multifd_save_cleanup() tries to join an uninitialized
> thread.
>
> Fix it by moving p->running = true assignment right after multifd thread
> creation. Also move qio_channel_set_delay() to there, as this is where
> it used to be originally.
>
> Fixes: 29647140157a ("migration/tls: add support for multifd tls-handshake")
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>

Just for context, I haven't looked at this patch yet, but we were
planning to remove p->running altogether:

https://lore.kernel.org/r/20231110200241.20679-1-farosas@suse.de
Re: [PATCH 04/17] migration/multifd: Set p->running = true in the right place
Posted by Avihai Horon 10 months ago
On 25/01/2024 22:57, Fabiano Rosas wrote:
> External email: Use caution opening links or attachments
>
>
> Avihai Horon <avihaih@nvidia.com> writes:
>
>> The commit in the fixes line moved multifd thread creation to a
>> different location, but forgot to move the p->running = true assignment
>> as well. Thus, p->running is set to true before multifd thread is
>> actually created.
>>
>> p->running is used in multifd_save_cleanup() to decide whether to join
>> the multifd thread or not.
>>
>> With TLS, an error in multifd_tls_channel_connect() can lead to a
>> segmentation fault because p->running is true but p->thread is never
>> initialized, so multifd_save_cleanup() tries to join an uninitialized
>> thread.
>>
>> Fix it by moving p->running = true assignment right after multifd thread
>> creation. Also move qio_channel_set_delay() to there, as this is where
>> it used to be originally.
>>
>> Fixes: 29647140157a ("migration/tls: add support for multifd tls-handshake")
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> Just for context, I haven't looked at this patch yet, but we were
> planning to remove p->running altogether:
>
> https://lore.kernel.org/r/20231110200241.20679-1-farosas@suse.de

Thanks for putting me in the picture.
I see that there has been a discussion about the multifd
creation/teardown flow.
In light of this discussion, I can already see a few problems in my
series that I didn't notice before (such as the TLS handshake thread leak).
The thread you mentioned here and some of my patches point out some
problems in multifd creation/teardown. I guess we can discuss it and
see what's the best way to solve them.

Regarding this patch, your solution indeed solves the bug that this 
patch addresses, so maybe this could be dropped (or only noted in your 
patch).

Maybe I should also put you (and Peter) in context for this whole series 
-- I am writing it as preparation for adding a separate migration 
channel for VFIO device migration, so VFIO devices could be migrated in 
parallel.
So this series tries to lay down some foundations to facilitate it.
Re: [PATCH 04/17] migration/multifd: Set p->running = true in the right place
Posted by Peter Xu 10 months ago
On Sun, Jan 28, 2024 at 05:43:52PM +0200, Avihai Horon wrote:
> 
> On 25/01/2024 22:57, Fabiano Rosas wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > Avihai Horon <avihaih@nvidia.com> writes:
> > 
> > > The commit in the fixes line moved multifd thread creation to a
> > > different location, but forgot to move the p->running = true assignment
> > > as well. Thus, p->running is set to true before multifd thread is
> > > actually created.
> > > 
> > > p->running is used in multifd_save_cleanup() to decide whether to join
> > > the multifd thread or not.
> > > 
> > > With TLS, an error in multifd_tls_channel_connect() can lead to a
> > > segmentation fault because p->running is true but p->thread is never
> > > initialized, so multifd_save_cleanup() tries to join an uninitialized
> > > thread.
> > > 
> > > Fix it by moving p->running = true assignment right after multifd thread
> > > creation. Also move qio_channel_set_delay() to there, as this is where
> > > it used to be originally.
> > > 
> > > Fixes: 29647140157a ("migration/tls: add support for multifd tls-handshake")
> > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> > Just for context, I haven't looked at this patch yet, but we were
> > planning to remove p->running altogether:
> > 
> > https://lore.kernel.org/r/20231110200241.20679-1-farosas@suse.de
> 
> Thanks for putting me in the picture.
> I see that there has been a discussion about the multifd creation/treadown
> flow.
> In light of this discussion, I can already see a few problems in my series
> that I didn't notice before (such as the TLS handshake thread leak).
> The thread you mentioned here and some of my patches point out some problems
> in multifd creation/treardown. I guess we can discuss it and see what's the
> best way to solve them.
> 
> Regarding this patch, your solution indeed solves the bug that this patch
> addresses, so maybe this could be dropped (or only noted in your patch).
> 
> Maybe I should also put you (and Peter) in context for this whole series --
> I am writing it as preparation for adding a separate migration channel for
> VFIO device migration, so VFIO devices could be migrated in parallel.
> So this series tries to lay down some foundations to facilitate it.

Avihai, is the throughput the only reason that VFIO would like to have a
separate channel?

I'm wondering if we can also use multifd threads to send vfio data at some
point.  Now multifd indeed is closely bound to ram pages but maybe it'll
change in the near future to take any load?

Multifd is for solving the throughput issue already. If vfio has the same
goal, IMHO it'll be good to keep them using the same thread model, instead
of managing different threads in different places.  With that, any user
setting (for example, multifd-n-threads) will naturally apply to all
components, rather than relying on yet-another vfio-migration-threads-num
parameter.

-- 
Peter Xu
Re: [PATCH 04/17] migration/multifd: Set p->running = true in the right place
Posted by Fabiano Rosas 10 months ago
Peter Xu <peterx@redhat.com> writes:

> On Sun, Jan 28, 2024 at 05:43:52PM +0200, Avihai Horon wrote:
>> 
>> On 25/01/2024 22:57, Fabiano Rosas wrote:
>> > External email: Use caution opening links or attachments
>> > 
>> > 
>> > Avihai Horon <avihaih@nvidia.com> writes:
>> > 
>> > > The commit in the fixes line moved multifd thread creation to a
>> > > different location, but forgot to move the p->running = true assignment
>> > > as well. Thus, p->running is set to true before multifd thread is
>> > > actually created.
>> > > 
>> > > p->running is used in multifd_save_cleanup() to decide whether to join
>> > > the multifd thread or not.
>> > > 
>> > > With TLS, an error in multifd_tls_channel_connect() can lead to a
>> > > segmentation fault because p->running is true but p->thread is never
>> > > initialized, so multifd_save_cleanup() tries to join an uninitialized
>> > > thread.
>> > > 
>> > > Fix it by moving p->running = true assignment right after multifd thread
>> > > creation. Also move qio_channel_set_delay() to there, as this is where
>> > > it used to be originally.
>> > > 
>> > > Fixes: 29647140157a ("migration/tls: add support for multifd tls-handshake")
>> > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> > Just for context, I haven't looked at this patch yet, but we were
>> > planning to remove p->running altogether:
>> > 
>> > https://lore.kernel.org/r/20231110200241.20679-1-farosas@suse.de
>> 
>> Thanks for putting me in the picture.
>> I see that there has been a discussion about the multifd creation/treadown
>> flow.
>> In light of this discussion, I can already see a few problems in my series
>> that I didn't notice before (such as the TLS handshake thread leak).
>> The thread you mentioned here and some of my patches point out some problems
>> in multifd creation/treardown. I guess we can discuss it and see what's the
>> best way to solve them.
>> 
>> Regarding this patch, your solution indeed solves the bug that this patch
>> addresses, so maybe this could be dropped (or only noted in your patch).
>> 
>> Maybe I should also put you (and Peter) in context for this whole series --
>> I am writing it as preparation for adding a separate migration channel for
>> VFIO device migration, so VFIO devices could be migrated in parallel.
>> So this series tries to lay down some foundations to facilitate it.
>
> Avihai, is the throughput the only reason that VFIO would like to have a
> separate channel?
>
> I'm wondering if we can also use multifd threads to send vfio data at some
> point.  Now multifd indeed is closely bound to ram pages but maybe it'll
> change in the near future to take any load?

We're not far away from it IMO. We just need to gradually kick the pages
concept out of multifd.

Again, for context, I have played with this recently:

https://gitlab.com/farosas/qemu/-/commits/multifd-packet-cleanups?ref_type=heads

I'd be happy with any solution that turns that p->pages into something
opaque that multifd has no knowledge about.
Re: [PATCH 04/17] migration/multifd: Set p->running = true in the right place
Posted by Avihai Horon 10 months ago
On 29/01/2024 6:17, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Sun, Jan 28, 2024 at 05:43:52PM +0200, Avihai Horon wrote:
>> On 25/01/2024 22:57, Fabiano Rosas wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> Avihai Horon <avihaih@nvidia.com> writes:
>>>
>>>> The commit in the fixes line moved multifd thread creation to a
>>>> different location, but forgot to move the p->running = true assignment
>>>> as well. Thus, p->running is set to true before multifd thread is
>>>> actually created.
>>>>
>>>> p->running is used in multifd_save_cleanup() to decide whether to join
>>>> the multifd thread or not.
>>>>
>>>> With TLS, an error in multifd_tls_channel_connect() can lead to a
>>>> segmentation fault because p->running is true but p->thread is never
>>>> initialized, so multifd_save_cleanup() tries to join an uninitialized
>>>> thread.
>>>>
>>>> Fix it by moving p->running = true assignment right after multifd thread
>>>> creation. Also move qio_channel_set_delay() to there, as this is where
>>>> it used to be originally.
>>>>
>>>> Fixes: 29647140157a ("migration/tls: add support for multifd tls-handshake")
>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>> Just for context, I haven't looked at this patch yet, but we were
>>> planning to remove p->running altogether:
>>>
>>> https://lore.kernel.org/r/20231110200241.20679-1-farosas@suse.de
>> Thanks for putting me in the picture.
>> I see that there has been a discussion about the multifd creation/treadown
>> flow.
>> In light of this discussion, I can already see a few problems in my series
>> that I didn't notice before (such as the TLS handshake thread leak).
>> The thread you mentioned here and some of my patches point out some problems
>> in multifd creation/treardown. I guess we can discuss it and see what's the
>> best way to solve them.
>>
>> Regarding this patch, your solution indeed solves the bug that this patch
>> addresses, so maybe this could be dropped (or only noted in your patch).
>>
>> Maybe I should also put you (and Peter) in context for this whole series --
>> I am writing it as preparation for adding a separate migration channel for
>> VFIO device migration, so VFIO devices could be migrated in parallel.
>> So this series tries to lay down some foundations to facilitate it.
> Avihai, is the throughput the only reason that VFIO would like to have a
> separate channel?

Actually, the main reason is to be able to send and load multiple VFIO
devices' data in parallel.
For example, today if we have three VFIO devices, they are migrated
sequentially one after another.
This particularly hurts during the complete pre-copy phase (downtime),
as loading the VFIO data in the destination involves FW interaction and
resource allocation, which takes time and simply blocks the other
devices from sending and loading their data.
Providing a separate channel and thread for each VFIO device solves this
problem and ideally reduces the VFIO contribution to downtime from
sum{VFIO device #1, ..., VFIO device #N} to max{VFIO device #1, ...,
VFIO device #N}.

>
> I'm wondering if we can also use multifd threads to send vfio data at some
> point.  Now multifd indeed is closely bound to ram pages but maybe it'll
> change in the near future to take any load?
>
> Multifd is for solving the throughput issue already. If vfio has the same
> goal, IMHO it'll be good to keep them using the same thread model, instead
> of managing different threads in different places.  With that, any user
> setting (for example, multifd-n-threads) will naturally apply to all
> components, rather than relying on yet-another vfio-migration-threads-num
> parameter.

Frankly, I didn't really pay much attention to the throughput factor,
and my plan is to introduce only a single thread per device.
VFIO devices may have many GBs of data to migrate (e.g., vGPUs) and even
mlx5 VFs can have a few GBs of data.
So what you are saying here is interesting, although I didn't test such
a scenario to see the actual benefit.

I am trying to think if/how this could work and I have a few concerns:
1. RAM is made of fixed-position pages that can be randomly read/written,
so sending these pages over multiple channels and loading them in the
destination can work pretty naturally without much overhead.
    VFIO device data, on the other hand, is just an opaque stream of bytes
from QEMU's point of view. This means that if we break this data into
"packets" and send them over multiple channels, we must preserve the
order in which this data was originally read from the device and write
the data in the same order to the destination device.
    I am wondering if the overhead of maintaining such order may hurt
performance, making it not worthwhile (a rough sketch of what this would
involve is at the end of this message).

2. As I mentioned above, the main motivation for separate VFIO migration 
channel/thread, as I see today, is to allow parallel migration of VFIO 
devices.
    AFAIU multifd, as it is today, doesn't provide such parallelism 
(i.e., in the complete pre-copy phase each device, be it RAM or VFIO, 
will fully send its data over the multifd threads and only after 
finishing will the next device send its data).

This is just what came to mind. Maybe we can think about this more
thoroughly and see if it could work somehow, now or in the future.
However, I think making the multifd threads generic so they can send any
kind of data is a good thing in general, regardless of VFIO.
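
For concern 1, a very rough, purely hypothetical sketch of what the
ordering could involve (none of this is existing code): each packet of the
opaque device stream would carry a sequence number, and the destination
could only write the next expected chunk, buffering anything that arrives
early:

    /* hypothetical packet layout for an opaque device-state stream */
    typedef struct {
        uint32_t dev_id;   /* which VFIO device this chunk belongs to */
        uint64_t seq;      /* position of this chunk in the stream */
        uint32_t len;
        uint8_t  data[];   /* 'len' bytes, must be written in 'seq' order */
    } DeviceStatePacket;

    /* destination side, per device */
    if (pkt->seq == next_seq) {
        write(device_data_fd, pkt->data, pkt->len);
        next_seq++;
        flush_buffered_packets_in_order();   /* made-up helper */
    } else {
        buffer_out_of_order_packet(pkt);     /* made-up helper */
    }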


Re: [PATCH 04/17] migration/multifd: Set p->running = true in the right place
Posted by Peter Xu 10 months ago
On Mon, Jan 29, 2024 at 02:20:35PM +0200, Avihai Horon wrote:
> 
> On 29/01/2024 6:17, Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Sun, Jan 28, 2024 at 05:43:52PM +0200, Avihai Horon wrote:
> > > On 25/01/2024 22:57, Fabiano Rosas wrote:
> > > > External email: Use caution opening links or attachments
> > > > 
> > > > 
> > > > Avihai Horon <avihaih@nvidia.com> writes:
> > > > 
> > > > > The commit in the fixes line moved multifd thread creation to a
> > > > > different location, but forgot to move the p->running = true assignment
> > > > > as well. Thus, p->running is set to true before multifd thread is
> > > > > actually created.
> > > > > 
> > > > > p->running is used in multifd_save_cleanup() to decide whether to join
> > > > > the multifd thread or not.
> > > > > 
> > > > > With TLS, an error in multifd_tls_channel_connect() can lead to a
> > > > > segmentation fault because p->running is true but p->thread is never
> > > > > initialized, so multifd_save_cleanup() tries to join an uninitialized
> > > > > thread.
> > > > > 
> > > > > Fix it by moving p->running = true assignment right after multifd thread
> > > > > creation. Also move qio_channel_set_delay() to there, as this is where
> > > > > it used to be originally.
> > > > > 
> > > > > Fixes: 29647140157a ("migration/tls: add support for multifd tls-handshake")
> > > > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> > > > Just for context, I haven't looked at this patch yet, but we were
> > > > planning to remove p->running altogether:
> > > > 
> > > > https://lore.kernel.org/r/20231110200241.20679-1-farosas@suse.de
> > > Thanks for putting me in the picture.
> > > I see that there has been a discussion about the multifd creation/treadown
> > > flow.
> > > In light of this discussion, I can already see a few problems in my series
> > > that I didn't notice before (such as the TLS handshake thread leak).
> > > The thread you mentioned here and some of my patches point out some problems
> > > in multifd creation/treardown. I guess we can discuss it and see what's the
> > > best way to solve them.
> > > 
> > > Regarding this patch, your solution indeed solves the bug that this patch
> > > addresses, so maybe this could be dropped (or only noted in your patch).
> > > 
> > > Maybe I should also put you (and Peter) in context for this whole series --
> > > I am writing it as preparation for adding a separate migration channel for
> > > VFIO device migration, so VFIO devices could be migrated in parallel.
> > > So this series tries to lay down some foundations to facilitate it.
> > Avihai, is the throughput the only reason that VFIO would like to have a
> > separate channel?
> 
> Actually, the main reason is to be able to send and load multiple VFIO
> devices data in parallel.
> For example, today if we have three VFIO devices, they are migrated
> sequentially one after another.
> This particularly hurts during the complete pre-copy phase (downtime), as
> loading the VFIO data in destination involves FW interaction and resource
> allocation, which takes time and simply blocks the other devices from
> sending and loading their data.
> Providing a separate channel and thread for each VIFO device solves this
> problem and ideally reduces the VFIO contribution to downtime from sum{VFIO
> device #1, ..., VFIO device #N} to max{VFIO device #1, ..., VFIO device #N}.

I see.

> 
> > 
> > I'm wondering if we can also use multifd threads to send vfio data at some
> > point.  Now multifd indeed is closely bound to ram pages but maybe it'll
> > change in the near future to take any load?
> > 
> > Multifd is for solving the throughput issue already. If vfio has the same
> > goal, IMHO it'll be good to keep them using the same thread model, instead
> > of managing different threads in different places.  With that, any user
> > setting (for example, multifd-n-threads) will naturally apply to all
> > components, rather than relying on yet-another vfio-migration-threads-num
> > parameter.
> 
> Frankly, I didn't really put much attention to the throughput factor, and my
> plan is to introduce only a single thread per device.
> VFIO devices may have many GBs of data to migrate (e.g., vGPUs) and even
> mlx5 VFs can have a few GBs of data.
> So what you are saying here is interesting, although I didn't test such
> scenario to see the actual benefit.
> 
> I am trying to think if/how this could work and I have a few concerns:
> 1. RAM is made of fixed-positioned pages that can be randomly read/written,
> so sending these pages over multiple channels and loading them in the
> destination can work pretty naturally without much overhead.
>    VFIO device data, on the other hand, is just an opaque stream of bytes
> from QEMU point of view. This means that if we break this data to "packets"
> and send them over multiple channels, we must preserve the order by which
> this data was
>    originally read from the device and write the data in the same order to
> the destination device.
>    I am wondering if the overhead of maintaining such order may hurt
> performance, making it not worthwhile.

Indeed, it seems to me VFIO migration is based on a streaming model where
there's no easy way to index a chunk of data.

Is there any background on how that kernel interface was designed?  It
seems pretty unfriendly to concurrency already: even if multiple devices
can migrate concurrently, each single device can already contain GBs of
data as you said, which is pretty common here.  I'm a bit surprised to see
that the kernel interface is designed in this way for such a device.

I was wondering about the possibility of VFIO providing data chunks
with indexes, just like RAM (which is represented in ramblock offsets).
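
Conceptually, something like the difference below (purely hypothetical,
not an existing interface):

    /* today: one ordered stream, reads cannot be parallelized */
    n = read(data_fd, buf, len);

    /* hypothetical: indexed chunks, readable concurrently at offsets */
    n = pread(data_fd, buf, len, chunk_offset);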

> 
> 2. As I mentioned above, the main motivation for separate VFIO migration
> channel/thread, as I see today, is to allow parallel migration of VFIO
> devices.
>    AFAIU multifd, as it is today, doesn't provide such parallelism (i.e., in
> the complete pre-copy phase each device, be it RAM or VFIO, will fully send
> its data over the multifd threads and only after finishing will the next
> device send its data).

Indeed. That's actually an issue not only for VFIO but also for migration
in general: we can't migrate device states concurrently, and multifd is out
of the picture here so far, but maybe there's a chance.

Consider huge VMs where there can already be ~500 vCPUs, each needing its
own get()/put() of CPU state from/to KVM.  It'll be nice if we can do this
in concurrent threads too.  Here VFIO is one of the devices that will also
benefit from such a design, and greatly.

I added a todo in our wiki for this, which I see as a general improvement,
and hopefully someone can look into it:

https://wiki.qemu.org/ToDo/LiveMigration#Device_state_concurrency

I hope VFIO can consider resolving this as a generic issue, rather than
providing its own solution.

> 
> This is just what came to mind. Maybe we can think this more thoroughly and
> see if it could work somehow, now or in the future.
> However, I think making the multifd threads generic so they can send any
> kind of data is a good thing in general, regardless of VFIO.

Right.

In general, having a separate VFIO channel may solve the immediate issue,
but it still won't solve everything; meanwhile, it may introduce the first
example of a completely separate channel that migration won't easily
manage itself, which IMHO can make migration much harder to maintain in
the future.

It may also become technical debt that VFIO will need to carry even if such
a solution is merged, because VFIO would then have its own model for
handling a few problems similar to the ones migration has.

I hope there's a way out where we can work together to improve the
framework, providing a clean approach that also considers the long term.

Thanks,

-- 
Peter Xu


Re: [PATCH 04/17] migration/multifd: Set p->running = true in the right place
Posted by Avihai Horon 10 months ago
On 30/01/2024 7:57, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Mon, Jan 29, 2024 at 02:20:35PM +0200, Avihai Horon wrote:
>> On 29/01/2024 6:17, Peter Xu wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Sun, Jan 28, 2024 at 05:43:52PM +0200, Avihai Horon wrote:
>>>> On 25/01/2024 22:57, Fabiano Rosas wrote:
>>>>> External email: Use caution opening links or attachments
>>>>>
>>>>>
>>>>> Avihai Horon <avihaih@nvidia.com> writes:
>>>>>
>>>>>> The commit in the fixes line moved multifd thread creation to a
>>>>>> different location, but forgot to move the p->running = true assignment
>>>>>> as well. Thus, p->running is set to true before multifd thread is
>>>>>> actually created.
>>>>>>
>>>>>> p->running is used in multifd_save_cleanup() to decide whether to join
>>>>>> the multifd thread or not.
>>>>>>
>>>>>> With TLS, an error in multifd_tls_channel_connect() can lead to a
>>>>>> segmentation fault because p->running is true but p->thread is never
>>>>>> initialized, so multifd_save_cleanup() tries to join an uninitialized
>>>>>> thread.
>>>>>>
>>>>>> Fix it by moving p->running = true assignment right after multifd thread
>>>>>> creation. Also move qio_channel_set_delay() to there, as this is where
>>>>>> it used to be originally.
>>>>>>
>>>>>> Fixes: 29647140157a ("migration/tls: add support for multifd tls-handshake")
>>>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>>> Just for context, I haven't looked at this patch yet, but we were
>>>>> planning to remove p->running altogether:
>>>>>
>>>>> https://lore.kernel.org/r/20231110200241.20679-1-farosas@suse.de
>>>> Thanks for putting me in the picture.
>>>> I see that there has been a discussion about the multifd creation/treadown
>>>> flow.
>>>> In light of this discussion, I can already see a few problems in my series
>>>> that I didn't notice before (such as the TLS handshake thread leak).
>>>> The thread you mentioned here and some of my patches point out some problems
>>>> in multifd creation/treardown. I guess we can discuss it and see what's the
>>>> best way to solve them.
>>>>
>>>> Regarding this patch, your solution indeed solves the bug that this patch
>>>> addresses, so maybe this could be dropped (or only noted in your patch).
>>>>
>>>> Maybe I should also put you (and Peter) in context for this whole series --
>>>> I am writing it as preparation for adding a separate migration channel for
>>>> VFIO device migration, so VFIO devices could be migrated in parallel.
>>>> So this series tries to lay down some foundations to facilitate it.
>>> Avihai, is the throughput the only reason that VFIO would like to have a
>>> separate channel?
>> Actually, the main reason is to be able to send and load multiple VFIO
>> devices data in parallel.
>> For example, today if we have three VFIO devices, they are migrated
>> sequentially one after another.
>> This particularly hurts during the complete pre-copy phase (downtime), as
>> loading the VFIO data in destination involves FW interaction and resource
>> allocation, which takes time and simply blocks the other devices from
>> sending and loading their data.
>> Providing a separate channel and thread for each VIFO device solves this
>> problem and ideally reduces the VFIO contribution to downtime from sum{VFIO
>> device #1, ..., VFIO device #N} to max{VFIO device #1, ..., VFIO device #N}.
> I see.
>
>>> I'm wondering if we can also use multifd threads to send vfio data at some
>>> point.  Now multifd indeed is closely bound to ram pages but maybe it'll
>>> change in the near future to take any load?
>>>
>>> Multifd is for solving the throughput issue already. If vfio has the same
>>> goal, IMHO it'll be good to keep them using the same thread model, instead
>>> of managing different threads in different places.  With that, any user
>>> setting (for example, multifd-n-threads) will naturally apply to all
>>> components, rather than relying on yet-another vfio-migration-threads-num
>>> parameter.
>> Frankly, I didn't really put much attention to the throughput factor, and my
>> plan is to introduce only a single thread per device.
>> VFIO devices may have many GBs of data to migrate (e.g., vGPUs) and even
>> mlx5 VFs can have a few GBs of data.
>> So what you are saying here is interesting, although I didn't test such
>> scenario to see the actual benefit.
>>
>> I am trying to think if/how this could work and I have a few concerns:
>> 1. RAM is made of fixed-positioned pages that can be randomly read/written,
>> so sending these pages over multiple channels and loading them in the
>> destination can work pretty naturally without much overhead.
>>     VFIO device data, on the other hand, is just an opaque stream of bytes
>> from QEMU point of view. This means that if we break this data to "packets"
>> and send them over multiple channels, we must preserve the order by which
>> this data was
>>     originally read from the device and write the data in the same order to
>> the destination device.
>>     I am wondering if the overhead of maintaining such order may hurt
>> performance, making it not worthwhile.
> Indeed, it seems to me VFIO migration is based on a streaming model where
> there's no easy way to index a chunk of data.

Yes, you can see it here as well: 
https://elixir.bootlin.com/linux/v6.8-rc2/source/include/uapi/linux/vfio.h#L1039
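
To illustrate the streaming model, a simplified, hypothetical sketch of the
save side (not the actual QEMU code): device state is consumed as an
ordered byte stream from the device's migration data fd, with no per-chunk
index:

    ssize_t n;

    while ((n = read(data_fd, buf, buf_size)) > 0) {
        qemu_put_be64(f, n);           /* chunk length */
        qemu_put_buffer(f, buf, n);    /* payload, must stay in read order */
    }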

>
> Is there any background on how that kernel interface was designed?  It
> seems pretty unfriendly to concurrency already: even if multiple devices
> can migrate concurrently, each single device can already contain GBs of
> data as you said, which is pretty common here.  I'm a bit surprised to see
> that the kernel interface is designed in this way for such a device.

Not that I know of. There was a pretty big discussion about the uAPI
back when it was introduced, but nothing formal.
However, I don't think having a few GBs of device data is the common
case for VFIO devices; I just pointed out the extreme cases.

> I was wondering the possibility of whether VFIO can provide data chunks
> with indexes just like RAM (which is represented in ramblock offsets).

I am not sure this would be feasible, as it would involve major changes
to the uAPI and to the devices.
But if that's something you wish to explore, I can ask around and see if
it's a hard no-go.

>> 2. As I mentioned above, the main motivation for separate VFIO migration
>> channel/thread, as I see today, is to allow parallel migration of VFIO
>> devices.
>>     AFAIU multifd, as it is today, doesn't provide such parallelism (i.e., in
>> the complete pre-copy phase each device, be it RAM or VFIO, will fully send
>> its data over the multifd threads and only after finishing will the next
>> device send its data).
> Indeed. That's actually an issue not only to VFIO but also to migration in
> general: we can't migrate device states concurrently, and multifd is out of
> the picture here so far, but maybe there's chance.
>
> Consider huge VMs where there can be already ~500 vCPUs, each need their
> own get()/put() of CPU states from/to KVM.  It'll be nice if we can do this
> in concurrent threads too.  Here VFIO is one of the devices that will also
> benefit from such a design, and greatly.

Interesting, do you know how much time migrating these vCPUs takes?

> I added a todo in our wiki for this, which I see it a general improvement,
> and hopefully someone can look into this:
>
> https://wiki.qemu.org/ToDo/LiveMigration#Device_state_concurrency
>
> I hope VFIO can consider resolving this as a generic issue, rather than
> providing its own solution.
>
>> This is just what came to mind. Maybe we can think this more thoroughly and
>> see if it could work somehow, now or in the future.
>> However, I think making the multifd threads generic so they can send any
>> kind of data is a good thing in general, regardless of VFIO.
> Right.
>
> In general, having a VFIO separate channel may solve the immediate issue,
> but it may still won't solve all, meanwhile it may introduce the first
> example of using completely separate channel that migration won't easily
> manage itself, which IMHO can cause migration much harder to maintain in
> the future.
>
> It may also in the future become some technical debt that VFIO will need to
> take even if such a solution merged, because VFIO could have its own model
> of handling a few similar problems that migration has.
>
> I hope there's some way out that we can work together to improve the
> framework, providing a clean approach and consider for the long terms.

I understand your concern, and I too hope we can work together
towards a proper and maintainable solution.
Even if we don't get VFIO to use the multifd framework directly, maybe
adding common APIs would be good enough.
For example, take this series, which adds a common API for creating
migration channels.
I also saw you and Fabiano have been talking about a thread-pool model
to replace the multifd threads. Maybe we could use such a scheme also for
VFIO and even for the vCPU case you mentioned above, with each component
stating how many threads it needs and creating one big pool for all
migration clients.
Re: [PATCH 04/17] migration/multifd: Set p->running = true in the right place
Posted by Peter Xu 9 months, 3 weeks ago
On Tue, Jan 30, 2024 at 08:44:19PM +0200, Avihai Horon wrote:
> 
> On 30/01/2024 7:57, Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Mon, Jan 29, 2024 at 02:20:35PM +0200, Avihai Horon wrote:
> > > On 29/01/2024 6:17, Peter Xu wrote:
> > > > External email: Use caution opening links or attachments
> > > > 
> > > > 
> > > > On Sun, Jan 28, 2024 at 05:43:52PM +0200, Avihai Horon wrote:
> > > > > On 25/01/2024 22:57, Fabiano Rosas wrote:
> > > > > > External email: Use caution opening links or attachments
> > > > > > 
> > > > > > 
> > > > > > Avihai Horon <avihaih@nvidia.com> writes:
> > > > > > 
> > > > > > > The commit in the fixes line moved multifd thread creation to a
> > > > > > > different location, but forgot to move the p->running = true assignment
> > > > > > > as well. Thus, p->running is set to true before multifd thread is
> > > > > > > actually created.
> > > > > > > 
> > > > > > > p->running is used in multifd_save_cleanup() to decide whether to join
> > > > > > > the multifd thread or not.
> > > > > > > 
> > > > > > > With TLS, an error in multifd_tls_channel_connect() can lead to a
> > > > > > > segmentation fault because p->running is true but p->thread is never
> > > > > > > initialized, so multifd_save_cleanup() tries to join an uninitialized
> > > > > > > thread.
> > > > > > > 
> > > > > > > Fix it by moving p->running = true assignment right after multifd thread
> > > > > > > creation. Also move qio_channel_set_delay() to there, as this is where
> > > > > > > it used to be originally.
> > > > > > > 
> > > > > > > Fixes: 29647140157a ("migration/tls: add support for multifd tls-handshake")
> > > > > > > Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> > > > > > Just for context, I haven't looked at this patch yet, but we were
> > > > > > planning to remove p->running altogether:
> > > > > > 
> > > > > > https://lore.kernel.org/r/20231110200241.20679-1-farosas@suse.de
> > > > > Thanks for putting me in the picture.
> > > > > I see that there has been a discussion about the multifd creation/treadown
> > > > > flow.
> > > > > In light of this discussion, I can already see a few problems in my series
> > > > > that I didn't notice before (such as the TLS handshake thread leak).
> > > > > The thread you mentioned here and some of my patches point out some problems
> > > > > in multifd creation/treardown. I guess we can discuss it and see what's the
> > > > > best way to solve them.
> > > > > 
> > > > > Regarding this patch, your solution indeed solves the bug that this patch
> > > > > addresses, so maybe this could be dropped (or only noted in your patch).
> > > > > 
> > > > > Maybe I should also put you (and Peter) in context for this whole series --
> > > > > I am writing it as preparation for adding a separate migration channel for
> > > > > VFIO device migration, so VFIO devices could be migrated in parallel.
> > > > > So this series tries to lay down some foundations to facilitate it.
> > > > Avihai, is the throughput the only reason that VFIO would like to have a
> > > > separate channel?
> > > Actually, the main reason is to be able to send and load multiple VFIO
> > > devices data in parallel.
> > > For example, today if we have three VFIO devices, they are migrated
> > > sequentially one after another.
> > > This particularly hurts during the complete pre-copy phase (downtime), as
> > > loading the VFIO data in destination involves FW interaction and resource
> > > allocation, which takes time and simply blocks the other devices from
> > > sending and loading their data.
> > > Providing a separate channel and thread for each VIFO device solves this
> > > problem and ideally reduces the VFIO contribution to downtime from sum{VFIO
> > > device #1, ..., VFIO device #N} to max{VFIO device #1, ..., VFIO device #N}.
> > I see.
> > 
> > > > I'm wondering if we can also use multifd threads to send vfio data at some
> > > > point.  Now multifd indeed is closely bound to ram pages but maybe it'll
> > > > change in the near future to take any load?
> > > > 
> > > > Multifd is for solving the throughput issue already. If vfio has the same
> > > > goal, IMHO it'll be good to keep them using the same thread model, instead
> > > > of managing different threads in different places.  With that, any user
> > > > setting (for example, multifd-n-threads) will naturally apply to all
> > > > components, rather than relying on yet-another vfio-migration-threads-num
> > > > parameter.
> > > Frankly, I didn't really put much attention to the throughput factor, and my
> > > plan is to introduce only a single thread per device.
> > > VFIO devices may have many GBs of data to migrate (e.g., vGPUs) and even
> > > mlx5 VFs can have a few GBs of data.
> > > So what you are saying here is interesting, although I didn't test such
> > > scenario to see the actual benefit.
> > > 
> > > I am trying to think if/how this could work and I have a few concerns:
> > > 1. RAM is made of fixed-positioned pages that can be randomly read/written,
> > > so sending these pages over multiple channels and loading them in the
> > > destination can work pretty naturally without much overhead.
> > >     VFIO device data, on the other hand, is just an opaque stream of bytes
> > > from QEMU point of view. This means that if we break this data to "packets"
> > > and send them over multiple channels, we must preserve the order by which
> > > this data was
> > >     originally read from the device and write the data in the same order to
> > > the destination device.
> > >     I am wondering if the overhead of maintaining such order may hurt
> > > performance, making it not worthwhile.
> > Indeed, it seems to me VFIO migration is based on a streaming model where
> > there's no easy way to index a chunk of data.
> 
> Yes, you can see it here as well: https://elixir.bootlin.com/linux/v6.8-rc2/source/include/uapi/linux/vfio.h#L1039
> 
> > 
> > Is there any background on how that kernel interface was designed?  It
> > seems pretty unfriendly to concurrency already: even if multiple devices
> > can migrate concurrently, each single device can already contain GBs of
> > data as you said, which is pretty common here.  I'm a bit surprised to see
> > that the kernel interface is designed in this way for such a device.
> 
> Not that I know of. There has been a pretty big discussion about the uAPI
> back then when it was introduced, but not something formal.
> However, I don't think having a few GBs of device data is the common case
> for VFIO devices, I just pointed out the extreme cases.

I had that impression possibly because our QE team tests VFIO normally with
vGPU, and my memory is 1Q ~= 1GB device data, where NQ ~= N GB (mostly vRAM
per my understanding).

> 
> > I was wondering the possibility of whether VFIO can provide data chunks
> > with indexes just like RAM (which is represented in ramblock offsets).
> 
> I am not sure this would be feasible, as it will involve major changes to
> the uAPI and for the devices.
> But if that's something you wish to explore I can ask around and see if it's
> a hard no go.

The thing is, if VFIO always relies on a single fd read() then it's
impossible to scale.  Maybe there's a chance in the future to provide other
interfaces so that at least data can be concurrently read or updated, even
if it cannot be directly addressed by offset.

> 
> > > 2. As I mentioned above, the main motivation for separate VFIO migration
> > > channel/thread, as I see today, is to allow parallel migration of VFIO
> > > devices.
> > >     AFAIU multifd, as it is today, doesn't provide such parallelism (i.e., in
> > > the complete pre-copy phase each device, be it RAM or VFIO, will fully send
> > > its data over the multifd threads and only after finishing will the next
> > > device send its data).
> > Indeed. That's actually an issue not only to VFIO but also to migration in
> > general: we can't migrate device states concurrently, and multifd is out of
> > the picture here so far, but maybe there's chance.
> > 
> > Consider huge VMs where there can be already ~500 vCPUs, each need their
> > own get()/put() of CPU states from/to KVM.  It'll be nice if we can do this
> > in concurrent threads too.  Here VFIO is one of the devices that will also
> > benefit from such a design, and greatly.
> 
> Interesting, do you know how much time migrating these vCPUs take?

Most of the vCPUs took only a fraction of a millisecond, but some took a
long time, like 100+ ms.  We still haven't looked into why some vCPUs are
special and what caused that slowness in save(); however, when more than
one vCPU takes 100ms, those times add up and contribute to the total
downtime.  That's definitely unwanted.

If we can somehow concurrently submit device state save() jobs to multiple
threads, then it could be potentially useful.
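
A hypothetical sketch of that idea (not existing code): fan out per-device
save() jobs to worker threads and join them all, so slow devices overlap
instead of adding up.  save_job_thread() and the jobs[] array are made-up
placeholders here:

    for (i = 0; i < n_jobs; i++) {
        qemu_thread_create(&threads[i], "dev-save", save_job_thread,
                           &jobs[i], QEMU_THREAD_JOINABLE);
    }
    for (i = 0; i < n_jobs; i++) {
        qemu_thread_join(&threads[i]);
    }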

> 
> > I added a todo in our wiki for this, which I see it a general improvement,
> > and hopefully someone can look into this:
> > 
> > https://wiki.qemu.org/ToDo/LiveMigration#Device_state_concurrency
> > 
> > I hope VFIO can consider resolving this as a generic issue, rather than
> > providing its own solution.
> > 
> > > This is just what came to mind. Maybe we can think this more thoroughly and
> > > see if it could work somehow, now or in the future.
> > > However, I think making the multifd threads generic so they can send any
> > > kind of data is a good thing in general, regardless of VFIO.
> > Right.
> > 
> > In general, having a VFIO separate channel may solve the immediate issue,
> > but it may still won't solve all, meanwhile it may introduce the first
> > example of using completely separate channel that migration won't easily
> > manage itself, which IMHO can cause migration much harder to maintain in
> > the future.
> > 
> > It may also in the future become some technical debt that VFIO will need to
> > take even if such a solution merged, because VFIO could have its own model
> > of handling a few similar problems that migration has.
> > 
> > I hope there's some way out that we can work together to improve the
> > framework, providing a clean approach and consider for the long terms.
> 
> I understand your concern, and I hope as well we can work together towards a
> proper and maintainable solution.
> Even if we don't get VFIO to use the multifd framework directly, maybe
> adding common APIs would be good enough.

Yes.  Note that I haven't looked closely at your patchset yet; as I
mentioned, it may not apply due to the recent fixes.  However, please feel
free to repost whatever you think would still be worthwhile.

> For example, take this series of adding a common API to create migration
> channels.
> I also saw you and Fabiano have been talking about a thread-pool model to
> replace the multifd threads. Maybe we can use such scheme also for VFIO and
> even for the vCPUs case you mentioned above, each component stating how many
> threads it needs and creating a big pool for all migration clients.

We can keep discussing that.  So far I still think it'll be more valuable
if you can propose something that applies not only to VFIO but also to
other devices for concurrent save/load, but I don't think I have thought
everything through thoroughly.  So feel free to go with what you think is
proper, and we can keep the discussion in your new thread if you're going
to prepare something.

Thanks,

-- 
Peter Xu
Re: [PATCH 04/17] migration/multifd: Set p->running = true in the right place
Posted by Avihai Horon 9 months, 3 weeks ago
On 06/02/2024 12:25, Peter Xu wrote:
> External email: Use caution opening links or attachments
>
>
> On Tue, Jan 30, 2024 at 08:44:19PM +0200, Avihai Horon wrote:
>> On 30/01/2024 7:57, Peter Xu wrote:
>>> External email: Use caution opening links or attachments
>>>
>>>
>>> On Mon, Jan 29, 2024 at 02:20:35PM +0200, Avihai Horon wrote:
>>>> On 29/01/2024 6:17, Peter Xu wrote:
>>>>> External email: Use caution opening links or attachments
>>>>>
>>>>>
>>>>> On Sun, Jan 28, 2024 at 05:43:52PM +0200, Avihai Horon wrote:
>>>>>> On 25/01/2024 22:57, Fabiano Rosas wrote:
>>>>>>> External email: Use caution opening links or attachments
>>>>>>>
>>>>>>>
>>>>>>> Avihai Horon <avihaih@nvidia.com> writes:
>>>>>>>
>>>>>>>> The commit in the fixes line moved multifd thread creation to a
>>>>>>>> different location, but forgot to move the p->running = true assignment
>>>>>>>> as well. Thus, p->running is set to true before multifd thread is
>>>>>>>> actually created.
>>>>>>>>
>>>>>>>> p->running is used in multifd_save_cleanup() to decide whether to join
>>>>>>>> the multifd thread or not.
>>>>>>>>
>>>>>>>> With TLS, an error in multifd_tls_channel_connect() can lead to a
>>>>>>>> segmentation fault because p->running is true but p->thread is never
>>>>>>>> initialized, so multifd_save_cleanup() tries to join an uninitialized
>>>>>>>> thread.
>>>>>>>>
>>>>>>>> Fix it by moving p->running = true assignment right after multifd thread
>>>>>>>> creation. Also move qio_channel_set_delay() to there, as this is where
>>>>>>>> it used to be originally.
>>>>>>>>
>>>>>>>> Fixes: 29647140157a ("migration/tls: add support for multifd tls-handshake")
>>>>>>>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>>>>>>> Just for context, I haven't looked at this patch yet, but we were
>>>>>>> planning to remove p->running altogether:
>>>>>>>
>>>>>>> https://lore.kernel.org/r/20231110200241.20679-1-farosas@suse.de
>>>>>> Thanks for putting me in the picture.
>>>>>> I see that there has been a discussion about the multifd creation/treadown
>>>>>> flow.
>>>>>> In light of this discussion, I can already see a few problems in my series
>>>>>> that I didn't notice before (such as the TLS handshake thread leak).
>>>>>> The thread you mentioned here and some of my patches point out some problems
>>>>>> in multifd creation/treardown. I guess we can discuss it and see what's the
>>>>>> best way to solve them.
>>>>>>
>>>>>> Regarding this patch, your solution indeed solves the bug that this patch
>>>>>> addresses, so maybe this could be dropped (or only noted in your patch).
>>>>>>
>>>>>> Maybe I should also put you (and Peter) in context for this whole series --
>>>>>> I am writing it as preparation for adding a separate migration channel for
>>>>>> VFIO device migration, so VFIO devices could be migrated in parallel.
>>>>>> So this series tries to lay down some foundations to facilitate it.
>>>>> Avihai, is the throughput the only reason that VFIO would like to have a
>>>>> separate channel?
>>>> Actually, the main reason is to be able to send and load multiple VFIO
>>>> devices data in parallel.
>>>> For example, today if we have three VFIO devices, they are migrated
>>>> sequentially one after another.
>>>> This particularly hurts during the complete pre-copy phase (downtime), as
>>>> loading the VFIO data in destination involves FW interaction and resource
>>>> allocation, which takes time and simply blocks the other devices from
>>>> sending and loading their data.
>>>> Providing a separate channel and thread for each VIFO device solves this
>>>> problem and ideally reduces the VFIO contribution to downtime from sum{VFIO
>>>> device #1, ..., VFIO device #N} to max{VFIO device #1, ..., VFIO device #N}.
>>> I see.
>>>
>>>>> I'm wondering if we can also use multifd threads to send vfio data at some
>>>>> point.  Now multifd indeed is closely bound to ram pages but maybe it'll
>>>>> change in the near future to take any load?
>>>>>
>>>>> Multifd is for solving the throughput issue already. If vfio has the same
>>>>> goal, IMHO it'll be good to keep them using the same thread model, instead
>>>>> of managing different threads in different places.  With that, any user
>>>>> setting (for example, multifd-n-threads) will naturally apply to all
>>>>> components, rather than relying on yet-another vfio-migration-threads-num
>>>>> parameter.
>>>> Frankly, I didn't really put much attention to the throughput factor, and my
>>>> plan is to introduce only a single thread per device.
>>>> VFIO devices may have many GBs of data to migrate (e.g., vGPUs) and even
>>>> mlx5 VFs can have a few GBs of data.
>>>> So what you are saying here is interesting, although I didn't test such
>>>> scenario to see the actual benefit.
>>>>
>>>> I am trying to think if/how this could work and I have a few concerns:
>>>> 1. RAM is made of fixed-positioned pages that can be randomly read/written,
>>>> so sending these pages over multiple channels and loading them in the
>>>> destination can work pretty naturally without much overhead.
>>>>      VFIO device data, on the other hand, is just an opaque stream of bytes
>>>> from QEMU point of view. This means that if we break this data to "packets"
>>>> and send them over multiple channels, we must preserve the order by which
>>>> this data was
>>>>      originally read from the device and write the data in the same order to
>>>> the destination device.
>>>>      I am wondering if the overhead of maintaining such order may hurt
>>>> performance, making it not worthwhile.
>>> Indeed, it seems to me VFIO migration is based on a streaming model where
>>> there's no easy way to index a chunk of data.
>> Yes, you can see it here as well: https://elixir.bootlin.com/linux/v6.8-rc2/source/include/uapi/linux/vfio.h#L1039
>>
>>> Is there any background on how that kernel interface was designed?  It
>>> seems pretty unfriendly to concurrency already: even if multiple devices
>>> can migrate concurrently, each single device can already contain GBs of
>>> data as you said, which is pretty common here.  I'm a bit surprised to see
>>> that the kernel interface is designed in this way for such a device.
>> Not that I know of. There has been a pretty big discussion about the uAPI
>> back then when it was introduced, but not something formal.
>> However, I don't think having a few GBs of device data is the common case
>> for VFIO devices, I just pointed out the extreme cases.
> I had that impression possibly because our QE team tests VFIO normally with
> vGPU, and my memory is 1Q ~= 1GB device data, where NQ ~= N GB (mostly vRAM
> per my understanding).
>
>>> I was wondering the possibility of whether VFIO can provide data chunks
>>> with indexes just like RAM (which is represented in ramblock offsets).
>> I am not sure this would be feasible, as it will involve major changes to
>> the uAPI and for the devices.
>> But if that's something you wish to explore I can ask around and see if it's
>> a hard no go.
> The thing is if VFIO always relies on 1 fd read() then it's impossible to
> scale.  Maybe there's chance in the future to provide other interfaces so
> that at least data can be concurrently read or updated, even if they can
> not directly be offseted.

Yes, this could be some future work if it turns out to be a serious 
bottleneck.

>
>>>> 2. As I mentioned above, the main motivation for separate VFIO migration
>>>> channel/thread, as I see today, is to allow parallel migration of VFIO
>>>> devices.
>>>>      AFAIU multifd, as it is today, doesn't provide such parallelism (i.e., in
>>>> the complete pre-copy phase each device, be it RAM or VFIO, will fully send
>>>> its data over the multifd threads and only after finishing will the next
>>>> device send its data).
>>> Indeed. That's actually an issue not only to VFIO but also to migration in
>>> general: we can't migrate device states concurrently, and multifd is out of
>>> the picture here so far, but maybe there's chance.
>>>
>>> Consider huge VMs where there can be already ~500 vCPUs, each need their
>>> own get()/put() of CPU states from/to KVM.  It'll be nice if we can do this
>>> in concurrent threads too.  Here VFIO is one of the devices that will also
>>> benefit from such a design, and greatly.
>> Interesting, do you know how much time migrating these vCPUs take?
> Most of the vCPUs took only a portion of milliseconds, but some may took a
> long time like 100+ ms.  We still haven't looked into why some vCPU is
> special and what caused that slowness when save(), however consider >1
> vCPUs taking 100ms they'll be added up lump sum contributing to the total
> downtime.  That's definitely unwanted.
>
> If we can somehow concurrently submit device states save() jobs to multiple
> threads, then it could be potentially useful.

Yes, I agree.

>
>>> I added a todo in our wiki for this, which I see it a general improvement,
>>> and hopefully someone can look into this:
>>>
>>> https://wiki.qemu.org/ToDo/LiveMigration#Device_state_concurrency
>>>
>>> I hope VFIO can consider resolving this as a generic issue, rather than
>>> providing its own solution.
>>>
>>>> This is just what came to mind. Maybe we can think this more thoroughly and
>>>> see if it could work somehow, now or in the future.
>>>> However, I think making the multifd threads generic so they can send any
>>>> kind of data is a good thing in general, regardless of VFIO.
>>> Right.
>>>
>>> In general, having a VFIO separate channel may solve the immediate issue,
>>> but it may still won't solve all, meanwhile it may introduce the first
>>> example of using completely separate channel that migration won't easily
>>> manage itself, which IMHO can cause migration much harder to maintain in
>>> the future.
>>>
>>> It may also in the future become some technical debt that VFIO will need to
>>> take even if such a solution merged, because VFIO could have its own model
>>> of handling a few similar problems that migration has.
>>>
>>> I hope there's some way out that we can work together to improve the
>>> framework, providing a clean approach and consider for the long terms.
>> I understand your concern, and I hope as well we can work together towards a
>> proper and maintainable solution.
>> Even if we don't get VFIO to use the multifd framework directly, maybe
>> adding common APIs would be good enough.
> Yes.  Note that I haven't looked closely in your patchsets right now, as I
> mentioned it may not apply due to the recent fixes.  However please feel
> free to repost whatever that you think would still be worthwhile.

Sure.

>
>> For example, take this series of adding a common API to create migration
>> channels.
>> I also saw you and Fabiano have been talking about a thread-pool model to
>> replace the multifd threads. Maybe we can use such scheme also for VFIO and
>> even for the vCPUs case you mentioned above, each component stating how many
>> threads it needs and creating a big pool for all migration clients.
> We can keep discussing that.  So far I still think it'll be more valuable
> if you can propose something that will apply not only to VFIO but also
> other devices on concurrent save/loads, but I don't think I have everything
> thought thoroughly.  So feel free to go with what you think proper, and we
> can keep the discussion in your new thread if you're going to prepare
> something.

Well, it turns out that Oracle has been coding a PoC for VFIO parallel 
migration over the multifd framework.

It looks promising, so I will drop my efforts on the current approach,
giving the lead to Maciej from Oracle (CCed here), who coded this PoC.
AFAIK, he is currently working on an initial series based on his PoC and
will post it when ready. In the meanwhile, I think it would be good to CC
him on other multifd work that could be related to VFIO parallel migration.

Thanks!