[PATCH] Use multifd state to determine if multifd cleanup is needed

Posted by Shivam Kumar 1 month, 2 weeks ago
If the client calls the QMP command to reset the migration
capabilities after the migration status is set to failed or cancelled
but before multifd cleanup starts, multifd cleanup can be skipped as
it will falsely assume that multifd was not used for migration. This
will eventually lead to source QEMU crashing due to the following
assertion failure:

yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)`
failed

Check multifd state to determine whether multifd was used or not for
the migration rather than checking the state of multifd migration
capability.

Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
---
 migration/multifd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/migration/multifd.c b/migration/multifd.c
index 9b200f4ad9..427c9a7956 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -487,7 +487,7 @@ void multifd_send_shutdown(void)
 {
     int i;
 
-    if (!migrate_multifd()) {
+    if (!multifd_send_state) {
         return;
     }
 
-- 
2.22.3
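
For readers following the thread, the race in the commit message can be
modelled outside QEMU. The sketch below is a standalone toy, not QEMU code:
cap_multifd, send_state, registered_yank_fns and every sketch_* function are
hypothetical stand-ins for the multifd capability, multifd_send_state and the
list of registered yank functions. It only illustrates why gating cleanup on
the capability can leave a registration behind when the capability is cleared
between setup and shutdown, while gating on the state pointer does not.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these are QEMU's real definitions. */
static bool cap_multifd;          /* models the multifd capability bit */
static int *send_state;           /* models multifd_send_state */
static int registered_yank_fns;   /* models entries on the yank list */

/* Models setup: allocates state and registers one yank function. */
static void sketch_send_setup(void)
{
    if (!cap_multifd) {
        return;
    }
    send_state = calloc(1, sizeof(*send_state));
    if (!send_state) {
        abort();
    }
    registered_yank_fns = 1;
}

/* Cleanup gated on the capability, as before the patch. */
static void sketch_shutdown_by_capability(void)
{
    if (!cap_multifd) {
        return;   /* skipped if the capability was cleared in the meantime */
    }
    registered_yank_fns = 0;
    free(send_state);
    send_state = NULL;
}

/* Cleanup gated on the state pointer, as in the patch. */
static void sketch_shutdown_by_state(void)
{
    if (!send_state) {
        return;
    }
    registered_yank_fns = 0;
    free(send_state);
    send_state = NULL;
}

/* Runs the racy sequence and returns how many yank functions are left. */
static int sketch_race(void (*shutdown)(void))
{
    int left;

    cap_multifd = true;
    sketch_send_setup();
    cap_multifd = false;          /* client resets capabilities after FAILED */
    shutdown();
    left = registered_yank_fns;
    sketch_shutdown_by_state();   /* release anything still held by the toy */
    return left;
}
```

Under these assumptions, sketch_race(sketch_shutdown_by_capability) returns 1
(one stale registration, the condition the yank assertion trips on), while
sketch_race(sketch_shutdown_by_state) returns 0.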
Re: [PATCH] Use multifd state to determine if multifd cleanup is needed
Posted by Peter Xu 1 month, 2 weeks ago
On Mon, Oct 07, 2024 at 03:44:51PM +0000, Shivam Kumar wrote:
> If the client calls the QMP command to reset the migration
> capabilities after the migration status is set to failed or cancelled

Is cancelled ok?

Asked because I think migrate_fd_cleanup() should still be in CANCELLING
stage there, so no one can disable multifd capability before that, it
should fail the QMP command.

But FAILED indeed looks problematic.

IIUC it's not only to multifd alone - is it a race condition that
migrate_fd_cleanup() can be invoked without migration_is_running() keeps
being true?  Then I wonder what happens if a concurrent QMP "migrate"
happens together with migrate_fd_cleanup(), even with multifd always off.

Do we perhaps need to cleanup everything before the state changes to
FAILED?

> but before multifd cleanup starts, multifd cleanup can be skipped as
> it will falsely assume that multifd was not used for migration. This
> will eventually lead to source QEMU crashing due to the following
> assertion failure:
> 
> yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)`
> failed
> 
> Check multifd state to determine whether multifd was used or not for
> the migration rather than checking the state of multifd migration
> capability.
> 
> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
> ---
>  migration/multifd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/migration/multifd.c b/migration/multifd.c
> index 9b200f4ad9..427c9a7956 100644
> --- a/migration/multifd.c
> +++ b/migration/multifd.c
> @@ -487,7 +487,7 @@ void multifd_send_shutdown(void)
>  {
>      int i;
>  
> -    if (!migrate_multifd()) {
> +    if (!multifd_send_state) {
>          return;
>      }
>  
> -- 
> 2.22.3
> 

-- 
Peter Xu
Re: [PATCH] Use multifd state to determine if multifd cleanup is needed
Posted by Fabiano Rosas 1 month, 2 weeks ago
Peter Xu <peterx@redhat.com> writes:

> On Mon, Oct 07, 2024 at 03:44:51PM +0000, Shivam Kumar wrote:
>> If the client calls the QMP command to reset the migration
>> capabilities after the migration status is set to failed or cancelled
>
> Is cancelled ok?
>
> Asked because I think migrate_fd_cleanup() should still be in CANCELLING
> stage there, so no one can disable multifd capability before that, it
> should fail the QMP command.
>
> But FAILED indeed looks problematic.
>
> IIUC it's not only to multifd alone - is it a race condition that
> migrate_fd_cleanup() can be invoked without migration_is_running() keeps
> being true?  Then I wonder what happens if a concurrent QMP "migrate"
> happens together with migrate_fd_cleanup(), even with multifd always off.
>
> Do we perhaps need to cleanup everything before the state changes to
> FAILED?
>

Should we make CANCELLED the only terminal state aside from COMPLETED?
So migrate_fd_cleanup would set CANCELLED whenever it sees either
CANCELLING or FAILED.
Re: [PATCH] Use multifd state to determine if multifd cleanup is needed
Posted by Peter Xu 1 month, 2 weeks ago
On Tue, Oct 08, 2024 at 11:20:03AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Mon, Oct 07, 2024 at 03:44:51PM +0000, Shivam Kumar wrote:
> >> If the client calls the QMP command to reset the migration
> >> capabilities after the migration status is set to failed or cancelled
> >
> > Is cancelled ok?
> >
> > Asked because I think migrate_fd_cleanup() should still be in CANCELLING
> > stage there, so no one can disable multifd capability before that, it
> > should fail the QMP command.
> >
> > But FAILED indeed looks problematic.
> >
> > IIUC it's not only to multifd alone - is it a race condition that
> > migrate_fd_cleanup() can be invoked without migration_is_running() keeps
> > being true?  Then I wonder what happens if a concurrent QMP "migrate"
> > happens together with migrate_fd_cleanup(), even with multifd always off.
> >
> > Do we perhaps need to cleanup everything before the state changes to
> > FAILED?
> >
> 
> Should we make CANCELLED the only terminal state aside from COMPLETED?
> So migrate_fd_cleanup would set CANCELLED whenever it sees either
> CANCELLING or FAILED.

I think that may be a major ABI change that can be risky, as we normally
see CANCELLED to be user's choice.

If we really want an ABI change, we could also introduce FAILING too, but I
wonder what I replied in the other email could work without any ABI change,
but close the gap on this race.

-- 
Peter Xu
Re: [PATCH] Use multifd state to determine if multifd cleanup is needed
Posted by Fabiano Rosas 1 month, 2 weeks ago
Peter Xu <peterx@redhat.com> writes:

> On Tue, Oct 08, 2024 at 11:20:03AM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Mon, Oct 07, 2024 at 03:44:51PM +0000, Shivam Kumar wrote:
>> >> If the client calls the QMP command to reset the migration
>> >> capabilities after the migration status is set to failed or cancelled
>> >
>> > Is cancelled ok?
>> >
>> > Asked because I think migrate_fd_cleanup() should still be in CANCELLING
>> > stage there, so no one can disable multifd capability before that, it
>> > should fail the QMP command.
>> >
>> > But FAILED indeed looks problematic.
>> >
>> > IIUC it's not only to multifd alone - is it a race condition that
>> > migrate_fd_cleanup() can be invoked without migration_is_running() keeps
>> > being true?  Then I wonder what happens if a concurrent QMP "migrate"
>> > happens together with migrate_fd_cleanup(), even with multifd always off.
>> >
>> > Do we perhaps need to cleanup everything before the state changes to
>> > FAILED?
>> >
>> 
>> Should we make CANCELLED the only terminal state aside from COMPLETED?
>> So migrate_fd_cleanup would set CANCELLED whenever it sees either
>> CANCELLING or FAILED.
>
> I think that may be a major ABI change that can be risky, as we normally
> see CANCELLED to be user's choice.

Ok, I misunderstood your proposal.

>
> If we really want an ABI change, we could also introduce FAILING too, but I
> wonder what I replied in the other email could work without any ABI change,
> but close the gap on this race.

I don't think we want a FAILING state, but indeed something else that
conveys the same meaning as CANCELLING. I have already suggested
something similar in our TODO list[1]. We need a clear indication of
both "cancelling" and "failing" that's decoupled from the state ABI. Of
course we're talking only about "failing" here, we can leave
"cancelling" which is more complex for another time maybe.

What multifd does with ->exiting seems sane to me.

1- https://wiki.qemu.org/ToDo/LiveMigration#Migration_cancel_concurrency
Re: [PATCH] Use multifd state to determine if multifd cleanup is needed
Posted by Shivam Kumar 1 month, 2 weeks ago

On 9 Oct 2024, at 12:10 AM, Fabiano Rosas <farosas@suse.de> wrote:

> Peter Xu <peterx@redhat.com> writes:
>
>> On Tue, Oct 08, 2024 at 11:20:03AM -0300, Fabiano Rosas wrote:
>>> Peter Xu <peterx@redhat.com> writes:
>>>
>>>> On Mon, Oct 07, 2024 at 03:44:51PM +0000, Shivam Kumar wrote:
>>>>> If the client calls the QMP command to reset the migration
>>>>> capabilities after the migration status is set to failed or cancelled
>>>>
>>>> Is cancelled ok?
>>>>
>>>> Asked because I think migrate_fd_cleanup() should still be in CANCELLING
>>>> stage there, so no one can disable multifd capability before that, it
>>>> should fail the QMP command.
>>>>
>>>> But FAILED indeed looks problematic.
>>>>
>>>> IIUC it's not only to multifd alone - is it a race condition that
>>>> migrate_fd_cleanup() can be invoked without migration_is_running() keeps
>>>> being true?  Then I wonder what happens if a concurrent QMP "migrate"
>>>> happens together with migrate_fd_cleanup(), even with multifd always off.
>>>>
>>>> Do we perhaps need to cleanup everything before the state changes to
>>>> FAILED?
>>>>
>>>
>>> Should we make CANCELLED the only terminal state aside from COMPLETED?
>>> So migrate_fd_cleanup would set CANCELLED whenever it sees either
>>> CANCELLING or FAILED.
>>
>> I think that may be a major ABI change that can be risky, as we normally
>> see CANCELLED to be user's choice.
>
> Ok, I misunderstood your proposal.
>
>> If we really want an ABI change, we could also introduce FAILING too, but I
>> wonder what I replied in the other email could work without any ABI change,
>> but close the gap on this race.
>
> I don't think we want a FAILING state, but indeed something else that
> conveys the same meaning as CANCELLING. I have already suggested
> something similar in our TODO list[1]. We need a clear indication of
> both "cancelling" and "failing" that's decoupled from the state ABI. Of
> course we're talking only about "failing" here, we can leave
> "cancelling" which is more complex for another time maybe.
>
> What multifd does with ->exiting seems sane to me.
>
> 1- https://wiki.qemu.org/ToDo/LiveMigration#Migration_cancel_concurrency

Having flags to track the 'cancelling' and 'failing' states makes
sense. I think they should be a part of MigrationState itself. I will
send follow-up patches.

However, can this patch be accepted as a cosmetic change? To me, it
makes sense to check 'multifd_send_state', and not the multifd migration
capability, before cleaning up 'multifd_send_state'. This also helps
with at least one race (with qmp_migrate_set_capabilities).
Please let me know if you have different thoughts.
Re: [PATCH] Use multifd state to determine if multifd cleanup is needed
Posted by Fabiano Rosas 1 month, 2 weeks ago
Shivam Kumar <shivam.kumar1@nutanix.com> writes:

> On 9 Oct 2024, at 12:10 AM, Fabiano Rosas <farosas@suse.de> wrote:
>
> Peter Xu <peterx@redhat.com> writes:
>
> On Tue, Oct 08, 2024 at 11:20:03AM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> On Mon, Oct 07, 2024 at 03:44:51PM +0000, Shivam Kumar wrote:
> If the client calls the QMP command to reset the migration
> capabilities after the migration status is set to failed or cancelled
>
> Is cancelled ok?
>
> Asked because I think migrate_fd_cleanup() should still be in CANCELLING
> stage there, so no one can disable multifd capability before that, it
> should fail the QMP command.
>
> But FAILED indeed looks problematic.
>
> IIUC it's not only to multifd alone - is it a race condition that
> migrate_fd_cleanup() can be invoked without migration_is_running() keeps
> being true?  Then I wonder what happens if a concurrent QMP "migrate"
> happens together with migrate_fd_cleanup(), even with multifd always off.
>
> Do we perhaps need to cleanup everything before the state changes to
> FAILED?
>
>
> Should we make CANCELLED the only terminal state aside from COMPLETED?
> So migrate_fd_cleanup would set CANCELLED whenever it sees either
> CANCELLING or FAILED.
>
> I think that may be a major ABI change that can be risky, as we normally
> see CANCELLED to be user's choice.
>
> Ok, I misunderstood your proposal.
>
>
> If we really want an ABI change, we could also introduce FAILING too, but I
> wonder what I replied in the other email could work without any ABI change,
> but close the gap on this race.
>
> I don't think we want a FAILING state, but indeed something else that
> conveys the same meaning as CANCELLING. I have already suggested
> something similar in our TODO list[1]. We need a clear indication of
> both "cancelling" and "failing" that's decoupled from the state ABI. Of
> course we're talking only about "failing" here, we can leave
> "cancelling" which is more complex for another time maybe.
>
> What multifd does with ->exiting seems sane to me.
>
> 1- https://wiki.qemu.org/ToDo/LiveMigration#Migration_cancel_concurrency
>
> Having flags to track the 'cancelling' and 'failing' states makes
> sense. I think they should be a part of MigrationState itself. I will
> send follow-up patches.
>
> However, can this patch be accepted as a cosmetic change? To me, it
> makes sense to check 'multifd_send_state', and not the multifd migration
> capability, before cleaning up 'multifd_send_state'. This also helps
> with at least one race (with qmp_migrate_set_capabilities).
> Please let me know if you have different thoughts.

This change cannot be considered cosmetic. There is the implication that
checking a capability state is not enough to determine if code
pertaining to that feature can be executed. So it makes for more
confusing code overall. I'd prefer if we fixed the underlying issue of
reaching multifd_send_shutdown() while capabilities can be cleared.
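
One way to picture "fixing the underlying issue" is a guard in the capability
setter that refuses changes until cleanup has run. This is only a sketch
under stated assumptions: the capsk_* names, the status enum and especially
the cleanup_pending flag are hypothetical and introduced here for
illustration; the thread does not settle on a mechanism, and none of this is
QEMU's actual code.

```c
#include <stdbool.h>

/* Hypothetical status values, loosely mirroring the states in the thread. */
enum capsk_status {
    CAPSK_NONE,
    CAPSK_ACTIVE,
    CAPSK_CANCELLING,
    CAPSK_FAILED,
    CAPSK_CANCELLED,
    CAPSK_COMPLETED,
};

struct capsk_migration {
    enum capsk_status status;
    bool cleanup_pending;   /* assumed flag: set at start, cleared by cleanup */
    bool cap_multifd;
};

/* Capabilities stay locked while running or while cleanup has not run. */
static bool capsk_caps_locked(const struct capsk_migration *s)
{
    return s->status == CAPSK_ACTIVE ||
           s->status == CAPSK_CANCELLING ||
           s->cleanup_pending;
}

/* Returns false, leaving the capability untouched, if changes are locked. */
static bool capsk_set_multifd(struct capsk_migration *s, bool value)
{
    if (capsk_caps_locked(s)) {
        return false;
    }
    s->cap_multifd = value;
    return true;
}

/* Models the end of cleanup: only now may capabilities change again. */
static void capsk_cleanup_done(struct capsk_migration *s)
{
    s->cleanup_pending = false;
}
```

With such a guard, a request to clear the capability in the FAILED state is
rejected while cleanup_pending is still set, so a capability check inside the
cleanup path would still see the value the migration started with.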
Re: [PATCH] Use multifd state to determine if multifd cleanup is needed
Posted by Shivam Kumar 1 month, 2 weeks ago

On 7 Oct 2024, at 9:56 PM, Peter Xu <peterx@redhat.com> wrote:

> On Mon, Oct 07, 2024 at 03:44:51PM +0000, Shivam Kumar wrote:
>> If the client calls the QMP command to reset the migration
>> capabilities after the migration status is set to failed or cancelled
>
> Is cancelled ok?
>
> Asked because I think migrate_fd_cleanup() should still be in CANCELLING
> stage there, so no one can disable multifd capability before that, it
> should fail the QMP command.

I meant CANCELLED but I can see that currently, CANCELLED is only possible
after migrate_fd_cleanup is called. So, you are right. We won’t have a problem
in that path at least.

> But FAILED indeed looks problematic.
>
> IIUC it's not only to multifd alone - is it a race condition that
> migrate_fd_cleanup() can be invoked without migration_is_running() keeps
> being true?  Then I wonder what happens if a concurrent QMP "migrate"
> happens together with migrate_fd_cleanup(), even with multifd always off.
>
> Do we perhaps need to cleanup everything before the state changes to
> FAILED?

Tried calling migrate_fd_cleanup before (and just after) setting the status to
failed. The migration thread gets stuck in close_return_path_on_source trying
to join rp_thread.

>> but before multifd cleanup starts, multifd cleanup can be skipped as
>> it will falsely assume that multifd was not used for migration. This
>> will eventually lead to source QEMU crashing due to the following
>> assertion failure:
>>
>> yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)`
>> failed
>>
>> Check multifd state to determine whether multifd was used or not for
>> the migration rather than checking the state of multifd migration
>> capability.
>>
>> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
>> ---
>>  migration/multifd.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/migration/multifd.c b/migration/multifd.c
>> index 9b200f4ad9..427c9a7956 100644
>> --- a/migration/multifd.c
>> +++ b/migration/multifd.c
>> @@ -487,7 +487,7 @@ void multifd_send_shutdown(void)
>>  {
>>      int i;
>>
>> -    if (!migrate_multifd()) {
>> +    if (!multifd_send_state) {
>>          return;
>>      }
>>
>> --
>> 2.22.3
>>
>
> --
> Peter Xu
Re: [PATCH] Use multifd state to determine if multifd cleanup is needed
Posted by Peter Xu 1 month, 2 weeks ago
On Tue, Oct 08, 2024 at 12:09:03PM +0000, Shivam Kumar wrote:
> On 7 Oct 2024, at 9:56 PM, Peter Xu <peterx@redhat.com> wrote:
>
>> On Mon, Oct 07, 2024 at 03:44:51PM +0000, Shivam Kumar wrote:
>>> If the client calls the QMP command to reset the migration
>>> capabilities after the migration status is set to failed or cancelled
>>
>> Is cancelled ok?
>>
>> Asked because I think migrate_fd_cleanup() should still be in CANCELLING
>> stage there, so no one can disable multifd capability before that, it
>> should fail the QMP command.
>
> I meant CANCELLED but I can see that currently, CANCELLED is only possible
> after migrate_fd_cleanup is called. So, you are right. We won’t have a problem
> in that path at least.
>
>> But FAILED indeed looks problematic.
>>
>> IIUC it's not only to multifd alone - is it a race condition that
>> migrate_fd_cleanup() can be invoked without migration_is_running() keeps
>> being true?  Then I wonder what happens if a concurrent QMP "migrate"
>> happens together with migrate_fd_cleanup(), even with multifd always off.
>>
>> Do we perhaps need to cleanup everything before the state changes to
>> FAILED?
>
> Tried calling migrate_fd_cleanup before (and just after) setting the status to
> failed. The migration thread gets stuck in close_return_path_on_source trying
> to join rp_thread.

I am guessing it's because the new rp thread is created before cleanup of
the previous one in this case, so the join() will hang forever.

In this case, below change might not be enough I guess, as I discussed
above.

We may need to postpone setting the FAILED status until after everything is
cleaned up, just like what we do with CANCELLING. Maybe we don't need a
FAILING state if we have migrate_set/has_error(): we can use
migrate_has/set_error() for whatever we used to do (set/check) with FAILED,
then set FAILED last, in the BH, like CANCELLED.
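
A minimal sketch of that ordering, assuming a simplified state machine: the
failsk_* names, the struct layout and the resources_held field are
hypothetical and only model the idea; they are not QEMU's MigrationState or
its real migrate_set_error()/migrate_has_error() helpers. The point is that
recording an error leaves the status "running", and the terminal FAILED value
becomes visible only once cleanup has finished.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical status values, loosely mirroring the states in the thread. */
enum failsk_status {
    FAILSK_ACTIVE,
    FAILSK_CANCELLING,
    FAILSK_CANCELLED,
    FAILSK_FAILED,
    FAILSK_COMPLETED,
};

struct failsk_migration {
    enum failsk_status status;
    const char *error;      /* first recorded error, NULL if none */
    bool resources_held;    /* models threads, channels, yank functions */
};

/* Record the first error without touching the externally visible status. */
static void failsk_set_error(struct failsk_migration *s, const char *msg)
{
    if (!s->error) {
        s->error = msg;
    }
}

static bool failsk_has_error(const struct failsk_migration *s)
{
    return s->error != NULL;
}

/* Still "running" until cleanup publishes a terminal status. */
static bool failsk_is_running(const struct failsk_migration *s)
{
    return s->status == FAILSK_ACTIVE || s->status == FAILSK_CANCELLING;
}

/* Models the cleanup BH: release everything, then set the terminal status. */
static void failsk_cleanup(struct failsk_migration *s)
{
    s->resources_held = false;

    if (s->status == FAILSK_CANCELLING) {
        s->status = FAILSK_CANCELLED;
    } else if (failsk_has_error(s)) {
        s->status = FAILSK_FAILED;
    } else {
        s->status = FAILSK_COMPLETED;
    }
}
```

In this model, a client that polls the status cannot observe FAILED while
resources are still held, so any command that is refused while the migration
is running stays refused until cleanup is complete.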

>>> but before multifd cleanup starts, multifd cleanup can be skipped as
>>> it will falsely assume that multifd was not used for migration. This
>>> will eventually lead to source QEMU crashing due to the following
>>> assertion failure:
>>>
>>> yank_unregister_instance: Assertion `QLIST_EMPTY(&entry->yankfns)`
>>> failed
>>>
>>> Check multifd state to determine whether multifd was used or not for
>>> the migration rather than checking the state of multifd migration
>>> capability.
>>>
>>> Signed-off-by: Shivam Kumar <shivam.kumar1@nutanix.com>
>>> ---
>>>  migration/multifd.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/migration/multifd.c b/migration/multifd.c
>>> index 9b200f4ad9..427c9a7956 100644
>>> --- a/migration/multifd.c
>>> +++ b/migration/multifd.c
>>> @@ -487,7 +487,7 @@ void multifd_send_shutdown(void)
>>>  {
>>>      int i;
>>>
>>> -    if (!migrate_multifd()) {
>>> +    if (!multifd_send_state) {
>>>          return;
>>>      }
>>>
>>> --
>>> 2.22.3
>>>
>>
>> --
>> Peter Xu
>

-- 
Peter Xu