[RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling

Juraj Marcin posted 4 patches 4 months, 1 week ago
Maintainers: Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>, Eric Blake <eblake@redhat.com>, Markus Armbruster <armbru@redhat.com>, Laurent Vivier <lvivier@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, Stefan Weil <sw@weilnetz.de>
[RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling
Posted by Juraj Marcin 4 months, 1 week ago
From: Juraj Marcin <jmarcin@redhat.com>

Depending on where an error during postcopy_start() happens, the state
can be either "active", "device" or "cancelling", but never
"postcopy-active". Migration state is transitioned to "postcopy-active"
only just before a successful return from the function.

Accept any state except "cancelling" when transitioning to "failed"
state.

Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
---
 migration/migration.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index 10c216d25d..e5ce2940d5 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -2872,8 +2872,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
 fail_closefb:
     qemu_fclose(fb);
 fail:
-    migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
-                          MIGRATION_STATUS_FAILED);
+    if ( ms->state != MIGRATION_STATUS_CANCELLING) {
+        migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
+    }
     migration_block_activate(NULL);
     migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
     bql_unlock();
-- 
2.50.1
Re: [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling
Posted by Peter Xu 4 months, 1 week ago
On Thu, Aug 07, 2025 at 01:49:10PM +0200, Juraj Marcin wrote:
> From: Juraj Marcin <jmarcin@redhat.com>
> 
> Depending on where an error during postcopy_start() happens, the state
> can be either "active", "device" or "cancelling", but never
> "postcopy-active". Migration state is transitioned to "postcopy-active"
> only just before a successful return from the function.
> 
> Accept any state except "cancelling" when transitioning to "failed"
> state.
> 
> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> ---
>  migration/migration.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 10c216d25d..e5ce2940d5 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -2872,8 +2872,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
>  fail_closefb:
>      qemu_fclose(fb);
>  fail:
> -    migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> -                          MIGRATION_STATUS_FAILED);
> +    if ( ms->state != MIGRATION_STATUS_CANCELLING) {
> +        migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
> +    }

Hmm, this might have been overlooked from my commit 48814111366b.  Maybe
worth a Fixes and copy stable?

For example, I would expect the old code (prior of 48814111366b) still be
able to fail postcopy and resume src QEMU if qemu_savevm_send_packaged()
failed.  Now, looks like it'll be stuck at "device" state..

The other thing is it also looks like a common pattern to set FAILED
meanwhile not messing with a CANCELLING stage.  It's not easy to always
remember this, so maybe we should consider having a helper function?

  migrate_set_failure(MigrationState *, Error *err);

Which could set err with migrate_set_error() (likely we could also
error_report() the error), and update FAILED iff it's not CANCELLING.

I saw three of such occurances that such helper may apply, but worth double
check:

postcopy_start[2725]           if (ms->state != MIGRATION_STATUS_CANCELLING) {
migration_completion[3069]     if (s->state != MIGRATION_STATUS_CANCELLING) {
igration_connect[4064]        if (s->state != MIGRATION_STATUS_CANCELLING) {

If the cleanup looks worthwhile, and if the Fixes apply, we could have the
cleanup patch on top of the fixes patch so patch 1 is easier to backport.

Thanks,

>      migration_block_activate(NULL);
>      migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
>      bql_unlock();
> -- 
> 2.50.1
> 

-- 
Peter Xu
Re: [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling
Posted by Fabiano Rosas 4 months, 1 week ago
Peter Xu <peterx@redhat.com> writes:

> On Thu, Aug 07, 2025 at 01:49:10PM +0200, Juraj Marcin wrote:
>> From: Juraj Marcin <jmarcin@redhat.com>
>> 
>> Depending on where an error during postcopy_start() happens, the state
>> can be either "active", "device" or "cancelling", but never
>> "postcopy-active". Migration state is transitioned to "postcopy-active"
>> only just before a successful return from the function.
>> 
>> Accept any state except "cancelling" when transitioning to "failed"
>> state.
>> 
>> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
>> ---
>>  migration/migration.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>> 
>> diff --git a/migration/migration.c b/migration/migration.c
>> index 10c216d25d..e5ce2940d5 100644
>> --- a/migration/migration.c
>> +++ b/migration/migration.c
>> @@ -2872,8 +2872,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
>>  fail_closefb:
>>      qemu_fclose(fb);
>>  fail:
>> -    migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
>> -                          MIGRATION_STATUS_FAILED);
>> +    if ( ms->state != MIGRATION_STATUS_CANCELLING) {
>> +        migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
>> +    }
>
> Hmm, this might have been overlooked from my commit 48814111366b.  Maybe
> worth a Fixes and copy stable?
>
> For example, I would expect the old code (prior of 48814111366b) still be
> able to fail postcopy and resume src QEMU if qemu_savevm_send_packaged()
> failed.  Now, looks like it'll be stuck at "device" state..
>
> The other thing is it also looks like a common pattern to set FAILED
> meanwhile not messing with a CANCELLING stage.  It's not easy to always
> remember this, so maybe we should consider having a helper function?
>
>   migrate_set_failure(MigrationState *, Error *err);
>

We didn't do it back then because at there would be some logical
conflict with this series:

https://lore.kernel.org/r/20250110100707.4805-1-shivam.kumar1@nutanix.com

But I don't remember the details. If it works this time I'm all for it.

> Which could set err with migrate_set_error() (likely we could also
> error_report() the error), and update FAILED iff it's not CANCELLING.
>
> I saw three of such occurances that such helper may apply, but worth double
> check:
>
> postcopy_start[2725]           if (ms->state != MIGRATION_STATUS_CANCELLING) {
> migration_completion[3069]     if (s->state != MIGRATION_STATUS_CANCELLING) {
> igration_connect[4064]        if (s->state != MIGRATION_STATUS_CANCELLING) {
>
> If the cleanup looks worthwhile, and if the Fixes apply, we could have the
> cleanup patch on top of the fixes patch so patch 1 is easier to backport.
>
> Thanks,
>
>>      migration_block_activate(NULL);
>>      migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
>>      bql_unlock();
>> -- 
>> 2.50.1
>>
Re: [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling
Posted by Juraj Marcin 4 months ago
Hi Fabiano

On 2025-08-08 16:08, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Thu, Aug 07, 2025 at 01:49:10PM +0200, Juraj Marcin wrote:
> >> From: Juraj Marcin <jmarcin@redhat.com>
> >> 
> >> Depending on where an error during postcopy_start() happens, the state
> >> can be either "active", "device" or "cancelling", but never
> >> "postcopy-active". Migration state is transitioned to "postcopy-active"
> >> only just before a successful return from the function.
> >> 
> >> Accept any state except "cancelling" when transitioning to "failed"
> >> state.
> >> 
> >> Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> >> ---
> >>  migration/migration.c | 5 +++--
> >>  1 file changed, 3 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/migration/migration.c b/migration/migration.c
> >> index 10c216d25d..e5ce2940d5 100644
> >> --- a/migration/migration.c
> >> +++ b/migration/migration.c
> >> @@ -2872,8 +2872,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
> >>  fail_closefb:
> >>      qemu_fclose(fb);
> >>  fail:
> >> -    migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> >> -                          MIGRATION_STATUS_FAILED);
> >> +    if ( ms->state != MIGRATION_STATUS_CANCELLING) {
> >> +        migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
> >> +    }
> >
> > Hmm, this might have been overlooked from my commit 48814111366b.  Maybe
> > worth a Fixes and copy stable?
> >
> > For example, I would expect the old code (prior of 48814111366b) still be
> > able to fail postcopy and resume src QEMU if qemu_savevm_send_packaged()
> > failed.  Now, looks like it'll be stuck at "device" state..
> >
> > The other thing is it also looks like a common pattern to set FAILED
> > meanwhile not messing with a CANCELLING stage.  It's not easy to always
> > remember this, so maybe we should consider having a helper function?
> >
> >   migrate_set_failure(MigrationState *, Error *err);
> >
> 
> We didn't do it back then because at there would be some logical
> conflict with this series:
> 
> https://lore.kernel.org/r/20250110100707.4805-1-shivam.kumar1@nutanix.com
> 
> But I don't remember the details. If it works this time I'm all for it.

Thanks! I will look into that.

Best regards,

Juraj Marcin

> 
> > Which could set err with migrate_set_error() (likely we could also
> > error_report() the error), and update FAILED iff it's not CANCELLING.
> >
> > I saw three of such occurances that such helper may apply, but worth double
> > check:
> >
> > postcopy_start[2725]           if (ms->state != MIGRATION_STATUS_CANCELLING) {
> > migration_completion[3069]     if (s->state != MIGRATION_STATUS_CANCELLING) {
> > igration_connect[4064]        if (s->state != MIGRATION_STATUS_CANCELLING) {
> >
> > If the cleanup looks worthwhile, and if the Fixes apply, we could have the
> > cleanup patch on top of the fixes patch so patch 1 is easier to backport.
> >
> > Thanks,
> >
> >>      migration_block_activate(NULL);
> >>      migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
> >>      bql_unlock();
> >> -- 
> >> 2.50.1
> >> 
>
Re: [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling
Posted by Juraj Marcin 4 months, 1 week ago
Hi Peter,

On 2025-08-07 16:54, Peter Xu wrote:
> On Thu, Aug 07, 2025 at 01:49:10PM +0200, Juraj Marcin wrote:
> > From: Juraj Marcin <jmarcin@redhat.com>
> > 
> > Depending on where an error during postcopy_start() happens, the state
> > can be either "active", "device" or "cancelling", but never
> > "postcopy-active". Migration state is transitioned to "postcopy-active"
> > only just before a successful return from the function.
> > 
> > Accept any state except "cancelling" when transitioning to "failed"
> > state.
> > 
> > Signed-off-by: Juraj Marcin <jmarcin@redhat.com>
> > ---
> >  migration/migration.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 10c216d25d..e5ce2940d5 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -2872,8 +2872,9 @@ static int postcopy_start(MigrationState *ms, Error **errp)
> >  fail_closefb:
> >      qemu_fclose(fb);
> >  fail:
> > -    migrate_set_state(&ms->state, MIGRATION_STATUS_POSTCOPY_ACTIVE,
> > -                          MIGRATION_STATUS_FAILED);
> > +    if ( ms->state != MIGRATION_STATUS_CANCELLING) {
> > +        migrate_set_state(&ms->state, ms->state, MIGRATION_STATUS_FAILED);
> > +    }
> 
> Hmm, this might have been overlooked from my commit 48814111366b.  Maybe
> worth a Fixes and copy stable?

yeah, it looks like it. POSTCOPY_ACTIVE state used to be set way sooner
before. I'll add Fixes tag to the patch.

> 
> For example, I would expect the old code (prior of 48814111366b) still be
> able to fail postcopy and resume src QEMU if qemu_savevm_send_packaged()
> failed.  Now, looks like it'll be stuck at "device" state..
> 
> The other thing is it also looks like a common pattern to set FAILED
> meanwhile not messing with a CANCELLING stage.  It's not easy to always
> remember this, so maybe we should consider having a helper function?
> 
>   migrate_set_failure(MigrationState *, Error *err);
> 
> Which could set err with migrate_set_error() (likely we could also
> error_report() the error), and update FAILED iff it's not CANCELLING.
> 
> I saw three of such occurances that such helper may apply, but worth double
> check:
> 
> postcopy_start[2725]           if (ms->state != MIGRATION_STATUS_CANCELLING) {
> migration_completion[3069]     if (s->state != MIGRATION_STATUS_CANCELLING) {
> igration_connect[4064]        if (s->state != MIGRATION_STATUS_CANCELLING) {
> 
> If the cleanup looks worthwhile, and if the Fixes apply, we could have the
> cleanup patch on top of the fixes patch so patch 1 is easier to backport.

Such function could be useful. I could also send it with the above fix
together as a separate patchset, and send it also to stable.


Best regards

Juraj Marcin

> 
> Thanks,
> 
> >      migration_block_activate(NULL);
> >      migration_call_notifiers(ms, MIG_EVENT_PRECOPY_FAILED, NULL);
> >      bql_unlock();
> > -- 
> > 2.50.1
> > 
> 
> -- 
> Peter Xu
>
Re: [RFC PATCH 2/4] migration: Fix state transition in postcopy_start() error handling
Posted by Peter Xu 4 months, 1 week ago
On Fri, Aug 08, 2025 at 11:44:36AM +0200, Juraj Marcin wrote:
> Such function could be useful. I could also send it with the above fix
> together as a separate patchset, and send it also to stable.

Yep that works.  The 2nd patch as a cleanup doesn't need to copy stable.

Thanks,

-- 
Peter Xu