We set CANCELLED very late, which means migration_has_failed() may not work
correctly if it's invoked before CANCELLING is updated to CANCELLED.

Allowing that state makes migration_has_failed() work as expected even
if it's invoked slightly earlier.

One current user is the multifd code for the TLS graceful termination,
which runs before the state is updated to CANCELLED.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
migration/migration.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/migration/migration.c b/migration/migration.c
index 7015c2b5e0..397917b1b3 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1723,7 +1723,8 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
 bool migration_has_failed(MigrationState *s)
 {
-    return (s->state == MIGRATION_STATUS_CANCELLED ||
+    return (s->state == MIGRATION_STATUS_CANCELLING ||
+            s->state == MIGRATION_STATUS_CANCELLED ||
             s->state == MIGRATION_STATUS_FAILED);
 }
--
2.50.1
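
To make the behaviour change concrete, here is a minimal standalone sketch.
It is not the QEMU code: the enum is a trimmed, hypothetical stand-in for
the real migration status enum, and patch_sketch_before() and
patch_sketch_after() are made-up names that model the predicate before and
after this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Trimmed, hypothetical stand-in for the migration status enum. */
typedef enum {
    MIG_ACTIVE,
    MIG_CANCELLING,   /* cancel requested, cleanup not finished yet */
    MIG_CANCELLED,
    MIG_FAILED,
    MIG_COMPLETED,
} MigStatusSketch;

/* Models the predicate before the patch: CANCELLED and FAILED only. */
static bool patch_sketch_before(MigStatusSketch st)
{
    return st == MIG_CANCELLED || st == MIG_FAILED;
}

/* Models the predicate after the patch: CANCELLING is included too. */
static bool patch_sketch_after(MigStatusSketch st)
{
    return st == MIG_CANCELLING ||
           st == MIG_CANCELLED ||
           st == MIG_FAILED;
}

/* The two variants differ only for the CANCELLING state. */
static void patch_sketch_self_check(void)
{
    assert(!patch_sketch_before(MIG_CANCELLING));
    assert(patch_sketch_after(MIG_CANCELLING));
    assert(patch_sketch_before(MIG_FAILED) == patch_sketch_after(MIG_FAILED));
}
```

A caller that runs while the state is still CANCELLING, as the multifd
shutdown path described above does, only sees "failed" with the second
variant.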
Peter Xu <peterx@redhat.com> writes:

> We set CANCELLED very late, which means migration_has_failed() may not work
> correctly if it's invoked before CANCELLING is updated to CANCELLED.
>

The prophecy is fulfilled.

https://wiki.qemu.org/ToDo/LiveMigration#Migration_cancel_concurrency

I'm not sure I'm convinced, for instance, CANCELLING is part of
migration_is_running(), while FAILED is not. This doesn't seem
right. Another point is that CANCELLING is not a final state, so we're
prone to later need a migration_has_finished_failing_now() helper. =)

My mental model is that CANCELLING is a transitional, ongoing state
where we shouldn't really be making assumptions. Once FAILED is reached,
then we're sure in which general state everything is.

How did you catch this? It was one of the cancel tests that failed? I
just noticed that multifd_send_shutdown() is called from
migration_cleanup() before it changes the state to CANCELLED. So current
code also has whatever issue you detected here.

> Allowing that state makes migration_has_failed() work as expected even
> if it's invoked slightly earlier.
>
> One current user is the multifd code for the TLS graceful termination,
> which runs before the state is updated to CANCELLED.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 7015c2b5e0..397917b1b3 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1723,7 +1723,8 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
>
>  bool migration_has_failed(MigrationState *s)
>  {
> -    return (s->state == MIGRATION_STATUS_CANCELLED ||
> +    return (s->state == MIGRATION_STATUS_CANCELLING ||
> +            s->state == MIGRATION_STATUS_CANCELLED ||
>              s->state == MIGRATION_STATUS_FAILED);
>  }

On Wed, Sep 17, 2025 at 05:52:54PM -0300, Fabiano Rosas wrote:
> Peter Xu <peterx@redhat.com> writes:
>
> > We set CANCELLED very late, which means migration_has_failed() may not work
> > correctly if it's invoked before CANCELLING is updated to CANCELLED.
> >
>
> The prophecy is fulfilled.
>
> https://wiki.qemu.org/ToDo/LiveMigration#Migration_cancel_concurrency
>
> I'm not sure I'm convinced, for instance, CANCELLING is part of
> migration_is_running(), while FAILED is not. This doesn't seem
> right. Another point is that CANCELLING is not a final state, so we're
> prone to later need a migration_has_finished_failing_now() helper. =)

Considering we only have two users so far, and the other user doesn't care
about CANCELLING (while the multifd shutdown cares?), I assume it's ok
to treat CANCELLING as "has failed"? :)  I didn't try to interpret "has
failed" in English, but only for the sake of a universal helper that works
for both places.

Or maybe it can be is_failing() too?  I don't have a strong feeling.

>
> My mental model is that CANCELLING is a transitional, ongoing state
> where we shouldn't really be making assumptions. Once FAILED is reached,
> then we're sure in which general state everything is.
>
> How did you catch this? It was one of the cancel tests that failed? I
> just noticed that multifd_send_shutdown() is called from
> migration_cleanup() before it changes the state to CANCELLED. So current
> code also has whatever issue you detected here.

No test failed, it was only by code observation, mentioned below [1],
exactly as you said.

I just think when cancelling the TLS sessions, we shouldn't dump the error
messages anymore even if the bye failed.  Or maybe we simply do not need to
invoke migration_tls_channel_end() when CANCELLING / FAILED?  That's
relevant to your ask on the cover letter, we can discuss there.

This is very trivial.  Let me know what you think.  I can also drop this
patch when reposting v3 but fix the postcopy warning first, which reliably
reproduces now with qtest.

>
> > Allowing that state makes migration_has_failed() work as expected even
> > if it's invoked slightly earlier.
> >
> > One current user is the multifd code for the TLS graceful termination,
> > which runs before the state is updated to CANCELLED.

[1]

> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  migration/migration.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 7015c2b5e0..397917b1b3 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1723,7 +1723,8 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
> >
> >  bool migration_has_failed(MigrationState *s)
> >  {
> > -    return (s->state == MIGRATION_STATUS_CANCELLED ||
> > +    return (s->state == MIGRATION_STATUS_CANCELLING ||
> > +            s->state == MIGRATION_STATUS_CANCELLED ||
> >              s->state == MIGRATION_STATUS_FAILED);
> >  }
>

--
Peter Xu

Peter Xu <peterx@redhat.com> writes:

> On Wed, Sep 17, 2025 at 05:52:54PM -0300, Fabiano Rosas wrote:
>> Peter Xu <peterx@redhat.com> writes:
>>
>> > We set CANCELLED very late, which means migration_has_failed() may not work
>> > correctly if it's invoked before CANCELLING is updated to CANCELLED.
>> >
>>
>> The prophecy is fulfilled.
>>
>> https://wiki.qemu.org/ToDo/LiveMigration#Migration_cancel_concurrency
>>
>> I'm not sure I'm convinced, for instance, CANCELLING is part of
>> migration_is_running(), while FAILED is not. This doesn't seem
>> right. Another point is that CANCELLING is not a final state, so we're
>> prone to later need a migration_has_finished_failing_now() helper. =)
>
> Considering we only have two users so far, and the other user doesn't care
> about CANCELLING (while the multifd shutdown cares?), I assume it's ok
> to treat CANCELLING as "has failed"? :)  I didn't try to interpret "has
> failed" in English, but only for the sake of a universal helper that works
> for both places.
>
> Or maybe it can be is_failing() too?  I don't have a strong feeling.
>

I'm not nitpicking on language. I'm pointing out that CANCELLING is a
transitory state, i.e. from migrate_cancel() until migrate_cleanup(),
while FAILED is a terminal state, nothing happens after it.

But fine, I guess it's really only *my* assumptions being broken and not
the ones in the code.

>>
>> My mental model is that CANCELLING is a transitional, ongoing state
>> where we shouldn't really be making assumptions. Once FAILED is reached,
>> then we're sure in which general state everything is.
>>
>> How did you catch this? It was one of the cancel tests that failed? I
>> just noticed that multifd_send_shutdown() is called from
>> migration_cleanup() before it changes the state to CANCELLED. So current
>> code also has whatever issue you detected here.
>
> No test failed, it was only by code observation, mentioned below [1],
> exactly as you said.
>
> I just think when cancelling the TLS sessions, we shouldn't dump the error
> messages anymore even if the bye failed.

Ok

> Or maybe we simply do not need to
> invoke migration_tls_channel_end() when CANCELLING / FAILED?  That's
> relevant to your ask on the cover letter, we can discuss there.
>
> This is very trivial.

Nah, let me review the patch properly, please.

> Let me know what you think.  I can also drop this
> patch when reposting v3 but fix the postcopy warning first, which reliably
> reproduces now with qtest.
>
>>
>> > Allowing that state makes migration_has_failed() work as expected even
>> > if it's invoked slightly earlier.
>> >
>> > One current user is the multifd code for the TLS graceful termination,
>> > which runs before the state is updated to CANCELLED.
>
> [1]
>
>> >
>> > Signed-off-by: Peter Xu <peterx@redhat.com>
>> > ---
>> >  migration/migration.c | 3 ++-
>> >  1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/migration/migration.c b/migration/migration.c
>> > index 7015c2b5e0..397917b1b3 100644
>> > --- a/migration/migration.c
>> > +++ b/migration/migration.c
>> > @@ -1723,7 +1723,8 @@ int migration_call_notifiers(MigrationState *s, MigrationEventType type,
>> >
>> >  bool migration_has_failed(MigrationState *s)
>> >  {
>> > -    return (s->state == MIGRATION_STATUS_CANCELLED ||
>> > +    return (s->state == MIGRATION_STATUS_CANCELLING ||
>> > +            s->state == MIGRATION_STATUS_CANCELLED ||
>> >              s->state == MIGRATION_STATUS_FAILED);
>> >  }
>>
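
The disagreement in this exchange is about transitional versus terminal
states. One way to keep both readings available, along the lines of the
is_failing() name floated above, might look like the sketch below. Nothing
here is QEMU code: the enum is a trimmed stand-in, and sketch_has_failed(),
sketch_is_failing() and sketch_should_report_bye_error() are hypothetical
names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Trimmed, hypothetical stand-in for the migration status enum. */
typedef enum {
    ST_ACTIVE,
    ST_CANCELLING,   /* transitional: cancel requested, cleanup pending */
    ST_CANCELLED,    /* terminal */
    ST_FAILED,       /* terminal */
    ST_COMPLETED,    /* terminal, success */
} SketchStatus;

/* Terminal error states only: nothing happens after these. */
static bool sketch_has_failed(SketchStatus st)
{
    return st == ST_CANCELLED || st == ST_FAILED;
}

/* Failing or already failed: also true while still CANCELLING. */
static bool sketch_is_failing(SketchStatus st)
{
    return st == ST_CANCELLING || sketch_has_failed(st);
}

/*
 * Hypothetical caller modelled on the TLS termination use case:
 * only report a failed "bye" when the migration is otherwise healthy.
 */
static bool sketch_should_report_bye_error(SketchStatus st, bool bye_ok)
{
    return !bye_ok && !sketch_is_failing(st);
}

/* CANCELLING is "failing" but has not "failed" in this split. */
static void sketch_self_check(void)
{
    assert(sketch_is_failing(ST_CANCELLING));
    assert(!sketch_has_failed(ST_CANCELLING));
}
```

With this split, callers that need a settled state use the terminal-only
predicate, while callers that run during cleanup use the broader one.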