From: Marc-André Lureau <marcandre.lureau@redhat.com>
When a migration is cancelled during the early SETUP phase (before
migration_connect_outgoing() has set s->to_dst_file), migration_cancel()
takes a fast path that transitions the state from CANCELLING to
Cancelled without calling migration_cleanup(). This leaves the migration
yank instance registered.
A subsequent qmp_migrate() call passes the migration_is_running() guard
(since CANCELLED is a terminal state) and then calls
yank_register_instance(MIGRATION_YANK_INSTANCE, &error_abort), which
finds the stale entry and aborts with "duplicate yank instance".
#6 0x000056028b5e17fc in error_setg_internal
(errp=errp@entry=0x56028cc8cb18 <error_abort>, src=src@entry=0x56028ba87fa5 "../util/yank.c", line=line@entry=87, func=func@entry=0x56028bb77140 <__func__.5> "yank_register_instance", fmt=fmt@entry=0x56028ba87f8d "duplicate yank instance") at ../util/error.c:100
#7 0x000056028b601b2a in yank_register_instance (instance=instance@entry=0x7ffea0cf36b0, errp=0x56028cc8cb18 <error_abort>) at ../util/yank.c:87
#8 0x000056028b25221e in migrate_prepare (s=0x5602b0a7db90, resume=<optimized out>, errp=0x7ffea0cf3718) at ../migration/migration.c:2001
#9 qmp_migrate (uri=<optimized out>, has_channels=<optimized out>, channels=<optimized out>, has_resume=<optimized out>, resume=<optimized out>, errp=errp@entry=0x7ffea0cf3718) at ../migration/migration.c:2039
#10 0x000056028b5891be in qmp_marshal_migrate (args=<optimized out>, ret=<optimized out>, errp=0x7f63392e0ee0) at qapi/qapi-commands-migration.c:459
Add missing yank_unregister_instance. Alternatively, it seems
migration_cleanup() should be safe in this context too.
Fixes: 624e6e654e11 ("migration: cpr-transfer mode")
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
---
migration/migration.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/migration/migration.c b/migration/migration.c
index 5c9aaa6e58f..bbdd91ee7ee 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1477,6 +1477,7 @@ void migration_cancel(void)
MIGRATION_STATUS_CANCELLED);
cpr_state_close();
cpr_transfer_source_destroy(s);
+ yank_unregister_instance(MIGRATION_YANK_INSTANCE);
}
}
--
2.53.0
On Fri, Apr 17, 2026 at 10:47:42PM +0400, marcandre.lureau@redhat.com wrote:
> From: Marc-André Lureau <marcandre.lureau@redhat.com>
>
> When a migration is cancelled during the early SETUP phase (before
> migration_connect_outgoing() has set s->to_dst_file), migration_cancel()
> takes a fast path that transitions the state from CANCELLING to
> Cancelled without calling migration_cleanup(). This leaves the migration
> yank instance registered.
>
> A subsequent qmp_migrate() call passes the migration_is_running() guard
> (since CANCELLED is a terminal state) and then calls
> yank_register_instance(MIGRATION_YANK_INSTANCE, &error_abort), which
> finds the stale entry and aborts with "duplicate yank instance".
>
> #6 0x000056028b5e17fc in error_setg_internal
> (errp=errp@entry=0x56028cc8cb18 <error_abort>, src=src@entry=0x56028ba87fa5 "../util/yank.c", line=line@entry=87, func=func@entry=0x56028bb77140 <__func__.5> "yank_register_instance", fmt=fmt@entry=0x56028ba87f8d "duplicate yank instance") at ../util/error.c:100
> #7 0x000056028b601b2a in yank_register_instance (instance=instance@entry=0x7ffea0cf36b0, errp=0x56028cc8cb18 <error_abort>) at ../util/yank.c:87
> #8 0x000056028b25221e in migrate_prepare (s=0x5602b0a7db90, resume=<optimized out>, errp=0x7ffea0cf3718) at ../migration/migration.c:2001
> #9 qmp_migrate (uri=<optimized out>, has_channels=<optimized out>, channels=<optimized out>, has_resume=<optimized out>, resume=<optimized out>, errp=errp@entry=0x7ffea0cf3718) at ../migration/migration.c:2039
> #10 0x000056028b5891be in qmp_marshal_migrate (args=<optimized out>, ret=<optimized out>, errp=0x7f63392e0ee0) at qapi/qapi-commands-migration.c:459
>
> Add missing yank_unregister_instance. Alternatively, it seems
> migration_cleanup() should be safe in this context too.
>
> Fixes: 624e6e654e11 ("migration: cpr-transfer mode")
> Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> ---
> migration/migration.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 5c9aaa6e58f..bbdd91ee7ee 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1477,6 +1477,7 @@ void migration_cancel(void)
> MIGRATION_STATUS_CANCELLED);
> cpr_state_close();
> cpr_transfer_source_destroy(s);
> + yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> }
> }
Thanks for the report.
I had a feeling that this is not enough.. and I had a feeling that this
special casing was something fast merged for cpr-transfer work to be able
to cancel the HUP source when waiting, however did it wrong. IOW, at least
this chunk:
if (setup && !s->to_dst_file) {
migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
MIGRATION_STATUS_CANCELLED);
cpr_state_close();
cpr_transfer_source_destroy(s);
}
Should only apply to cpr-transfer, not normal migrations..
E.g. consider one qmp_migrate() on top of tcp socket:
- qmp_migrate
- migration_connect_outgoing
- socket_connect_outgoing
- ... waiting for worker to invoke socket_outgoing_migration()
- ... some time later
- qmp_migrate_cancel
- above check hits, update CANCELLING->CANCELLED, unregister yank etc.
- ... some time later
- socket_outgoing_migration() invoked with an error
- migration_connect_error_propagate() should see CANCELLED state, which
is illegal.
So IMHO for non-cpr cases we should always get ourselves covered with
migration_connect_error_propagate(), except CPR where it may have the HUP
source registered via migration_connect_error_propagate().
IIUC, maybe we could change this check to be an explicit one, checking
whether cpr_transfer_add_hup_watch() has the source attached but not yet
executed (if executed, migration_connect_outgoing_cb() will also properly
invoke migration_connect_error_propagate()).
We could check against both "mode==CPR_TRANSFER && s->hup_source", but then
we need early release of hup_resource too (currently it is deferred until
migration_cleanup(), which is slightly hackish; after all the source
returns always with G_SOURCE_REMOVE for migration_connect_outgoing_cb, so
it was removed long time ago, rather than until the end of migration).
Thanks,
--
Peter Xu
On Mon, Apr 20, 2026 at 03:40:27PM -0400, Peter Xu wrote:
> On Fri, Apr 17, 2026 at 10:47:42PM +0400, marcandre.lureau@redhat.com wrote:
> > From: Marc-André Lureau <marcandre.lureau@redhat.com>
> >
> > When a migration is cancelled during the early SETUP phase (before
> > migration_connect_outgoing() has set s->to_dst_file), migration_cancel()
> > takes a fast path that transitions the state from CANCELLING to
> > Cancelled without calling migration_cleanup(). This leaves the migration
> > yank instance registered.
> >
> > A subsequent qmp_migrate() call passes the migration_is_running() guard
> > (since CANCELLED is a terminal state) and then calls
> > yank_register_instance(MIGRATION_YANK_INSTANCE, &error_abort), which
> > finds the stale entry and aborts with "duplicate yank instance".
> >
> > #6 0x000056028b5e17fc in error_setg_internal
> > (errp=errp@entry=0x56028cc8cb18 <error_abort>, src=src@entry=0x56028ba87fa5 "../util/yank.c", line=line@entry=87, func=func@entry=0x56028bb77140 <__func__.5> "yank_register_instance", fmt=fmt@entry=0x56028ba87f8d "duplicate yank instance") at ../util/error.c:100
> > #7 0x000056028b601b2a in yank_register_instance (instance=instance@entry=0x7ffea0cf36b0, errp=0x56028cc8cb18 <error_abort>) at ../util/yank.c:87
> > #8 0x000056028b25221e in migrate_prepare (s=0x5602b0a7db90, resume=<optimized out>, errp=0x7ffea0cf3718) at ../migration/migration.c:2001
> > #9 qmp_migrate (uri=<optimized out>, has_channels=<optimized out>, channels=<optimized out>, has_resume=<optimized out>, resume=<optimized out>, errp=errp@entry=0x7ffea0cf3718) at ../migration/migration.c:2039
> > #10 0x000056028b5891be in qmp_marshal_migrate (args=<optimized out>, ret=<optimized out>, errp=0x7f63392e0ee0) at qapi/qapi-commands-migration.c:459
> >
> > Add missing yank_unregister_instance. Alternatively, it seems
> > migration_cleanup() should be safe in this context too.
> >
> > Fixes: 624e6e654e11 ("migration: cpr-transfer mode")
> > Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> > ---
> > migration/migration.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 5c9aaa6e58f..bbdd91ee7ee 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1477,6 +1477,7 @@ void migration_cancel(void)
> > MIGRATION_STATUS_CANCELLED);
> > cpr_state_close();
> > cpr_transfer_source_destroy(s);
> > + yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> > }
> > }
>
> Thanks for the report.
>
> I had a feeling that this is not enough.. and I had a feeling that this
> special casing was something fast merged for cpr-transfer work to be able
> to cancel the HUP source when waiting, however did it wrong. IOW, at least
> this chunk:
>
> if (setup && !s->to_dst_file) {
> migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
> MIGRATION_STATUS_CANCELLED);
> cpr_state_close();
> cpr_transfer_source_destroy(s);
> }
>
> Should only apply to cpr-transfer, not normal migrations..
>
> E.g. consider one qmp_migrate() on top of tcp socket:
>
> - qmp_migrate
> - migration_connect_outgoing
> - socket_connect_outgoing
> - ... waiting for worker to invoke socket_outgoing_migration()
> - ... some time later
> - qmp_migrate_cancel
> - above check hits, update CANCELLING->CANCELLED, unregister yank etc.
> - ... some time later
> - socket_outgoing_migration() invoked with an error
> - migration_connect_error_propagate() should see CANCELLED state, which
> is illegal.
So the 1st thing I did here is I tried to reproduce this issue with
firewall dropping packets over some port:
$ sudo iptables -A OUTPUT -p tcp --dport 44444 -j DROP
Then I can reproduce this by migrating to port 44444, and this patch does
fix it, but then I verified I can hit the above illegal CANCELLED state
error:
2026-04-20T21:30:21.669131Z migrate_set_state new state setup
2026-04-20T21:30:27.281824Z migrate_set_state new state cancelling
2026-04-20T21:30:27.281862Z migrate_set_state new state cancelled
...
2026-04-20T21:32:37.638382Z qemu-system-x86_64: migration_connect_error_propagate: Illegal migration status (cancelled) detected
After that if I retry with a correct connection migration will actually
succeed. I believe it's because migration_connect_error_propagate() was
careful on skipping the extra migration_cleanup() when hitting something
weird:
/*
* This really shouldn't happen. Just be careful to not crash a VM
* just for this. Instead, dump something.
*/
error_report("%s: Illegal migration status (%s) detected",
__func__, MigrationStatus_str(current));
return;
I'll try out the other approach and share the patch if it will pass all
tests.
>
> So IMHO for non-cpr cases we should always get ourselves covered with
> migration_connect_error_propagate(), except CPR where it may have the HUP
> source registered via migration_connect_error_propagate().
>
> IIUC, maybe we could change this check to be an explicit one, checking
> whether cpr_transfer_add_hup_watch() has the source attached but not yet
> executed (if executed, migration_connect_outgoing_cb() will also properly
> invoke migration_connect_error_propagate()).
>
> We could check against both "mode==CPR_TRANSFER && s->hup_source", but then
> we need early release of hup_resource too (currently it is deferred until
> migration_cleanup(), which is slightly hackish; after all the source
> returns always with G_SOURCE_REMOVE for migration_connect_outgoing_cb, so
> it was removed long time ago, rather than until the end of migration).
>
> Thanks,
>
> --
> Peter Xu
--
Peter Xu
Hi
On Mon, Apr 20, 2026 at 11:40 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Apr 17, 2026 at 10:47:42PM +0400, marcandre.lureau@redhat.com wrote:
> > From: Marc-André Lureau <marcandre.lureau@redhat.com>
> >
> > When a migration is cancelled during the early SETUP phase (before
> > migration_connect_outgoing() has set s->to_dst_file), migration_cancel()
> > takes a fast path that transitions the state from CANCELLING to
> > Cancelled without calling migration_cleanup(). This leaves the migration
> > yank instance registered.
> >
> > A subsequent qmp_migrate() call passes the migration_is_running() guard
> > (since CANCELLED is a terminal state) and then calls
> > yank_register_instance(MIGRATION_YANK_INSTANCE, &error_abort), which
> > finds the stale entry and aborts with "duplicate yank instance".
> >
> > #6 0x000056028b5e17fc in error_setg_internal
> > (errp=errp@entry=0x56028cc8cb18 <error_abort>, src=src@entry=0x56028ba87fa5 "../util/yank.c", line=line@entry=87, func=func@entry=0x56028bb77140 <__func__.5> "yank_register_instance", fmt=fmt@entry=0x56028ba87f8d "duplicate yank instance") at ../util/error.c:100
> > #7 0x000056028b601b2a in yank_register_instance (instance=instance@entry=0x7ffea0cf36b0, errp=0x56028cc8cb18 <error_abort>) at ../util/yank.c:87
> > #8 0x000056028b25221e in migrate_prepare (s=0x5602b0a7db90, resume=<optimized out>, errp=0x7ffea0cf3718) at ../migration/migration.c:2001
> > #9 qmp_migrate (uri=<optimized out>, has_channels=<optimized out>, channels=<optimized out>, has_resume=<optimized out>, resume=<optimized out>, errp=errp@entry=0x7ffea0cf3718) at ../migration/migration.c:2039
> > #10 0x000056028b5891be in qmp_marshal_migrate (args=<optimized out>, ret=<optimized out>, errp=0x7f63392e0ee0) at qapi/qapi-commands-migration.c:459
> >
> > Add missing yank_unregister_instance. Alternatively, it seems
> > migration_cleanup() should be safe in this context too.
> >
> > Fixes: 624e6e654e11 ("migration: cpr-transfer mode")
> > Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> > ---
> > migration/migration.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 5c9aaa6e58f..bbdd91ee7ee 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1477,6 +1477,7 @@ void migration_cancel(void)
> > MIGRATION_STATUS_CANCELLED);
> > cpr_state_close();
> > cpr_transfer_source_destroy(s);
> > + yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> > }
> > }
>
> Thanks for the report.
>
> I had a feeling that this is not enough.. and I had a feeling that this
> special casing was something fast merged for cpr-transfer work to be able
> to cancel the HUP source when waiting, however did it wrong. IOW, at least
> this chunk:
>
> if (setup && !s->to_dst_file) {
> migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
> MIGRATION_STATUS_CANCELLED);
> cpr_state_close();
> cpr_transfer_source_destroy(s);
> }
>
> Should only apply to cpr-transfer, not normal migrations..
>
> E.g. consider one qmp_migrate() on top of tcp socket:
>
> - qmp_migrate
> - migration_connect_outgoing
> - socket_connect_outgoing
> - ... waiting for worker to invoke socket_outgoing_migration()
> - ... some time later
> - qmp_migrate_cancel
> - above check hits, update CANCELLING->CANCELLED, unregister yank etc.
> - ... some time later
> - socket_outgoing_migration() invoked with an error
> - migration_connect_error_propagate() should see CANCELLED state, which
> is illegal.
>
> So IMHO for non-cpr cases we should always get ourselves covered with
> migration_connect_error_propagate(), except CPR where it may have the HUP
> source registered via migration_connect_error_propagate().
>
> IIUC, maybe we could change this check to be an explicit one, checking
> whether cpr_transfer_add_hup_watch() has the source attached but not yet
> executed (if executed, migration_connect_outgoing_cb() will also properly
> invoke migration_connect_error_propagate()).
>
> We could check against both "mode==CPR_TRANSFER && s->hup_source", but then
> we need early release of hup_resource too (currently it is deferred until
> migration_cleanup(), which is slightly hackish; after all the source
> returns always with G_SOURCE_REMOVE for migration_connect_outgoing_cb, so
> it was removed long time ago, rather than until the end of migration).
This is a bit over my head, I am not familiar with CPR. If you can
take it from there, I am fine. I just wanted to fix the crash :)
On Mon, Apr 20, 2026 at 11:50:24PM +0400, Marc-André Lureau wrote: > This is a bit over my head, I am not familiar with CPR. If you can > take it from there, I am fine. I just wanted to fix the crash :) Sure :) I'll have a closer look at this. -- Peter Xu
© 2016 - 2026 Red Hat, Inc.