[PATCH] migration: fix abort when re-migrating after cancel during SETUP

marcandre.lureau@redhat.com posted 1 patch 3 days, 18 hours ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20260417184742.293061-1-marcandre.lureau@redhat.com
Maintainers: Peter Xu <peterx@redhat.com>, Fabiano Rosas <farosas@suse.de>
migration/migration.c | 1 +
1 file changed, 1 insertion(+)
[PATCH] migration: fix abort when re-migrating after cancel during SETUP
Posted by marcandre.lureau@redhat.com 3 days, 18 hours ago
From: Marc-André Lureau <marcandre.lureau@redhat.com>

When a migration is cancelled during the early SETUP phase (before
migration_connect_outgoing() has set s->to_dst_file), migration_cancel()
takes a fast path that transitions the state from CANCELLING to
Cancelled without calling migration_cleanup(). This leaves the migration
yank instance registered.

A subsequent qmp_migrate() call passes the migration_is_running() guard
(since CANCELLED is a terminal state) and then calls
yank_register_instance(MIGRATION_YANK_INSTANCE, &error_abort), which
finds the stale entry and aborts with "duplicate yank instance".

  #6  0x000056028b5e17fc in error_setg_internal
      (errp=errp@entry=0x56028cc8cb18 <error_abort>, src=src@entry=0x56028ba87fa5 "../util/yank.c", line=line@entry=87, func=func@entry=0x56028bb77140 <__func__.5> "yank_register_instance", fmt=fmt@entry=0x56028ba87f8d "duplicate yank instance") at ../util/error.c:100
  #7  0x000056028b601b2a in yank_register_instance (instance=instance@entry=0x7ffea0cf36b0, errp=0x56028cc8cb18 <error_abort>) at ../util/yank.c:87
  #8  0x000056028b25221e in migrate_prepare (s=0x5602b0a7db90, resume=<optimized out>, errp=0x7ffea0cf3718) at ../migration/migration.c:2001
  #9  qmp_migrate (uri=<optimized out>, has_channels=<optimized out>, channels=<optimized out>, has_resume=<optimized out>, resume=<optimized out>, errp=errp@entry=0x7ffea0cf3718) at ../migration/migration.c:2039
  #10 0x000056028b5891be in qmp_marshal_migrate (args=<optimized out>, ret=<optimized out>, errp=0x7f63392e0ee0) at qapi/qapi-commands-migration.c:459

Add missing yank_unregister_instance. Alternatively, it seems
migration_cleanup() should be safe in this context too.

Fixes: 624e6e654e11 ("migration: cpr-transfer mode")
Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
---
 migration/migration.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/migration/migration.c b/migration/migration.c
index 5c9aaa6e58f..bbdd91ee7ee 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1477,6 +1477,7 @@ void migration_cancel(void)
                           MIGRATION_STATUS_CANCELLED);
         cpr_state_close();
         cpr_transfer_source_destroy(s);
+        yank_unregister_instance(MIGRATION_YANK_INSTANCE);
     }
 }
 
-- 
2.53.0


Re: [PATCH] migration: fix abort when re-migrating after cancel during SETUP
Posted by Peter Xu 17 hours ago
On Fri, Apr 17, 2026 at 10:47:42PM +0400, marcandre.lureau@redhat.com wrote:
> From: Marc-André Lureau <marcandre.lureau@redhat.com>
> 
> When a migration is cancelled during the early SETUP phase (before
> migration_connect_outgoing() has set s->to_dst_file), migration_cancel()
> takes a fast path that transitions the state from CANCELLING to
> Cancelled without calling migration_cleanup(). This leaves the migration
> yank instance registered.
> 
> A subsequent qmp_migrate() call passes the migration_is_running() guard
> (since CANCELLED is a terminal state) and then calls
> yank_register_instance(MIGRATION_YANK_INSTANCE, &error_abort), which
> finds the stale entry and aborts with "duplicate yank instance".
> 
>   #6  0x000056028b5e17fc in error_setg_internal
>       (errp=errp@entry=0x56028cc8cb18 <error_abort>, src=src@entry=0x56028ba87fa5 "../util/yank.c", line=line@entry=87, func=func@entry=0x56028bb77140 <__func__.5> "yank_register_instance", fmt=fmt@entry=0x56028ba87f8d "duplicate yank instance") at ../util/error.c:100
>   #7  0x000056028b601b2a in yank_register_instance (instance=instance@entry=0x7ffea0cf36b0, errp=0x56028cc8cb18 <error_abort>) at ../util/yank.c:87
>   #8  0x000056028b25221e in migrate_prepare (s=0x5602b0a7db90, resume=<optimized out>, errp=0x7ffea0cf3718) at ../migration/migration.c:2001
>   #9  qmp_migrate (uri=<optimized out>, has_channels=<optimized out>, channels=<optimized out>, has_resume=<optimized out>, resume=<optimized out>, errp=errp@entry=0x7ffea0cf3718) at ../migration/migration.c:2039
>   #10 0x000056028b5891be in qmp_marshal_migrate (args=<optimized out>, ret=<optimized out>, errp=0x7f63392e0ee0) at qapi/qapi-commands-migration.c:459
> 
> Add missing yank_unregister_instance. Alternatively, it seems
> migration_cleanup() should be safe in this context too.
> 
> Fixes: 624e6e654e11 ("migration: cpr-transfer mode")
> Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> ---
>  migration/migration.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index 5c9aaa6e58f..bbdd91ee7ee 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -1477,6 +1477,7 @@ void migration_cancel(void)
>                            MIGRATION_STATUS_CANCELLED);
>          cpr_state_close();
>          cpr_transfer_source_destroy(s);
> +        yank_unregister_instance(MIGRATION_YANK_INSTANCE);
>      }
>  }

Thanks for the report.

I had a feeling that this is not enough.. and I had a feeling that this
special casing was something fast merged for cpr-transfer work to be able
to cancel the HUP source when waiting, however did it wrong.  IOW, at least
this chunk:

    if (setup && !s->to_dst_file) {
        migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
                          MIGRATION_STATUS_CANCELLED);
        cpr_state_close();
        cpr_transfer_source_destroy(s);
    }

Should only apply to cpr-transfer, not normal migrations..

E.g. consider one qmp_migrate() on top of tcp socket:

- qmp_migrate
  - migration_connect_outgoing
    - socket_connect_outgoing
      - ... waiting for worker to invoke socket_outgoing_migration()
- ... some time later
- qmp_migrate_cancel
  - above check hits, update CANCELLING->CANCELLED, unregister yank etc.
- ... some time later
- socket_outgoing_migration() invoked with an error
  - migration_connect_error_propagate() should see CANCELLED state, which
    is illegal.

So IMHO for non-cpr cases we should always get ourselves covered with
migration_connect_error_propagate(), except CPR where it may have the HUP
source registered via migration_connect_error_propagate().

IIUC, maybe we could change this check to be an explicit one, checking
whether cpr_transfer_add_hup_watch() has the source attached but not yet
executed (if executed, migration_connect_outgoing_cb() will also properly
invoke migration_connect_error_propagate()).

We could check against both "mode==CPR_TRANSFER && s->hup_source", but then
we need early release of hup_resource too (currently it is deferred until
migration_cleanup(), which is slightly hackish; after all the source
returns always with G_SOURCE_REMOVE for migration_connect_outgoing_cb, so
it was removed long time ago, rather than until the end of migration).

Thanks,

-- 
Peter Xu


Re: [PATCH] migration: fix abort when re-migrating after cancel during SETUP
Posted by Peter Xu 15 hours ago
On Mon, Apr 20, 2026 at 03:40:27PM -0400, Peter Xu wrote:
> On Fri, Apr 17, 2026 at 10:47:42PM +0400, marcandre.lureau@redhat.com wrote:
> > From: Marc-André Lureau <marcandre.lureau@redhat.com>
> > 
> > When a migration is cancelled during the early SETUP phase (before
> > migration_connect_outgoing() has set s->to_dst_file), migration_cancel()
> > takes a fast path that transitions the state from CANCELLING to
> > Cancelled without calling migration_cleanup(). This leaves the migration
> > yank instance registered.
> > 
> > A subsequent qmp_migrate() call passes the migration_is_running() guard
> > (since CANCELLED is a terminal state) and then calls
> > yank_register_instance(MIGRATION_YANK_INSTANCE, &error_abort), which
> > finds the stale entry and aborts with "duplicate yank instance".
> > 
> >   #6  0x000056028b5e17fc in error_setg_internal
> >       (errp=errp@entry=0x56028cc8cb18 <error_abort>, src=src@entry=0x56028ba87fa5 "../util/yank.c", line=line@entry=87, func=func@entry=0x56028bb77140 <__func__.5> "yank_register_instance", fmt=fmt@entry=0x56028ba87f8d "duplicate yank instance") at ../util/error.c:100
> >   #7  0x000056028b601b2a in yank_register_instance (instance=instance@entry=0x7ffea0cf36b0, errp=0x56028cc8cb18 <error_abort>) at ../util/yank.c:87
> >   #8  0x000056028b25221e in migrate_prepare (s=0x5602b0a7db90, resume=<optimized out>, errp=0x7ffea0cf3718) at ../migration/migration.c:2001
> >   #9  qmp_migrate (uri=<optimized out>, has_channels=<optimized out>, channels=<optimized out>, has_resume=<optimized out>, resume=<optimized out>, errp=errp@entry=0x7ffea0cf3718) at ../migration/migration.c:2039
> >   #10 0x000056028b5891be in qmp_marshal_migrate (args=<optimized out>, ret=<optimized out>, errp=0x7f63392e0ee0) at qapi/qapi-commands-migration.c:459
> > 
> > Add missing yank_unregister_instance. Alternatively, it seems
> > migration_cleanup() should be safe in this context too.
> > 
> > Fixes: 624e6e654e11 ("migration: cpr-transfer mode")
> > Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> > ---
> >  migration/migration.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 5c9aaa6e58f..bbdd91ee7ee 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1477,6 +1477,7 @@ void migration_cancel(void)
> >                            MIGRATION_STATUS_CANCELLED);
> >          cpr_state_close();
> >          cpr_transfer_source_destroy(s);
> > +        yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> >      }
> >  }
> 
> Thanks for the report.
> 
> I had a feeling that this is not enough.. and I had a feeling that this
> special casing was something fast merged for cpr-transfer work to be able
> to cancel the HUP source when waiting, however did it wrong.  IOW, at least
> this chunk:
> 
>     if (setup && !s->to_dst_file) {
>         migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
>                           MIGRATION_STATUS_CANCELLED);
>         cpr_state_close();
>         cpr_transfer_source_destroy(s);
>     }
> 
> Should only apply to cpr-transfer, not normal migrations..
> 
> E.g. consider one qmp_migrate() on top of tcp socket:
> 
> - qmp_migrate
>   - migration_connect_outgoing
>     - socket_connect_outgoing
>       - ... waiting for worker to invoke socket_outgoing_migration()
> - ... some time later
> - qmp_migrate_cancel
>   - above check hits, update CANCELLING->CANCELLED, unregister yank etc.
> - ... some time later
> - socket_outgoing_migration() invoked with an error
>   - migration_connect_error_propagate() should see CANCELLED state, which
>     is illegal.

So the 1st thing I did here is I tried to reproduce this issue with
firewall dropping packets over some port:

$ sudo iptables -A OUTPUT -p tcp --dport 44444 -j DROP

Then I can reproduce this by migrating to port 44444, and this patch does
fix it, but then I verified I can hit the above illegal CANCELLED state
error:

2026-04-20T21:30:21.669131Z migrate_set_state new state setup
2026-04-20T21:30:27.281824Z migrate_set_state new state cancelling
2026-04-20T21:30:27.281862Z migrate_set_state new state cancelled
...
2026-04-20T21:32:37.638382Z qemu-system-x86_64: migration_connect_error_propagate: Illegal migration status (cancelled) detected

After that if I retry with a correct connection migration will actually
succeed.  I believe it's because migration_connect_error_propagate() was
careful on skipping the extra migration_cleanup() when hitting something
weird:

        /*
         * This really shouldn't happen. Just be careful to not crash a VM
         * just for this.  Instead, dump something.
         */
        error_report("%s: Illegal migration status (%s) detected",
                     __func__, MigrationStatus_str(current));
        return;

I'll try out the other approach and share the patch if it will pass all
tests.

> 
> So IMHO for non-cpr cases we should always get ourselves covered with
> migration_connect_error_propagate(), except CPR where it may have the HUP
> source registered via migration_connect_error_propagate().
> 
> IIUC, maybe we could change this check to be an explicit one, checking
> whether cpr_transfer_add_hup_watch() has the source attached but not yet
> executed (if executed, migration_connect_outgoing_cb() will also properly
> invoke migration_connect_error_propagate()).
> 
> We could check against both "mode==CPR_TRANSFER && s->hup_source", but then
> we need early release of hup_resource too (currently it is deferred until
> migration_cleanup(), which is slightly hackish; after all the source
> returns always with G_SOURCE_REMOVE for migration_connect_outgoing_cb, so
> it was removed long time ago, rather than until the end of migration).
> 
> Thanks,
> 
> -- 
> Peter Xu

-- 
Peter Xu


Re: [PATCH] migration: fix abort when re-migrating after cancel during SETUP
Posted by Marc-André Lureau 17 hours ago
Hi

On Mon, Apr 20, 2026 at 11:40 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Apr 17, 2026 at 10:47:42PM +0400, marcandre.lureau@redhat.com wrote:
> > From: Marc-André Lureau <marcandre.lureau@redhat.com>
> >
> > When a migration is cancelled during the early SETUP phase (before
> > migration_connect_outgoing() has set s->to_dst_file), migration_cancel()
> > takes a fast path that transitions the state from CANCELLING to
> > Cancelled without calling migration_cleanup(). This leaves the migration
> > yank instance registered.
> >
> > A subsequent qmp_migrate() call passes the migration_is_running() guard
> > (since CANCELLED is a terminal state) and then calls
> > yank_register_instance(MIGRATION_YANK_INSTANCE, &error_abort), which
> > finds the stale entry and aborts with "duplicate yank instance".
> >
> >   #6  0x000056028b5e17fc in error_setg_internal
> >       (errp=errp@entry=0x56028cc8cb18 <error_abort>, src=src@entry=0x56028ba87fa5 "../util/yank.c", line=line@entry=87, func=func@entry=0x56028bb77140 <__func__.5> "yank_register_instance", fmt=fmt@entry=0x56028ba87f8d "duplicate yank instance") at ../util/error.c:100
> >   #7  0x000056028b601b2a in yank_register_instance (instance=instance@entry=0x7ffea0cf36b0, errp=0x56028cc8cb18 <error_abort>) at ../util/yank.c:87
> >   #8  0x000056028b25221e in migrate_prepare (s=0x5602b0a7db90, resume=<optimized out>, errp=0x7ffea0cf3718) at ../migration/migration.c:2001
> >   #9  qmp_migrate (uri=<optimized out>, has_channels=<optimized out>, channels=<optimized out>, has_resume=<optimized out>, resume=<optimized out>, errp=errp@entry=0x7ffea0cf3718) at ../migration/migration.c:2039
> >   #10 0x000056028b5891be in qmp_marshal_migrate (args=<optimized out>, ret=<optimized out>, errp=0x7f63392e0ee0) at qapi/qapi-commands-migration.c:459
> >
> > Add missing yank_unregister_instance. Alternatively, it seems
> > migration_cleanup() should be safe in this context too.
> >
> > Fixes: 624e6e654e11 ("migration: cpr-transfer mode")
> > Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> > ---
> >  migration/migration.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 5c9aaa6e58f..bbdd91ee7ee 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1477,6 +1477,7 @@ void migration_cancel(void)
> >                            MIGRATION_STATUS_CANCELLED);
> >          cpr_state_close();
> >          cpr_transfer_source_destroy(s);
> > +        yank_unregister_instance(MIGRATION_YANK_INSTANCE);
> >      }
> >  }
>
> Thanks for the report.
>
> I had a feeling that this is not enough.. and I had a feeling that this
> special casing was something fast merged for cpr-transfer work to be able
> to cancel the HUP source when waiting, however did it wrong.  IOW, at least
> this chunk:
>
>     if (setup && !s->to_dst_file) {
>         migrate_set_state(&s->state, MIGRATION_STATUS_CANCELLING,
>                           MIGRATION_STATUS_CANCELLED);
>         cpr_state_close();
>         cpr_transfer_source_destroy(s);
>     }
>
> Should only apply to cpr-transfer, not normal migrations..
>
> E.g. consider one qmp_migrate() on top of tcp socket:
>
> - qmp_migrate
>   - migration_connect_outgoing
>     - socket_connect_outgoing
>       - ... waiting for worker to invoke socket_outgoing_migration()
> - ... some time later
> - qmp_migrate_cancel
>   - above check hits, update CANCELLING->CANCELLED, unregister yank etc.
> - ... some time later
> - socket_outgoing_migration() invoked with an error
>   - migration_connect_error_propagate() should see CANCELLED state, which
>     is illegal.
>
> So IMHO for non-cpr cases we should always get ourselves covered with
> migration_connect_error_propagate(), except CPR where it may have the HUP
> source registered via migration_connect_error_propagate().
>
> IIUC, maybe we could change this check to be an explicit one, checking
> whether cpr_transfer_add_hup_watch() has the source attached but not yet
> executed (if executed, migration_connect_outgoing_cb() will also properly
> invoke migration_connect_error_propagate()).
>
> We could check against both "mode==CPR_TRANSFER && s->hup_source", but then
> we need early release of hup_resource too (currently it is deferred until
> migration_cleanup(), which is slightly hackish; after all the source
> returns always with G_SOURCE_REMOVE for migration_connect_outgoing_cb, so
> it was removed long time ago, rather than until the end of migration).

This is a bit over my head, I am not familiar with CPR. If you can
take it from there, I am fine. I just wanted to fix the crash :)
Re: [PATCH] migration: fix abort when re-migrating after cancel during SETUP
Posted by Peter Xu 16 hours ago
On Mon, Apr 20, 2026 at 11:50:24PM +0400, Marc-André Lureau wrote:
> This is a bit over my head, I am not familiar with CPR. If you can
> take it from there, I am fine. I just wanted to fix the crash :)

Sure :) I'll have a closer look at this.

-- 
Peter Xu