[PATCH v2] migration: Handle block device inactivation failures better

Eric Blake posted 1 patch 1 year ago
Patches applied successfully (tree, apply log)
git fetch https://github.com/patchew-project/qemu tags/patchew/20230414153358.1452040-1-eblake@redhat.com
Maintainers: Juan Quintela <quintela@redhat.com>, "Dr. David Alan Gilbert" <dgilbert@redhat.com>
migration/migration.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
[PATCH v2] migration: Handle block device inactivation failures better
Posted by Eric Blake 1 year ago
Consider what happens when performing a migration between two host
machines connected to an NFS server serving multiple block devices to
the guest, when the NFS server becomes unavailable.  The migration
attempts to inactivate all block devices on the source (a necessary
step before the destination can take over); but if the NFS server is
non-responsive, the attempt to inactivate can itself fail.  When that
happens, the destination fails to get the migrated guest (good,
because the source wasn't able to flush everything properly):

  (qemu) qemu-kvm: load of migration failed: Input/output error

at which point, our only hope for the guest is for the source to take
back control.  With the current code base, the source outputs a message but then appears to resume the guest:

  (qemu) qemu-kvm: qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1)

  (src qemu)info status
   VM status: running

but a second migration attempt now asserts:

  (src qemu) qemu-kvm: ../block.c:6738: int bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.

Whether the guest is recoverable on the source after the first failure
is debatable, but what we do not want is to have qemu itself fail due
to an assertion.  It looks like the problem is as follows:

In migration.c:migration_completion(), the source sets 'inactivate' to
true (since COLO is not enabled), then tries
savevm.c:qemu_savevm_state_complete_precopy() with a request to
inactivate block devices.  In turn, this calls
block.c:bdrv_inactivate_all(), which fails when flushing runs up
against the non-responsive NFS server.  With savevm failing, we are
now left in a state where some, but not all, of the block devices have
been inactivated; but migration_completion() then jumps to 'fail'
rather than 'fail_invalidate' and skips an attempt to reclaim those
disks by calling bdrv_activate_all().  Even if we do attempt to
reclaim disks, we aren't taking note of failure there, either.
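
In outline, the pre-patch flow looks roughly like this (heavily
simplified; see migration_completion() itself for the real code):

    if (ret >= 0) {
        qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
        ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
                                                 inactivate);
    }
    if (inactivate && ret >= 0) {
        s->block_inactive = true;   /* never reached on the failure path */
    }
    ...
    if (ret < 0) {
        goto fail;                  /* bypasses fail_invalidate below */
    }
    ...
fail_invalidate:
    /* bdrv_activate_all() lives here, behind a migration-state check */
fail:
    /* no reactivation, and s->block_inactive was never set */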

Thus, we have reached a state where the migration engine has forgotten
all state about whether a block device is inactive, because we did not
set s->block_inactive in enough places; so migration allows the source
to reach vm_start() and resume execution, violating the block layer
invariant that the guest CPUs should not be restarted while a device
is inactive.  Note that the code in migration.c:migrate_fd_cancel()
will also try to reactivate all block devices if s->block_inactive was
set, but because we failed to set that flag after the first failure,
the source assumes it has reclaimed all devices, even though it still
has remaining inactivated devices and does not try again.  Normally,
qmp_cont() will also try to reactivate all disks (or correctly fail if
the disks are not reclaimable because NFS is not yet back up), but the
auto-resumption of the source after a migration failure does not go
through qmp_cont().  And because we have left the block layer in an
inconsistent state with devices still inactivated, the later migration
attempt is hitting the assertion failure.
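
For reference, the reactivation attempt in migrate_fd_cancel() looks
roughly like this (simplified; the real code additionally checks the
migration state), which is why nothing is reclaimed there when
s->block_inactive was never set:

    if (s->block_inactive) {
        Error *local_err = NULL;

        bdrv_activate_all(&local_err);
        if (local_err) {
            error_report_err(local_err);
        } else {
            s->block_inactive = false;
        }
    }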

Since it is important to not resume the source with inactive disks,
this patch marks s->block_inactive before attempting inactivation,
rather than after succeeding, in order to prevent any vm_start() until
it has successfully reactivated all devices.

See also https://bugzilla.redhat.com/show_bug.cgi?id=2058982

Signed-off-by: Eric Blake <eblake@redhat.com>

---

v2: Set s->block_inactive sooner [Juan]
---
 migration/migration.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/migration/migration.c b/migration/migration.c
index bda47891933..cb0d42c0610 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s)
                                             MIGRATION_STATUS_DEVICE);
             }
             if (ret >= 0) {
+                s->block_inactive = inactivate;
                 qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
                 ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
                                                          inactivate);
             }
-            if (inactivate && ret >= 0) {
-                s->block_inactive = true;
-            }
         }
         qemu_mutex_unlock_iothread();

@@ -3522,6 +3520,7 @@ fail_invalidate:
         bdrv_activate_all(&local_err);
         if (local_err) {
             error_report_err(local_err);
+            s->block_inactive = true;
         } else {
             s->block_inactive = false;
         }

base-commit: 7dbd6f8a27e30fe14adb3d5869097cddf24038d6
-- 
2.39.2
Re: [PATCH v2] migration: Handle block device inactivation failures better
Posted by Kevin Wolf 1 year ago
Hi Eric,

you asked me for a review downstream, but since you would have to bring
back any problem to upstream anyway, let's discuss it here. For the
start, let me state that (a) I don't fully understand why this patch
fixes things and (b) I hate this function. More below.

Am 14.04.2023 um 17:33 hat Eric Blake geschrieben:
> Consider what happens when performing a migration between two host
> machines connected to an NFS server serving multiple block devices to
> the guest, when the NFS server becomes unavailable.  The migration
> attempts to inactivate all block devices on the source (a necessary
> step before the destination can take over); but if the NFS server is
> non-responsive, the attempt to inactivate can itself fail.  When that
> happens, the destination fails to get the migrated guest (good,
> because the source wasn't able to flush everything properly):
> 
>   (qemu) qemu-kvm: load of migration failed: Input/output error
> 
> at which point, our only hope for the guest is for the source to take
> back control.  With the current code base, the host outputs a message, but then appears to resume:
> 
>   (qemu) qemu-kvm: qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1)
> 
>   (src qemu)info status
>    VM status: running
> 
> but a second migration attempt now asserts:
> 
>   (src qemu) qemu-kvm: ../block.c:6738: int bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
> 
> Whether the guest is recoverable on the source after the first failure
> is debatable, but what we do not want is to have qemu itself fail due
> to an assertion.  It looks like the problem is as follows:
> 
> In migration.c:migration_completion(), the source sets 'inactivate' to
> true (since COLO is not enabled), then tries
> savevm.c:qemu_savevm_state_complete_precopy() with a request to
> inactivate block devices.  In turn, this calls
> block.c:bdrv_inactivate_all(), which fails when flushing runs up
> against the non-responsive NFS server.  With savevm failing, we are
> now left in a state where some, but not all, of the block devices have
> been inactivated; but migration_completion() then jumps to 'fail'
> rather than 'fail_invalidate' and skips an attempt to reclaim those
> disks by calling bdrv_activate_all().  Even if we do attempt to
> reclaim disks, we aren't taking note of failure there, either.

Why do we even jump to 'fail'? In other words, should 'fail_invalidate'
really be called 'fail', with everything jumping there?

Greg added the 'fail_invalidate' label in fe904ea8242, but the commit
message doesn't seem to explain why he left one goto. I see no reason why
we wouldn't want to reactivate in this case, too. Maybe it's just for
the colo case?

> Thus, we have reached a state where the migration engine has forgotten
> all state about whether a block device is inactive, because we did not
> set s->block_inactive in enough places; so migration allows the source
> to reach vm_start() and resume execution, violating the block layer
> invariant that the guest CPUs should not be restarted while a device
> is inactive.  Note that the code in migration.c:migrate_fd_cancel()
> will also try to reactivate all block devices if s->block_inactive was
> set, but because we failed to set that flag after the first failure,
> the source assumes it has reclaimed all devices, even though it still
> has remaining inactivated devices and does not try again.  Normally,
> qmp_cont() will also try to reactivate all disks (or correctly fail if
> the disks are not reclaimable because NFS is not yet back up), but the
> auto-resumption of the source after a migration failure does not go
> through qmp_cont().  And because we have left the block layer in an
> inconsistent state with devices still inactivated, the later migration
> attempt is hitting the assertion failure.
> 
> Since it is important to not resume the source with inactive disks,
> this patch marks s->block_inactive before attempting inactivation,
> rather than after succeeding, in order to prevent any vm_start() until
> it has successfully reactivated all devices.

Here's the part that I don't understand: Even if you set
s->block_inactive, where do we actually use this value and reactivate
the image?

The only reader of the field is migrate_fd_cancel(), which is only
called by migration_cancel() (a very small wrapper, it's a mystery why
this exists when it's the only caller). migration_cancel() in turn is
called in very few places:

* qmp_migrate_cancel: In our case, migration fails by itself, it's not
  cancelled from QMP. So this is not where we're coming from.

* ram_mig_ram_block_resized: This one is an internal error during
  migration, but what we're seeing is not related to RAM at all. So this
  isn't where we're coming from either.

* migration_shutdown: Only called while shutting down QEMU. Doesn't look
  like our case either.

So while this patch fixes some state inconsistencies, how is it fixing
anything for the reported bug when this state is never used in the
relevant places?

(That I don't understand the fix is what blocks my downstream review.
The rest of my points are really only for upstream anyway.)

> See also https://bugzilla.redhat.com/show_bug.cgi?id=2058982
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>
> 
> ---
> 
> v2: Set s->block_inactive sooner [Juan]
> ---
>  migration/migration.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index bda47891933..cb0d42c0610 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s)
>                                              MIGRATION_STATUS_DEVICE);
>              }
>              if (ret >= 0) {
> +                s->block_inactive = inactivate;
>                  qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
>                  ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
>                                                           inactivate);
>              }
> -            if (inactivate && ret >= 0) {
> -                s->block_inactive = true;
> -            }

This part of the code has now really become unintuitive. After commit
f07fa4cbf0b we had perfectly intuitive code:

    ret = bdrv_inactivate_all();
    if (ret >= 0) {
        s->block_inactive = true;
    }

Since then, the bdrv_inactivate_all() call has been moved to
qemu_savevm_state_complete_precopy_non_iterable(), and now you changed
the order because even on failure, we could end up with some inactivated
nodes. I'm not arguing that either was a bad change, but the assignment
to s->block_inactive looks really random now.

I think this desperately needs a comment.
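
For instance, something along these lines next to the new assignment
would already help (wording is only a suggestion):

    if (ret >= 0) {
        /*
         * Inactivation can fail halfway through and still leave some
         * nodes inactive, so record up front that reactivation is
         * required before the guest may run on the source again.
         */
        s->block_inactive = inactivate;
        ...
    }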

>          }
>          qemu_mutex_unlock_iothread();
> 
> @@ -3522,6 +3520,7 @@ fail_invalidate:
>          bdrv_activate_all(&local_err);
>          if (local_err) {
>              error_report_err(local_err);
> +            s->block_inactive = true;

bdrv_activate_all() never inactivates a node that was active before. So
it seems that this line only ever comes into play if s->block_inactive
was incorrect before.

I feel what we should do here is only try to activate if
s->block_inactive was set above, and then have a single 'fail' label
that always runs the re-activation code.
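
Roughly like this (sketch only, untested):

fail:
    if (s->block_inactive) {
        Error *local_err = NULL;

        qemu_mutex_lock_iothread();
        bdrv_activate_all(&local_err);
        if (local_err) {
            error_report_err(local_err);
        } else {
            s->block_inactive = false;
        }
        qemu_mutex_unlock_iothread();
    }

    migrate_set_state(&s->state, current_active_state,
                      MIGRATION_STATUS_FAILED);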

>          } else {
>              s->block_inactive = false;
>          }

Kevin
Re: [PATCH v2] migration: Handle block device inactivation failures better
Posted by Eric Blake 1 year ago
On Tue, May 02, 2023 at 11:54:12AM +0200, Kevin Wolf wrote:
> Hi Eric,
> 
> you asked me for a review downstream, but since you would have to bring
> back any problem to upstream anyway, let's discuss it here. For the
> start, let me state that (a) I don't fully understand why this patch
> fixes things and (b) I hate this function. More below.

Sadly, I am in the same boat: (a) I don't know whether this patch fully
fixes things or whether we're playing whack-a-mole and another bug is
still lurking, and (b) the control flow is indeed horrendous, and I'm
not sure I fully understand how migration is supposed to work.

> 
> Am 14.04.2023 um 17:33 hat Eric Blake geschrieben:
> > Consider what happens when performing a migration between two host
> > machines connected to an NFS server serving multiple block devices to
> > the guest, when the NFS server becomes unavailable.  The migration
> > attempts to inactivate all block devices on the source (a necessary
> > step before the destination can take over); but if the NFS server is
> > non-responsive, the attempt to inactivate can itself fail.  When that
> > happens, the destination fails to get the migrated guest (good,
> > because the source wasn't able to flush everything properly):
> > 
> >   (qemu) qemu-kvm: load of migration failed: Input/output error
> > 
> > at which point, our only hope for the guest is for the source to take
> > back control.  With the current code base, the host outputs a message, but then appears to resume:
> > 
> >   (qemu) qemu-kvm: qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1)
> > 
> >   (src qemu)info status
> >    VM status: running
> > 
> > but a second migration attempt now asserts:
> > 
> >   (src qemu) qemu-kvm: ../block.c:6738: int bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
> > 
> > Whether the guest is recoverable on the source after the first failure
> > is debatable, but what we do not want is to have qemu itself fail due
> > to an assertion.  It looks like the problem is as follows:
> > 
> > In migration.c:migration_completion(), the source sets 'inactivate' to
> > true (since COLO is not enabled), then tries
> > savevm.c:qemu_savevm_state_complete_precopy() with a request to
> > inactivate block devices.  In turn, this calls
> > block.c:bdrv_inactivate_all(), which fails when flushing runs up
> > against the non-responsive NFS server.  With savevm failing, we are
> > now left in a state where some, but not all, of the block devices have
> > been inactivated; but migration_completion() then jumps to 'fail'
> > rather than 'fail_invalidate' and skips an attempt to reclaim those
> > disks by calling bdrv_activate_all().  Even if we do attempt to
> > reclaim disks, we aren't taking note of failure there, either.
> 
> Why do we even jump to 'fail'? In other words, should 'fail_invalidate'
> really be called 'fail', with everything jumping there?
> 
> Greg added the 'fail_invalidate' label in fe904ea8242, but the commit
> message doesn't seem to explain why he left one goto. I see no reason why
> we wouldn't want to reactivate in this case, too. Maybe it's just for
> the colo case?

At the time of fe904ea8, all actions done by fail_invalidate: were
guarded by a mere
 if (s->state == MIGRATION_STATUS_ACTIVE)
Later, in 6039dd5b1c the guard was expanded to
 if (s->state == MIGRATION_STATUS_ACTIVE || s->state == MIGRATION_STATUS_DEVICE)

But reading the rest of the function (either at those prior commit
points or as it stands today), I'm inclined to agree.  Every remaining
'goto fail' either happens at a point where s->state cannot match in
the first place (so consolidating the labels is harmless), or is the
ONE place after we have already checked s->state ==
MIGRATION_STATUS_ACTIVE or just changed s->state to
MIGRATION_STATUS_DEVICE.  In that one place, the failure can come from
vm_stop_force_state(RUN_STATE_FINISH_MIGRATE) (nothing to reactivate,
as we haven't inactivated yet), from migration_maybe_pause() (where we
change state, but still haven't inactivated), or from
qemu_savevm_state_complete_precopy() (where 'goto fail' does skip the
reactivation, so consolidating the labels would make sense to me,
although at the time I wrote the patch I was too afraid to change the
code that drastically).

Are we sure that bdrv_activate_all() is safe to call no matter what?
We already know that a request to inactivate when something is already
inactive asserts, but a request to activate something that is already
active is generally a no-op.  If so, I'm tending to agree with you
that having a single label where we ALWAYS attempt reactivation after
failure makes sense.

I'll post another patch along those lines.

> 
> > Thus, we have reached a state where the migration engine has forgotten
> > all state about whether a block device is inactive, because we did not
> > set s->block_inactive in enough places; so migration allows the source
> > to reach vm_start() and resume execution, violating the block layer
> > invariant that the guest CPUs should not be restarted while a device
> > is inactive.  Note that the code in migration.c:migrate_fd_cancel()
> > will also try to reactivate all block devices if s->block_inactive was
> > set, but because we failed to set that flag after the first failure,
> > the source assumes it has reclaimed all devices, even though it still
> > has remaining inactivated devices and does not try again.  Normally,
> > qmp_cont() will also try to reactivate all disks (or correctly fail if
> > the disks are not reclaimable because NFS is not yet back up), but the
> > auto-resumption of the source after a migration failure does not go
> > through qmp_cont().  And because we have left the block layer in an
> > inconsistent state with devices still inactivated, the later migration
> > attempt is hitting the assertion failure.
> > 
> > Since it is important to not resume the source with inactive disks,
> > this patch marks s->block_inactive before attempting inactivation,
> > rather than after succeeding, in order to prevent any vm_start() until
> > it has successfully reactivated all devices.
> 
> Here's the part that I don't understand: Even if you set
> s->block_inactive, where do we actually use this value and reactivate
> the image?
> 
> The only reader of the field is migrate_fd_cancel(), which is only
> called by migration_cancel() (a very small wrapper, it's a mystery why
> this exists when it's the only caller). migration_cancel() in turn is
> called in very few places:
> 
> * qmp_migrate_cancel: In our case, migration fails by itself, it's not
>   cancelled from QMP. So this is not where we're coming from.
> 
> * ram_mig_ram_block_resized: This one is an internal error during
>   migration, but what we're seeing is not related to RAM at all. So this
>   isn't where we're coming from either.
> 
> * migration_shutdown: Only called while shutting down QEMU. Doesn't look
>   like our case either.
> 
> So while this patch fixes some state inconsistencies, how is it fixing
> anything for the reported bug when this state is never used in the
> relevant places?
> 
> (That I don't understand the fix is what blocks my downstream review.
> The rest of my points are really only for upstream anyway.)

I may get to respin the downstream patch based on my upcoming upstream
patch...

> 
> > See also https://bugzilla.redhat.com/show_bug.cgi?id=2058982
> > 
> > Signed-off-by: Eric Blake <eblake@redhat.com>
> > 
> > ---
> > 
> > v2: Set s->block_inactive sooner [Juan]
> > ---
> >  migration/migration.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index bda47891933..cb0d42c0610 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s)
> >                                              MIGRATION_STATUS_DEVICE);
> >              }
> >              if (ret >= 0) {
> > +                s->block_inactive = inactivate;
> >                  qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
> >                  ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
> >                                                           inactivate);
> >              }
> > -            if (inactivate && ret >= 0) {
> > -                s->block_inactive = true;
> > -            }
> 
> This part of the code has now really become unintuitive. After commit
> f07fa4cbf0b we had perfectly intuitive code:
> 
>     ret = bdrv_inactivate_all();
>     if (ret >= 0) {
>         s->block_inactive = true;
>     }
> 
> Since then, the bdrv_inactivate_all() call has been moved to
> qemu_savevm_state_complete_precopy_non_iterable(), and now you changed
> the order because even on failure, we could end up with some inactivated
> nodes. I'm not arguing that either was a bad change, but the assignment
> to s->block_inactive looks really random now.
> 
> I think this desperately needs a comment.

Sure, I can add it.

> 
> >          }
> >          qemu_mutex_unlock_iothread();
> > 
> > @@ -3522,6 +3520,7 @@ fail_invalidate:
> >          bdrv_activate_all(&local_err);
> >          if (local_err) {
> >              error_report_err(local_err);
> > +            s->block_inactive = true;
> 
> bdrv_activate_all() never inactivates a node that was active before. So
> it seems that this line only ever comes into play if s->block_inactive
> was incorrect before.
> 
> I feel what we should do here is only try to activate if
> s->block_inactive was set above, and then have a single 'fail' label
> that always runs the re-activation code.
> 
> >          } else {
> >              s->block_inactive = false;
> >          }
> 
> Kevin
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
Re: [PATCH v2] migration: Handle block device inactivation failures better
Posted by Juan Quintela 1 year ago
Eric Blake <eblake@redhat.com> wrote:
> Consider what happens when performing a migration between two host
> machines connected to an NFS server serving multiple block devices to
> the guest, when the NFS server becomes unavailable.  The migration
> attempts to inactivate all block devices on the source (a necessary
> step before the destination can take over); but if the NFS server is
> non-responsive, the attempt to inactivate can itself fail.  When that
> happens, the destination fails to get the migrated guest (good,
> because the source wasn't able to flush everything properly):
>
>   (qemu) qemu-kvm: load of migration failed: Input/output error
>
> at which point, our only hope for the guest is for the source to take
> back control.  With the current code base, the host outputs a message, but then appears to resume:
>
>   (qemu) qemu-kvm: qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1)
>
>   (src qemu)info status
>    VM status: running
>
> but a second migration attempt now asserts:
>
>   (src qemu) qemu-kvm: ../block.c:6738: int bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
>
> Whether the guest is recoverable on the source after the first failure
> is debatable, but what we do not want is to have qemu itself fail due
> to an assertion.  It looks like the problem is as follows:
>
> In migration.c:migration_completion(), the source sets 'inactivate' to
> true (since COLO is not enabled), then tries
> savevm.c:qemu_savevm_state_complete_precopy() with a request to
> inactivate block devices.  In turn, this calls
> block.c:bdrv_inactivate_all(), which fails when flushing runs up
> against the non-responsive NFS server.  With savevm failing, we are
> now left in a state where some, but not all, of the block devices have
> been inactivated; but migration_completion() then jumps to 'fail'
> rather than 'fail_invalidate' and skips an attempt to reclaim those
> disks by calling bdrv_activate_all().  Even if we do attempt to
> reclaim disks, we aren't taking note of failure there, either.
>
> Thus, we have reached a state where the migration engine has forgotten
> all state about whether a block device is inactive, because we did not
> set s->block_inactive in enough places; so migration allows the source
> to reach vm_start() and resume execution, violating the block layer
> invariant that the guest CPUs should not be restarted while a device
> is inactive.  Note that the code in migration.c:migrate_fd_cancel()
> will also try to reactivate all block devices if s->block_inactive was
> set, but because we failed to set that flag after the first failure,
> the source assumes it has reclaimed all devices, even though it still
> has remaining inactivated devices and does not try again.  Normally,
> qmp_cont() will also try to reactivate all disks (or correctly fail if
> the disks are not reclaimable because NFS is not yet back up), but the
> auto-resumption of the source after a migration failure does not go
> through qmp_cont().  And because we have left the block layer in an
> inconsistent state with devices still inactivated, the later migration
> attempt is hitting the assertion failure.
>
> Since it is important to not resume the source with inactive disks,
> this patch marks s->block_inactive before attempting inactivation,
> rather than after succeeding, in order to prevent any vm_start() until
> it has successfully reactivated all devices.
>
> See also https://bugzilla.redhat.com/show_bug.cgi?id=2058982
>
> Signed-off-by: Eric Blake <eblake@redhat.com>

Reviewed-by: Juan Quintela <quintela@redhat.com>

queued.

>
> ---
>
> v2: Set s->block_inactive sooner [Juan]
> ---
>  migration/migration.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index bda47891933..cb0d42c0610 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s)
>                                              MIGRATION_STATUS_DEVICE);
>              }
>              if (ret >= 0) {
> +                s->block_inactive = inactivate;
>                  qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
>                  ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
>                                                           inactivate);
>              }
> -            if (inactivate && ret >= 0) {
> -                s->block_inactive = true;
> -            }
>          }
>          qemu_mutex_unlock_iothread();

And I still have to look at the file to understand this "simple" patch.
(simple in size, not in what it means).

I will add this to my queue, but if you are in the "mood", I would like
to remove the declaration of inactivate and change this to something like:

             if (ret >= 0) {
                 /* COLO doesn't stop disks in normal operation */
                 s->block_inactive = !migrate_colo_enabled();
                 qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
                 ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
                                                          s->block_inactive);
             }

Or something along those lines?

> @@ -3522,6 +3520,7 @@ fail_invalidate:
>          bdrv_activate_all(&local_err);
>          if (local_err) {
>              error_report_err(local_err);
> +            s->block_inactive = true;
>          } else {
>              s->block_inactive = false;
>          }
> base-commit: 7dbd6f8a27e30fe14adb3d5869097cddf24038d6

Just wondering, what git magic creates this line?

Thanks, Juan.
Re: [PATCH v2] migration: Handle block device inactivation failures better
Posted by Eric Blake 1 year ago
On Thu, Apr 20, 2023 at 12:41:25PM +0200, Juan Quintela wrote:
> Eric Blake <eblake@redhat.com> wrote:

...lots of lines...

> > ---
> >  migration/migration.c | 5 ++---
> >  1 file changed, 2 insertions(+), 3 deletions(-)

...describing a tiny change ;)

> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index bda47891933..cb0d42c0610 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s)
> >                                              MIGRATION_STATUS_DEVICE);
> >              }
> >              if (ret >= 0) {
> > +                s->block_inactive = inactivate;
> >                  qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
> >                  ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
> >                                                           inactivate);
> >              }
> > -            if (inactivate && ret >= 0) {
> > -                s->block_inactive = true;
> > -            }
> >          }
> >          qemu_mutex_unlock_iothread();
> 
> And I still have to look at the file to understand this "simple" patch.
> (simple in size, not in what it means).

Indeed - hence the long commit message!

> 
> I will add this to my queue, but if you are in the "mood", I would like
> to remove the declaration of inactivate and change this to something like:
> 
>              if (ret >= 0) {
>                  /* COLO doesn't stop disks in normal operation */
>                  s->block_inactive = !migrate_colo_enabled();
>                  qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
>                  ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
>                                                           s->block_inactive);
>              }
> 
> Or something along those lines?

Yes, that looks like a trivial refactoring that preserves the same
semantics.

> 
> > @@ -3522,6 +3520,7 @@ fail_invalidate:
> >          bdrv_activate_all(&local_err);
> >          if (local_err) {
> >              error_report_err(local_err);
> > +            s->block_inactive = true;
> >          } else {
> >              s->block_inactive = false;
> >          }
> > base-commit: 7dbd6f8a27e30fe14adb3d5869097cddf24038d6
> 
> Just wondering, what git magic creates this line?

git send-email --base=COMMIT_ID

or even 'git config format.useAutoBase whenAble' to try and automate
the use of this.  (If my own git habits were cleaner, always sticking
patches in fresh branches, --base=auto would be handy; but in practice,
I tend to send one-off patches like this in the middle of a 'git
rebase' of a larger series, at which point I'm not on a clean branch
where --base=auto works, so I end up having to manually specify one at
the command line.  Either way, including the base-commit: info is very
useful when applying the patch directly on top of a master that has
moved on in the meantime would hit merge conflicts: it lets you apply
the patch at the stated branch point and then rebase it locally.  I
believe 'git am -3' is even able to exploit the comment, when present,
to make smarter decisions about which parent commit to try for 3-way
patch resolution.)

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
git-format-patch useAutoBase (was Re: [PATCH v2] migration: Handle block ... )
Posted by Juan Quintela 1 year ago
Eric Blake <eblake@redhat.com> wrote:
> On Thu, Apr 20, 2023 at 12:41:25PM +0200, Juan Quintela wrote:
>> Eric Blake <eblake@redhat.com> wrote:
>
> ...lots of lines...
>
>> > ---
>> >  migration/migration.c | 5 ++---
>> >  1 file changed, 2 insertions(+), 3 deletions(-)
>
> ...describing a tiny change ;)
>
>> >
>> > diff --git a/migration/migration.c b/migration/migration.c
>> > index bda47891933..cb0d42c0610 100644
>> > --- a/migration/migration.c
>> > +++ b/migration/migration.c
>> > @@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s)
>> >                                              MIGRATION_STATUS_DEVICE);
>> >              }
>> >              if (ret >= 0) {
>> > +                s->block_inactive = inactivate;
>> >                  qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
>> >                  ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
>> >                                                           inactivate);
>> >              }
>> > -            if (inactivate && ret >= 0) {
>> > -                s->block_inactive = true;
>> > -            }
>> >          }
>> >          qemu_mutex_unlock_iothread();
>> 
>> And I still have to look at the file to understand this "simple" patch.
>> (simple in size, not in what it means).
>
> Indeed - hence the long commit message!
>
>> 
>> I will add this to my queue, but if you are in the "mood", I would like
>> to remove the declaration of inactivate and change this to something like:
>> 
>>              if (ret >= 0) {
>>                  /* COLO doesn't stop disks in normal operation */
>>                  s->block_inactive = !migrate_colo_enabled();
>>                  qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
>>                  ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
>>                                                           s->block_inactive);
>>              }
>> 
>> Or something along those lines?
>
> Yes, that looks like a trivial refactoring that preserves the same
> semantics.
>
>> 
>> > @@ -3522,6 +3520,7 @@ fail_invalidate:
>> >          bdrv_activate_all(&local_err);
>> >          if (local_err) {
>> >              error_report_err(local_err);
>> > +            s->block_inactive = true;
>> >          } else {
>> >              s->block_inactive = false;
>> >          }
>> > base-commit: 7dbd6f8a27e30fe14adb3d5869097cddf24038d6
>> 
>> Just wondering, what git magic creates this line?
>
> git send-email --base=COMMIT_ID
>
> or even 'git config format.useAutoBase whenAble' to try and automate
> the use of this.  (If my own git habits were cleaner, of always
> sticking patches in fresh branches, --base=auto is handy; but in
> practice, I tend to send one-off patches like this in the middle of
> 'git rebase' of a larger series, at which point I'm not on a clean
> branch where --base=auto works, so I end up having to manually specify
> one at the command line.  Either way, including the base-commit: info
> can be very informative for applying a patch at the branch point then
> rebasing it locally, when attempting to apply the patch sent through
> email hits merge conflicts when applying it directly to changes on
> master in the meantime; I believe 'git am -3' is even able to exploit
> the comment when present to make smarter decisions about which parent
> commit it tries for doing 3-way patch resolution)

Thanks a lot.

It does the right thing for "trivial" stuff, i.e. when I send a single
patch or a series against qemu/master.

I am not completely sure that it does the right thing when I send a
series on top of my previous pull request.

097387873b (HEAD -> atomic_counters) migration: Make dirty_bytes_last_sync atomic
3f699a13b2 migration: Make dirty_pages_rate atomic
276a275895 (next) migration: Move qmp_migrate_set_parameters() to options.c
ab13b47801 migration: Move migrate_use_tls() to options.c
ecf5c18eac MAINTAINERS: Add Leonardo and Peter as reviewers
6e5dda696c migration: Disable postcopy + multifd migration
ac5f7bf8e2 (qemu/staging, qemu/master, qemu/HEAD) Merge tag 'migration-20230424-pull-request' of https://gitlab.com/juan.quintela/qemu into staging

where the branches are:

qemu/master: obvious
next: whatever is going to be in the next migration PULL request; I will
      rename this to migration-$date and send that series to the list
      first, i.e. assume those patches are on the list but not yet on master.
HEAD/atomic_counters: This is the series that I am sending


Here is what I did first:

git branch --set-upstream-to=qemu/master

Because that is what the "real" master is, migration-$date is just an
intermediate "state".

git format-patch --cover-letter next -o /tmp/kk

In this case both useAutoBase=whenAble and useAutoBase=true do the same
"right" thing.

From 097387873b2ef1894d5713fdfda8a7b2492476e5 Mon Sep 17 00:00:00 2001
...

Juan Quintela (2):
  migration: Make dirty_pages_rate atomic
  migration: Make dirty_bytes_last_sync atomic

 migration/migration.c | 10 +++++++---
 migration/ram.c       |  8 +++++---
 migration/ram.h       |  4 ++--
 3 files changed, 14 insertions(+), 8 deletions(-)


base-commit: ac5f7bf8e208cd7893dbb1a9520559e569a4677c
prerequisite-patch-id: 08dd37c2ffd8463398e00cade80765b017200b68
prerequisite-patch-id: 3a1d25d5e4f1f615b6e2c6749dcf891959ca48b5
prerequisite-patch-id: 5607c75cc228370df8953987c390682de3093b65
prerequisite-patch-id: ccb4d94973bd111380e4b50f781eeb6cfa7ce5ff

In obvious cases I do the rebase on top of qemu/master, but that is when
there are no dependencies on the PULL request, and that is not the
"interesting" case.

Thanks again, Juan.
Re: [PATCH v2] migration: Handle block device inactivation failures better
Posted by Lukas Straub 1 year ago
On Fri, 14 Apr 2023 10:33:58 -0500
Eric Blake <eblake@redhat.com> wrote:

> Consider what happens when performing a migration between two host
> machines connected to an NFS server serving multiple block devices to
> the guest, when the NFS server becomes unavailable.  The migration
> attempts to inactivate all block devices on the source (a necessary
> step before the destination can take over); but if the NFS server is
> non-responsive, the attempt to inactivate can itself fail.  When that
> happens, the destination fails to get the migrated guest (good,
> because the source wasn't able to flush everything properly):
> 
>   (qemu) qemu-kvm: load of migration failed: Input/output error
> 
> at which point, our only hope for the guest is for the source to take
> back control.  With the current code base, the host outputs a message, but then appears to resume:
> 
>   (qemu) qemu-kvm: qemu_savevm_state_complete_precopy_non_iterable: bdrv_inactivate_all() failed (-1)
> 
>   (src qemu)info status
>    VM status: running
> 
> but a second migration attempt now asserts:
> 
>   (src qemu) qemu-kvm: ../block.c:6738: int bdrv_inactivate_recurse(BlockDriverState *): Assertion `!(bs->open_flags & BDRV_O_INACTIVE)' failed.
> 
> Whether the guest is recoverable on the source after the first failure
> is debatable, but what we do not want is to have qemu itself fail due
> to an assertion.  It looks like the problem is as follows:
> 
> In migration.c:migration_completion(), the source sets 'inactivate' to
> true (since COLO is not enabled), then tries
> savevm.c:qemu_savevm_state_complete_precopy() with a request to
> inactivate block devices.  In turn, this calls
> block.c:bdrv_inactivate_all(), which fails when flushing runs up
> against the non-responsive NFS server.  With savevm failing, we are
> now left in a state where some, but not all, of the block devices have
> been inactivated; but migration_completion() then jumps to 'fail'
> rather than 'fail_invalidate' and skips an attempt to reclaim those
> disks by calling bdrv_activate_all().  Even if we do attempt to
> reclaim disks, we aren't taking note of failure there, either.
> 
> Thus, we have reached a state where the migration engine has forgotten
> all state about whether a block device is inactive, because we did not
> set s->block_inactive in enough places; so migration allows the source
> to reach vm_start() and resume execution, violating the block layer
> invariant that the guest CPUs should not be restarted while a device
> is inactive.  Note that the code in migration.c:migrate_fd_cancel()
> will also try to reactivate all block devices if s->block_inactive was
> set, but because we failed to set that flag after the first failure,
> the source assumes it has reclaimed all devices, even though it still
> has remaining inactivated devices and does not try again.  Normally,
> qmp_cont() will also try to reactivate all disks (or correctly fail if
> the disks are not reclaimable because NFS is not yet back up), but the
> auto-resumption of the source after a migration failure does not go
> through qmp_cont().  And because we have left the block layer in an
> inconsistent state with devices still inactivated, the later migration
> attempt is hitting the assertion failure.
> 
> Since it is important to not resume the source with inactive disks,
> this patch marks s->block_inactive before attempting inactivation,
> rather than after succeeding, in order to prevent any vm_start() until
> it has successfully reactivated all devices.
> 
> See also https://bugzilla.redhat.com/show_bug.cgi?id=2058982
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>

Looks good to me from the COLO side of things.

Acked-by: Lukas Straub <lukasstraub2@web.de>
Tested-by: Lukas Straub <lukasstraub2@web.de>

Best Regards,
Lukas Straub

> ---
> 
> v2: Set s->block_inactive sooner [Juan]
> ---
>  migration/migration.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index bda47891933..cb0d42c0610 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -3444,13 +3444,11 @@ static void migration_completion(MigrationState *s)
>                                              MIGRATION_STATUS_DEVICE);
>              }
>              if (ret >= 0) {
> +                s->block_inactive = inactivate;
>                  qemu_file_set_rate_limit(s->to_dst_file, INT64_MAX);
>                  ret = qemu_savevm_state_complete_precopy(s->to_dst_file, false,
>                                                           inactivate);
>              }
> -            if (inactivate && ret >= 0) {
> -                s->block_inactive = true;
> -            }
>          }
>          qemu_mutex_unlock_iothread();
> 
> @@ -3522,6 +3520,7 @@ fail_invalidate:
>          bdrv_activate_all(&local_err);
>          if (local_err) {
>              error_report_err(local_err);
> +            s->block_inactive = true;
>          } else {
>              s->block_inactive = false;
>          }
> 
> base-commit: 7dbd6f8a27e30fe14adb3d5869097cddf24038d6


